Using your code from Part 1 as a template, write a program runSQL that executes a given SQL statement on a cluster of computers, each running an instance of a DBMS. The input to runSQL consists of two filenames (stored in variables clustercfg and sqlfile) passed in as command-line arguments.
The file clustercfg contains access information for the catalog DB. The file sqlfile contains the SQL, terminated by a semicolon, to be executed. The runSQL program will execute the same SQL concurrently, using threads, on the database instance of each computer in the cluster that holds data fragments for the table. One thread should be spawned for each computer in the cluster.
The runSQL program should output the rows retrieved to the standard output on success, or report failure.
You may assume that the SQL queries operate only on single tables and do not contain any nested subqueries. You should consider using the ANTLR compiler-compiler to generate an SQL parser that you can use to extract the table name. You may test your program on a single computer by using different databases to simulate the multiple computers.
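The structure described above — extract the table name, consult the catalog, then run the SQL on every relevant node in its own thread — could be sketched as below. This is one possible shape, not the required one: the regex-based table-name extraction is a simple stand-in for the suggested ANTLR parser and only handles plain single-table queries, and the class name is illustrative. The dtables column names follow the catalog schema given later in this handout.

```java
import java.sql.*;
import java.util.*;
import java.util.regex.*;

public class RunSqlSketch {

    // Stand-in for an ANTLR-generated parser: pull the first identifier
    // after FROM. Only works for simple single-table queries.
    static String extractTableName(String sql) {
        Matcher m = Pattern.compile("\\bFROM\\s+(\\w+)", Pattern.CASE_INSENSITIVE)
                           .matcher(sql);
        return m.find() ? m.group(1) : null;
    }

    // Run the SQL on one node; print rows on success, report failure otherwise.
    static void runOnNode(String driver, String url, String user, String pw, String sql) {
        try {
            Class.forName(driver);
            try (Connection c = DriverManager.getConnection(url, user, pw);
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(sql)) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= md.getColumnCount(); i++)
                        row.append(rs.getString(i)).append(' ');
                    System.out.println(row.toString().trim());
                }
            }
            System.out.println("[" + url + "]: success.");
        } catch (Exception e) {
            System.out.println("[" + url + "]: failed.");
        }
    }

    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        cfg.load(new java.io.FileReader(args[0]));           // clustercfg
        String sql = new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(args[1]))).trim();   // sqlfile
        String table = extractTableName(sql);

        // Look up every node holding a fragment of the table in the catalog.
        Class.forName(cfg.getProperty("catalog.driver"));
        List<Thread> threads = new ArrayList<>();
        try (Connection cat = DriverManager.getConnection(
                 cfg.getProperty("catalog.hostname"),
                 cfg.getProperty("catalog.username"),
                 cfg.getProperty("catalog.passwd"));
             PreparedStatement ps = cat.prepareStatement(
                 "SELECT nodedriver, nodeurl, nodeuser, nodepasswd"
                 + " FROM dtables WHERE tname = ?")) {
            ps.setString(1, table);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String d = rs.getString(1), u = rs.getString(2),
                           us = rs.getString(3), pw = rs.getString(4);
                    threads.add(new Thread(() -> runOnNode(d, u, us, pw, sql)));
                }
            }
        }
        for (Thread t : threads) t.start();   // one thread per node
        for (Thread t : threads) t.join();
    }
}
```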
clustercfg file:
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
sqlfile file:
SELECT *
FROM Books;
./run3.sh ./cluster.cfg ./books.sql
123323232 Database Systems Ramakrishnan
234323423 Operating Systems Silberstein
[jdbc:db2://10.0.0.3:50001/mydb2]: ./books.sql success.
[jdbc:db2://10.0.0.3:50001/mydb1]: ./books.sql failed.
Write a program loadCSV that loads data from a comma-separated values (CSV) file into a distributed table on the cluster. The program takes two command-line arguments, clustercfg and csvfile. The clustercfg file contains access information for the catalog DB, the name of the table to be loaded, and the partitioning information. The csvfile contains the data to be loaded. The catalog should be consulted for access information for the nodes in the cluster. Your program should also update the catalog with the partitioning information. The loader does NOT need to be multi-threaded. You should use a library for parsing CSV instead of writing your own from scratch.
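The loader's clustercfg can be read with java.util.Properties, since the example config files below use the standard key=value format. The sketch below shows one way to wrap it; the key names follow the example configs in this handout, and the class name is illustrative.

```java
import java.io.*;
import java.util.*;

// Thin wrapper over java.util.Properties for the loader's clustercfg.
// Key names match the example config files in the handout.
public class LoaderConfig {
    final Properties p = new Properties();

    LoaderConfig(Reader r) throws IOException { p.load(r); }

    String tableName()       { return p.getProperty("tablename"); }
    String partitionMethod() { return p.getProperty("partition.method"); }
    String partitionColumn() { return p.getProperty("partition.column"); }
    int numNodes()           { return Integer.parseInt(p.getProperty("numnodes", "0")); }

    // Range partitioning: per-node bounds, 1-based node index,
    // which = 1 for param1 (lower bound) or 2 for param2 (upper bound).
    String rangeParam(int node, int which) {
        return p.getProperty("partition.node" + node + ".param" + which);
    }
}
```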
dtables(tname char(32),
nodedriver char(64),
nodeurl char(128),
nodeuser char(16),
nodepasswd char(16),
partmtd int,
nodeid int,
partcol char(32),
partparam1 char(32),
partparam2 char(32))
where
tname is the table name involved in the DDL operation.
nodeid is the node number associated with this node.
nodedriver is the driver used to connect to the node in the cluster for this entry.
nodeurl is the JDBC URL of the node in the cluster for this entry.
nodeuser and nodepasswd are the userid and password of the DBMS instance at the node in the cluster for this entry.
partmtd is the partition method used to partition the data in the table.
partcol is the name of the column used by the partition method to partition the data in the table.
partparam1 and partparam2 are parameters associated with the particular partition method.
The partmtd column should take the value 0 for no partitioning, 1 for range partitioning, or 2 for hash partitioning. Note that you will need to convert the partition method specified as a string in the config file to the integer partition method stored in dtables. The values in the config file corresponding to partmtd 0, 1, and 2 are notpartition, range, and hash respectively.
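The string-to-integer conversion is small; a sketch (class and method names are illustrative):

```java
// Map the partition.method string from the config file to the integer
// partmtd value stored in dtables: notpartition=0, range=1, hash=2.
public class PartMtd {
    public static int toPartMtd(String method) {
        switch (method.toLowerCase()) {
            case "notpartition": return 0;
            case "range":        return 1;
            case "hash":         return 2;
            default:
                throw new IllegalArgumentException("unknown partition.method: " + method);
        }
    }
}
```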
If the partition method is zero, i.e., no partitioning, then the entire CSV file is inserted into the table at every node.
For range partitioning, the rows that should be inserted into partition X should have a value in the partcol column between the minimum and maximum of the range for X:
partparam1 < partcol <= partparam2.
At the boundary ranges, partparam1 and partparam2 may take the special values -inf and +inf.
For hash partitioning, a row should be inserted into partition X where
X = (partcol mod partparam1) + 1.
The plus one handles the fact that our partition/node numbers start from 1 instead of 0. You may assume that only numeric columns will be partitioned for this assignment.
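The two routing rules above can be sketched as follows. Class and method names are illustrative; bounds are parsed as doubles so that the special -inf/+inf values map onto Java's infinities.

```java
// Choose the 1-based target partition for one row's partcol value.
public class PartitionChooser {

    // Parse a range bound, honoring the special -inf/+inf values.
    static double parseBound(String s) {
        if (s.equals("-inf")) return Double.NEGATIVE_INFINITY;
        if (s.equals("+inf")) return Double.POSITIVE_INFINITY;
        return Double.parseDouble(s);
    }

    // Range: return the 1-based index of the node whose (param1, param2]
    // interval contains v, or -1 if no interval matches.
    static int rangePartition(long v, String[] param1, String[] param2) {
        for (int i = 0; i < param1.length; i++) {
            if (v > parseBound(param1[i]) && v <= parseBound(param2[i]))
                return i + 1;
        }
        return -1;
    }

    // Hash: X = (v mod param1) + 1.
    static int hashPartition(long v, long param1) {
        return (int) (v % param1) + 1;
    }
}
```

Using the range example config below (node1: 1..10, node2: 10..20), a row with age 10 goes to node 1 because the upper bound is inclusive, and a row with age 15 goes to node 2.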
The number of nodes in the dtables relation and the number of partitions in the config file should match. If not, the program should return an error message and exit. You may assume that the CSV file is error-free (i.e., every row has the same number of columns of the right type).
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
tablename=books
partition.method=range
partition.column=age
numnodes=2
partition.node1.param1=1
partition.node1.param2=10
partition.node2.param1=10
partition.node2.param2=20
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
tablename=books
partition.method=hash
partition.column=age
partition.param1=2
123323232,"Database Systems","Ramakrishnan,Raghu"
234323423,"Operating Systems","Silberstein, Adam"
./run4.sh ./cluster.cfg ./books.csv
[jdbc:db2://10.0.0.3:50001/mydb2]: 10 rows inserted.
[jdbc:db2://10.0.0.3:50001/mydb1]: 0 rows inserted.
[jdbc:db2://10.0.0.3:50001/mycatdb]: catalog updated.
Convert your sequential distributed loader into a multi-threaded one. Note that you will need to use synchronization or locking mechanisms (see Java synchronization).
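One possible design (an assumption, not necessarily the intended one): the main thread reads and routes CSV rows into a per-node BlockingQueue, and one worker thread per node drains its queue and inserts into that node's database. A poison-pill marker tells the workers to stop. In the sketch below the JDBC insert is replaced by a counter so the structure can run without a database; the routing decisions are passed in precomputed.

```java
import java.util.*;
import java.util.concurrent.*;

// Multi-threaded loader skeleton: producer routes rows into per-node
// queues; one consumer thread per node "inserts" them (counted here,
// JDBC in the real loader). BlockingQueue supplies the synchronization.
public class LoaderSketch {
    static final String POISON = "__END__";   // sentinel telling a worker to stop

    // rows[i] is routed to node targetNode[i] (1-based). Returns the
    // number of rows each node received.
    public static int[] load(String[] rows, int[] targetNode, int numNodes)
            throws InterruptedException {
        List<BlockingQueue<String>> queues = new ArrayList<>();
        int[] inserted = new int[numNodes];
        Thread[] workers = new Thread[numNodes];

        for (int n = 0; n < numNodes; n++) {
            queues.add(new LinkedBlockingQueue<>());
            final int node = n;
            workers[n] = new Thread(() -> {
                try {
                    while (true) {
                        String row = queues.get(node).take();
                        if (row.equals(POISON)) break;
                        inserted[node]++;   // real loader: execute INSERT via JDBC here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[n].start();
        }

        for (int i = 0; i < rows.length; i++)
            queues.get(targetNode[i] - 1).put(rows[i]);   // route by 1-based node id
        for (BlockingQueue<String> q : queues) q.put(POISON);
        for (Thread t : workers) t.join();    // join gives a happens-before on inserted[]
        return inserted;
    }
}
```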
You should test your code against a variety of data. The following DDL and test data from the TPC-H benchmark is provided as a starting point.
CREATE TABLE ORDERS (
O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL);
A 16 MB data file for ORDERS is available here.
1) [50 pts] A document describing how to use the runSQL and loadCSV programs and what each program does technically. The document should:
2) [20 pts] Code, config files & scripts (compile3.sh and run3.sh) for Part 3. Submission of ANTLR grammar files is optional, but you must submit the generated files in order for your code to compile and run. Your code will be tested using MySQL databases.
3) [30 pts] Code, config files & scripts (compile4.sh and run4.sh) for Part 4. Your code will be tested using MySQL.
4) [extra 20 pts] Code, config files & scripts (compile.sh and run.sh) for Part 4X. Your code will be tested using MySQL.
1) Writing: submit a link to the google doc in Laulima->Assignments. Please ensure that you share your google doc with the instructor. Every individual should make a submission.
2) Code: You will submit your code by uploading it onto the submission server (you can also clone it from GitHub).
Create a directory called hw2
in your home directory and
put your code in that directory. You need to ensure that
your code compiles and runs on the submission server. If you
need any additional libraries, please contact the TA to
install the libraries. Every group needs to make one
submission.
Access the submission server at IP address: 104.196.234.107
compile.sh
#!/bin/bash
jdbclib="/usr/local/lib/mysql-connector-java-5.1.40-bin.jar"
antlrlib="/usr/local/lib/antlr-4.6.jar"
mkdir -p ./bin
javac -d ./bin -classpath "$antlrlib:$jdbclib:." ./src/*.java
run.sh
#!/bin/bash
jdbclib="/usr/local/lib/mysql-connector-java-5.1.40-bin.jar"
antlrlib="/usr/local/lib/antlr-4.6.jar"
if [ "$#" -ne 2 ]
then
echo "Usage: run3.sh <configfile> <sqlfile>"
exit 1
fi
java -classpath "$jdbclib:$antlrlib:./bin" RunSQL "$1" "$2"