ICS 421 - Spring 2010 - Programming Assignment 2
Updated Sat Feb 27. "nopartition" changed to "notpartition"
You may work in a team of two students, but each student must make a
submission. You are encouraged to engage in general discussions with other
teams regarding the assignment, but the specific details of a solution, including
the solution itself, must always be the team's own work. You may submit the same
code as the rest of your team.
Part 3: Naive Distributed SQL Processor (40 pts)
Using your code from Part 1 as a template, write a program
runSQL that executes a given SQL statement on a
cluster of computers, each running an instance
of a DBMS. The input to runSQL consists of two filenames
(stored in the variables clustercfg and sqlfile) passed in
as command-line arguments. The file clustercfg contains
access information for the catalog DB. The
file sqlfile contains the SQL statement, terminated by a
semicolon, to be executed. The runSQL program executes the same
SQL concurrently, using threads, on the database instance of each
computer in the cluster that holds data fragments for the table.
One thread should be spawned for each computer in the cluster.
runSQL should print the retrieved rows to
standard output on success, or report failure.
You may assume that the SQL queries operate only on
single tables and do not contain any nested subqueries.
You should consider using the ANTLR parser generator to build a SQL parser that you can use to extract the table name.
You may test your program on a single computer by using
different databases to simulate the multiple computers.
Sample contents of a clustercfg file
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
Sample contents of a sqlfile file
select * from books
;
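The two input files above might be read with java.util.Properties (clustercfg uses key=value syntax) plus a small helper that strips the trailing semicolon before the SQL is handed to JDBC. A minimal sketch, assuming a helper class named ConfigReader (the name is illustrative, not required):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class ConfigReader {
    // Load key=value pairs such as catalog.driver and catalog.hostname.
    static Properties loadClusterCfg(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    // Read the SQL file and strip the trailing semicolon, since JDBC
    // Statement.executeQuery() does not expect a statement terminator.
    static String loadSql(String path) throws IOException {
        String sql = new String(Files.readAllBytes(Paths.get(path))).trim();
        if (sql.endsWith(";")) {
            sql = sql.substring(0, sql.length() - 1).trim();
        }
        return sql;
    }
}
```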
Sample Output of `./run3.sh ./cluster.cfg ./books.sql`
123323232 Database Systems Ramakrishnan
234323423 Operating Systems Silberstein
[jdbc:db2://10.0.0.3:50001/mydb2]: ./books.sql success.
[jdbc:db2://10.0.0.3:50001/mydb1]: ./books.sql failed.
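The one-thread-per-computer pattern described above can be sketched as follows. The JDBC connect-and-execute work is abstracted behind Runnable so the threading structure stands on its own; ParallelRunner and runOnAllNodes are illustrative names, not part of the assignment:

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelRunner {
    // Spawn one worker thread per node; each worker runs the SQL against its
    // node's DBMS instance and reports success or failure (that JDBC step is
    // hidden inside the Runnable here). join() waits for all nodes to finish
    // before runSQL exits.
    static void runOnAllNodes(List<Runnable> nodeTasks) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (Runnable task : nodeTasks) {
            Thread t = new Thread(task);
            t.start();                 // one thread per computer in the cluster
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();                  // wait for every node to finish
        }
    }
}
```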
Part 4: Distributed Loader (60 pts)
Write a program loadCSV that loads
data from a comma-separated values (CSV) file into a distributed
table on the cluster. The program takes two command-line
arguments, clustercfg and csvfile. The
clustercfg file contains access information for the
catalog DB, the name of the table to be loaded, and the
partitioning information. The csvfile contains the
data to be loaded. The catalog should be consulted for
access information for the nodes in the cluster.
Your program should also update the catalog with the
partitioning information. The loader does NOT need to
be multi-threaded. The catalog stores the metadata for distributed tables in the following dtables relation:
dtables(tname char(32),
nodedriver char(64),
nodeurl char(128),
nodeuser char(16),
nodepasswd char(16),
partmtd int,
nodeid int,
partcol char(32),
partparam1 char(32),
partparam2 char(32))
where
- tname is the table name involved in the DDL operation.
- nodeid is the node number of the node in the cluster.
- nodedriver is the driver used to connect to
the node in the cluster for this entry.
- nodeurl is the JDBC URL of the node in the
cluster for this entry.
- nodeuser and nodepasswd are the userid
and password of the DBMS instance at the node in the cluster
for this entry.
- partmtd is the partition method used to
partition the data in the table:
  - 0: not partitioned at all
  - 1: range partition: param1 and param2 correspond to the min and max of the range
  - 2: hash partition using mod for numeric types
Note that you will need to convert the partition method, specified as a string
in the config file, to the integer partition method stored in dtables. The
config-file values corresponding to partmtd 0, 1, and 2 are notpartition,
range, and hash, respectively.
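A minimal sketch of that string-to-integer conversion (the class and method names are illustrative):

```java
public class PartitionMethod {
    // Map the partition.method string from clustercfg to the integer code
    // stored in the dtables catalog (partmtd column).
    static int toCode(String method) {
        switch (method) {
            case "notpartition": return 0;   // table replicated to every node
            case "range":        return 1;   // range partition on partcol
            case "hash":         return 2;   // hash partition via mod
            default:
                throw new IllegalArgumentException("unknown partition method: " + method);
        }
    }
}
```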
- partcol is the name of the column used for partitioning
- partparam1 and partparam2 are
parameters associated with the particular partition method
If the partition method is zero, i.e., not partitioned, then the entire CSV file is inserted into the table at every node.
For range partitioning, a row should be inserted into partition X when its
partcol value falls within the range for X:
partparam1 < partcol <= partparam2.
At the boundary ranges, partparam1 and partparam2 may take the special values
-inf and +inf.
For hash partitioning, a row should be inserted into partition X where
X = ( partcol mod partparam1 ) + 1.
The plus one handles the fact that our partition/node numbers start from 1 instead of 0. You may assume that only numeric columns will be partitioned for this assignment.
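The two placement rules can be sketched as small helper functions, assuming a numeric (long) partitioning column and string-valued partparam fields as stored in the catalog (Partitioner and its method names are illustrative):

```java
public class Partitioner {
    // Range partitioning: the row belongs to partition X when
    // partparam1 < value <= partparam2 for X. The special strings
    // "-inf" and "+inf" mark open-ended boundary ranges.
    static boolean inRange(long value, String param1, String param2) {
        boolean aboveMin = param1.equals("-inf") || value > Long.parseLong(param1);
        boolean belowMax = param2.equals("+inf") || value <= Long.parseLong(param2);
        return aboveMin && belowMax;
    }

    // Hash partitioning: X = (value mod param1) + 1, since partition/node
    // numbers start from 1 instead of 0.
    static int hashNode(long value, long param1) {
        return (int) (value % param1) + 1;
    }
}
```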
The number of nodes in the dtables relation and the number of partitions in the
config file should match. If they do not, the program should print an error message and exit. You may assume the CSV file is error-free (i.e., every row has the same number of columns of the right type).
Sample contents of a clustercfg file
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
tablename=books
partition.method=range
partition.column=age
numnodes=2
partition.node1.param1=1
partition.node1.param2=10
partition.node2.param1=10
partition.node2.param2=20
Another sample of a clustercfg file
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
tablename=books
partition.method=hash
partition.column=age
partition.param1=2
Sample csv file for books.csv
123323232,"Database Systems","Ramakrishnan,Raghu"
234323423,"Operating Systems","Silberstein, Adam"
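Quoted fields such as "Ramakrishnan,Raghu" contain commas, so a naive split on "," would break the sample rows above. A minimal hand-rolled splitter that honors double quotes (ANTLR or a CSV library would also work; CsvLine is an illustrative name):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLine {
    // Split one CSV line on commas, treating double-quoted text as a single
    // field so that "Ramakrishnan,Raghu" stays one column. The quote
    // characters themselves are dropped from the result.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;        // toggle quoted state
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());  // field boundary
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());          // last field
        return fields;
    }
}
```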
Sample Output of `./run4.sh ./cluster.cfg ./books.csv`
[jdbc:db2://10.0.0.3:50001/mydb2]: 10 rows inserted.
[jdbc:db2://10.0.0.3:50001/mydb1]: 0 rows inserted.
[jdbc:db2://10.0.0.3:50001/mycatdb]: catalog updated.
Part 4X: Multi-threaded Distributed Loader (20 extra pts)
Convert your sequential Distributed Loader into a multi-threaded one. Note that
you will need to use synchronization or locking mechanisms (see Java
synchronization). The following is a suggested design:
- A shared buffer/queue of a fixed configurable size will be allocated for
each node/thread.
- The main thread will iterate through the CSV file and partition each row.
- The main thread will copy a row to the corresponding buffer associated with the partition
- Each thread will constantly check its shared queue. If it is not empty, pop off one row and insert it into the associated database table.
- A special row or signal can be used to signal to the thread to finish.
- Note that the main thread updates each shared
queue while each worker thread continuously reads from its
queue, so synchronization is needed to ensure correct concurrent access.
- Note also that if all the buffers/queues are full, the
main thread may need to block until space becomes available.
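The suggested design maps naturally onto java.util.concurrent.ArrayBlockingQueue, which provides the fixed-size buffer, blocking for both producer and consumer, and the required synchronization. A sketch of one worker plus a sentinel "poison pill" row, with the actual JDBC INSERT replaced by a counter (all names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class LoaderQueues {
    static final String POISON = "__END__";   // sentinel telling a worker to finish

    // Worker thread: pop rows from its queue and "insert" them until the
    // sentinel arrives. take() blocks when the queue is empty, and the main
    // thread's put() blocks when the queue is full, so no busy-waiting or
    // explicit locking is needed.
    static Thread startWorker(BlockingQueue<String> queue, AtomicInteger inserted) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String row = queue.take();
                    if (row.equals(POISON)) break;
                    inserted.incrementAndGet();   // stand-in for the JDBC INSERT
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }
}
```

The main thread would allocate one such queue per node, partition each CSV row, put it on the matching queue, and finally put one POISON per queue before joining the workers.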
General Requirements
- You may program in any of the following languages: Java, C/C++, PHP, Python, Perl.
- You are encouraged to use ANTLR to generate the code to parse SQL statements as well as to parse the CSV file. It will save you a lot of time.
- You may use any third party libraries, but you need to provide them, so that your program is executable on the submission machine.
- Your code should minimally support select-from-where statements
- Your code should minimally work with DB2 Express C DBMS instances.
- You should ensure that your code is well commented and readable.
- You should develop the code on your own computer. In order for you to
receive credit for the assignment, your code must be compilable and
executable on a Linux machine. (A Linux box will be provided for you to test
your code for compliance.)
- You must provide two shell scripts, compile.sh and run.sh,
to compile and run your code respectively. The script run.sh should
take clustercfg and sqlfile as its first and second command-line
arguments.
Submission Procedure
Submit:
- One set of source files for Part 3
- compile3.sh and run3.sh for Part 3
- One set of source files for Part 4
- compile4.sh and run4.sh for Part 4
- Submission of the grammar file is optional; you may submit just the generated files.
- [OPTIONAL] If you are submitting Part 4X, you must also provide
compile4x.sh and run4x.sh.
antlrworks.jar is installed in /home/db2inst1/lib/antlrworks.jar on the submission machine
Follow the Submission Procedure