ICS 421 - Spring 2010 - Programming Assignment 2


Updated Sat Feb 27. "nopartition" changed to "notpartition"


You may work in a team of two students, but each student needs to make a submission. You are encouraged to engage in general discussions with other teams regarding the assignment, but specific details of a solution, including the solution itself, must always be the team's own work. You may submit the same code as the rest of your team.

Part 3: Naive Distributed SQL Processor (40 pts)

Using your code from Part 1 as a template, write a program runSQL that executes a given SQL statement on a cluster of computers, each running an instance of a DBMS. The input to runSQL consists of two filenames (stored in variables clustercfg and sqlfile) passed in as command-line arguments. The file clustercfg contains access information for the catalog DB. The file sqlfile contains the SQL statement, terminated by a semicolon, to be executed. The runSQL program will execute the same SQL concurrently on the database instance of each of the computers in the cluster that holds data fragments for the table, using threads. One thread should be spawned for each computer in the cluster.

runSQL should output the rows retrieved to the standard output on success or report failure.
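The one-thread-per-node pattern described above can be sketched as follows. The NodeInfo fields and the body of the per-node task are placeholders (a real implementation would open a JDBC connection with the URL, user, and password read from the catalog DB and execute the SQL there); class and method names here are illustrative, not part of the assignment specification.

```java
import java.util.*;

// Sketch: spawn one thread per cluster node, run the same SQL on each
// concurrently, and collect per-node results (or failure reports).
public class RunSQLSketch {
    record NodeInfo(String url, String user, String passwd) {}

    static List<String> runOnAllNodes(List<NodeInfo> nodes, String sql)
            throws InterruptedException {
        // synchronizedList: multiple threads append results concurrently
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (NodeInfo node : nodes) {
            Thread t = new Thread(() -> {
                // Placeholder for DriverManager.getConnection(node.url(), ...)
                // and executing `sql`; here we only record that the node ran.
                results.add(node.url() + ": OK");
            });
            threads.add(t);
            t.start();                      // one thread per node
        }
        for (Thread t : threads) t.join();  // wait for every node to finish
        return results;
    }
}
```

On success each thread would print the rows it retrieved; on failure it would report which node failed, so partial failures are visible.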

You may assume that the SQL queries operate only on single tables and do not contain any nested subqueries.

You should consider using the ANTLR parser generator to build a SQL parser that you can use to extract the table name.
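Since the assignment restricts queries to single tables with no subqueries, a rough regular expression can also pull out the table name if setting up ANTLR proves heavyweight. This is a sketch under that assumption; the class and pattern are illustrative only and handle just simple SELECT/INSERT/UPDATE/DELETE shapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract the (single) table name from a simple SQL statement.
// Looks for the identifier following FROM, INTO, or UPDATE.
public class TableNameExtractor {
    private static final Pattern TABLE = Pattern.compile(
        "(?:\\bfrom\\b|\\binto\\b|\\bupdate\\b)\\s+([A-Za-z_][A-Za-z0-9_]*)",
        Pattern.CASE_INSENSITIVE);

    static String tableName(String sql) {
        Matcher m = TABLE.matcher(sql);
        return m.find() ? m.group(1) : null;  // null if no table found
    }
}
```

An ANTLR-generated parser is still the more robust route, since a regex will misbehave on quoted identifiers or table names appearing inside string literals.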

You may test your program on a single computer by using different databases to simulate the multiple computers.


Sample contents of a clustercfg file


Sample contents of a sqlfile file


Sample Output of `./run3.sh ./cluster.cfg ./books.sql`


Part 4: Distributed Loader (60 pts)

Write a program loadCSV that loads data from a comma-separated values (CSV) file into a distributed table on the cluster. The program takes two command-line arguments, clustercfg and csvfile. The clustercfg file contains access information for the catalog DB, the name of the table to be loaded, and the partitioning information. The csvfile contains the data to be loaded. The catalog should be consulted for access information for the nodes in the cluster. Your program should also update the catalog with the partitioning information. The loader does NOT need to be multi-threaded.

dtables(tname char(32), nodedriver char(64), nodeurl char(128), nodeuser char(16), nodepasswd char(16), partmtd int, nodeid int, partcol char(32), partparam1 char(32), partparam2 char(32)) where

If the partition method is zero, i.e., notpartition, then the entire CSV file is inserted into the table at every node.

For range partitioning, a row should be inserted into partition X if its value in the partcol falls between the minimum and maximum of the range for X:
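The range test for one partition can be sketched as below. The exact boundary convention (which end is inclusive) is an assumption here and should be checked against the sample config; the method and class names are illustrative.

```java
// Sketch: does `value` fall in the range (partparam1, partparam2] for a
// partition? partparam1/partparam2 may be the special values "-inf"/"+inf".
public class RangePartition {
    static boolean inRange(double value, String partparam1, String partparam2) {
        double lo = partparam1.equals("-inf") ? Double.NEGATIVE_INFINITY
                                              : Double.parseDouble(partparam1);
        double hi = partparam2.equals("+inf") ? Double.POSITIVE_INFINITY
                                              : Double.parseDouble(partparam2);
        // Assumed convention: lower bound exclusive, upper bound inclusive,
        // so adjacent partitions' ranges do not overlap at the boundary.
        return value > lo && value <= hi;
    }
}
```

Only numeric partitioning columns need to be supported, so parsing the bounds as doubles suffices for this assignment.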

At the boundary ranges, partparam{1,2} may take the special values : -inf, +inf.

For hash partitioning, a row should be inserted into partition X if:

The plus one is to handle the fact that our partition/node numbers start from 1 instead of 0. You may assume that only numeric columns will be partitioned for this assignment.
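A minimal sketch of that mapping, assuming the hash function is simply the column value modulo the number of partitions (substitute whatever hash function the assignment's formula specifies):

```java
// Sketch: map a numeric partition-column value to a 1-based partition number.
public class HashPartition {
    static int partitionFor(long value, int numPartitions) {
        // floorMod keeps the result non-negative even for negative values;
        // "+ 1" because partition/node numbers start from 1, not 0.
        return (int) Math.floorMod(value, (long) numPartitions) + 1;
    }
}
```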

The number of nodes in the dtables relation and the number of partitions in the config file should match. If they do not, the program should print an error message and exit. You may assume the CSV file is error-free (i.e., every row has the same number of columns of the right types).


Sample contents of a clustercfg file


Another sample of a clustercfg file


Sample csv file for books.csv


Sample Output of `./run4.sh ./cluster.cfg ./books.csv`


Part 4X: Multi-threaded Distributed Loader (20 extra pts)

Convert your sequential Distributed Loader into a multi-threaded one. Note that you will need to use synchronization or locking mechanisms (see java synchronization). The following is a suggested design:
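One possible design (an assumption, not necessarily the suggested one above): the main thread reads the CSV and routes each row to a per-node BlockingQueue, while one worker thread per node drains its queue and would batch the JDBC inserts. The queue provides the required synchronization between producer and consumers, and a poison-pill value tells each worker to stop.

```java
import java.util.*;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.ToIntFunction;

// Sketch: multi-threaded loader with one worker thread per node.
public class ThreadedLoaderSketch {
    static final String EOF = "\u0000EOF";  // poison pill marking end of input

    // Returns how many rows each node received; partitionOf maps a CSV row
    // to its 1-based partition/node number.
    static int[] load(List<String> rows, int numNodes,
                      ToIntFunction<String> partitionOf)
            throws InterruptedException {
        List<BlockingQueue<String>> queues = new ArrayList<>();
        int[] counts = new int[numNodes];
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.add(q);
            final int node = i;
            Thread w = new Thread(() -> {
                try {
                    for (String row = q.take(); !row.equals(EOF); row = q.take()) {
                        counts[node]++;  // placeholder for the JDBC INSERT
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(w);
            w.start();
        }
        for (String row : rows)  // route each row to its node's queue
            queues.get(partitionOf.applyAsInt(row) - 1).put(row);
        for (BlockingQueue<String> q : queues) q.put(EOF);  // signal shutdown
        for (Thread w : workers) w.join();
        return counts;
    }
}
```

Because each worker is the only writer to its own counter (and its own node's connection), no extra locking is needed beyond what the queue provides; Thread.join() gives the main thread a consistent view of the totals afterwards.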

General Requirements

Submission Procedure

Submit:

antlrworks.jar is installed in /home/db2inst1/lib/antlrworks.jar on the submission machine

Follow the Submission Procedure