Assignment 3: Parallel SQL Processing Supporting Joins

Part 5: Parallel Join Algorithm

Modify your code from Part 3 (runSQL) to support select-from-where queries involving joins between exactly two tables. The two tables to be joined may be distributed tables with fragments residing on multiple nodes. The input to runSQL consists of two filenames (stored in variables clustercfg and sqlfile) passed in as commandline arguments. The file clustercfg contains access information for the catalog DB. The file sqlfile contains the SQL terminated by a semi-colon to be executed. The join query needs to be executed on multiple threads/processes; However, you should develop the sequential non-multi-threaded version first and convert the join QEP into a multi-threaded one. runSQL should output the rows retrieved to the standard output on success or report failure.

You may assumed that the SQL queries only operate on at most two tables and do not contain any nested subqueries.

You should consider using the ANTLR compiler compiler to generate a SQL parser that you can use to extract the table names and to detect the join conditions. Only equi-joins need to be supported.

Policy on grammar sharing: You may discuss, share, and/or develop the grammar with other groups.

Sample contents of a clustercfg file


catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd

localnode.driver=com.ibm.db2.jcc.DB2Driver
localnode.hostname=10.0.0.3:50001/mydb1
localnode.username=db2inst1
localnode.passwd=mypasswd

localnode identifies the client node in the cluster that received the query and that will co-ordinate the execution of the query.

Sample contents of a sqlfile file


SELECT S.sname 
FROM Sailors S, Reserves R
WHERE S.sid=R.sid and R.day='2009-12-21'
;

Sample Output of ./runSQL.py ./cluster.cfg ./sailors.sql


Ramakrishnan
Silberstein
[jdbc:db2://10.0.0.3:50001/mydb2]: ./sailors.sql success.
[jdbc:db2://10.0.0.3:50001/mydb1]: ./sailors.sql success.

It is recommended that you develop your program in the following steps


Part 6: Code Integration

Integrate all the code from all the previous programming assignments into one program runSQL.py. Fix any defects from previous assignments.

For the load case, if the config file contains tablename, runSQL behaves like a loader and the sqlfile (second commandline argument) is treated as the csv file.

Otherwise, the operation is detected from the SQL statement in sqlfile.

Please ensure that query output is in pipe delimited format. No trailing whitespace for char/varchar data types should be printed in the query output. Floating point numbers should be printed up to 3 decimal spaces.


Part 7: Write Testcases

Write at least one testcase for each of the following functionality of your final program in Part 6. (There should be at least 4 testcases in total). Testcases ONLY need to run on a single machine (using different port numbers to simulate a cluster).

  1. creating tables using DDLs
  2. loading data into the tables created
  3. querying the distributed tables using SQL on a single table
  4. querying the distributed tables using SQL joining two tables

Each testcase must test a non-trivial feature or code path including exception handling. Repeating the same create table statement test with different table names is considered trivial. You should name your testcase using the format:

test<function_num>-<userid>-<testnum>.{cfg,sql,csv,...}

Suppose your user name is jane and your first testcase attempts to test the create table functionality from Assignment 1, then your testcase would consist of the following files:

Each testcase would be executed in the following sequence of shell commands:

bash $ ./test1-jane-1.pre
bash $ ./runSQL.py test1-jane-1.cfg test1-jane-1.sql | sort > test1-jane-1.out
bash $ ./test1-jane-1.post | sort > test1-jane-1.post.out
bash $ diff ./test1-jane-1.post.out ./test1-jane-1.post.exp

When the output to be compared is in test1-jane-1.out (eg. for the SQL queries), then the diff is performed on test1-jane-1.out and test1-jane-1.post.exp


Deliverables

1) [50 pts] A document describing how to use the runSQL program and the program does technically. The document should:

2) [30 pts] Code, config files & scripts (runSQL.py, parDBd.py and dependencies). Submission of ANTLR grammar files is optional, but you must submit the generated files in order for your code to compile and run.

3) [20 pts] Testcases.

Submission Procedure

1) Submit all files to Laulima->Assignments

2) Schedule an appointment with the TA, so that can you can demo your code & testcases to the TA.


Sample Testcase Files

Example of test1-jane-1.pre (assume password for MySQL is XXXXXX)

#!/bin/bash

mysqladmin -u root -pXXXXXX drop lydb1
mysqladmin -u root -pXXXXXX create lydb1
mysql -u root -pXXXXXX lydb1 < test1-jane-1.pre.sql
mysql -u root -pXXXXXX lydb2 < test1-jane-1.pre.sql

Example of test1-jane-1.pre.sql

drop table XXTNXX;

Example of test1-jane-1.cfg

catalog.driver=com.mysql.jdbc.Driver
catalog.hostname=jdbc:mysql://localhost:3306/lydb1
catalog.username=root
catalog.passwd=XXXXXX

numnodes=2

node1.driver=com.mysql.jdbc.Driver
node1.hostname=jdbc:mysql://localhost:3306/lydb1
node1.username=root
node1.passwd=XXXXXX

node2.driver=com.mysql.jdbc.Driver
node2.hostname=jdbc:mysql://localhost:3306/lydb2
node2.username=root
node2.passwd=XXXXXX

Example of test1-jane-1.sql

create table XXTNXX (col1 int, col2 char(13), col3 decimal);

Example of test1-jane-1.post

#!/bin/bash

mysql -u root -pXXXXXX lydb1 < test1-jane-1.post.1.sql
mysql -u root -pXXXXXX lydb2 < test1-jane-1.post.1.sql
mysql -u root -pXXXXXX lydb1 < test1-jane-1.post.2.sql

Example of test1-jane-1.post.1.sql

select * from XXTNXX;

Example of test1-jane-1.post.2.sql

select nodedriver, nodeurl, nodeuser 
from dtables
where tname='XXTNXX';

You will need to generate an expected output file that you manually check to be correct.