HW3: Parallel JOIN Processing for Parallel DBMS | ICS421 Database Systems II (Spring 2018)

Assignment 3: Parallel SQL Processing Supporting Joins

Part 5: Parallel Join Algorithm

Modify your code from Part 3 (runSQL) to support select-from-where queries involving joins between exactly two tables. The two tables to be joined may be distributed tables with fragments residing on multiple nodes. The input to runSQL consists of two filenames (stored in variables clustercfg and sqlfile) passed in as commandline arguments. The file clustercfg contains access information for the catalog DB. The file sqlfile contains the SQL terminated by a semi-colon to be executed. The join query needs to be executed on multiple threads/processes; However, you should develop the sequential non-multi-threaded version first and convert the join QEP into a multi-threaded one. runSQL should output the rows retrieved to the standard output on success or report failure.

You may assumed that the SQL queries only operate on at most two tables and do not contain any nested subqueries.

You should consider using the ANTLR compiler compiler to generate a SQL parser that you can use to extract the table names and to detect the join conditions. Only equi-joins need to be supported.

Policy on grammar sharing: You may discuss, share, and/or develop the grammar with other groups.

Sample contents of a `clustercfg` file


catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd

localnode.driver=com.ibm.db2.jcc.DB2Driver
localnode.hostname=10.0.0.3:50001/mydb1
localnode.username=db2inst1
localnode.passwd=mypasswd

localnode identifies the client node in the cluster that received the query and that will co-ordinate the execution of the query.

Sample contents of a `sqlfile` file


SELECT S.sname 
FROM Sailors S, Reserves R
WHERE S.sid=R.sid and R.day='2009-12-21'
;

Sample Output of `./runSQL.py ./cluster.cfg ./sailors.sql`


Ramakrishnan
Silberstein
[jdbc:db2://10.0.0.3:50001/mydb2]: ./sailors.sql success.
[jdbc:db2://10.0.0.3:50001/mydb1]: ./sailors.sql success.

It is recommended that you develop your program in the following steps

Focus on the distributed join algorithm first. Assume no multi-threading, no projection or additional selection operators in the QEP.
Once the distributed join algorithm is implemented and tested, work on incorporating projection.
Next incorporate additional filtering predicates (ie selection operators) from the where-clause
Keep in mind to pushdown predicates/selections and projections as much as possible.
Convert the join operation to a multi-threaded version by assigning each join between two fragments to a thread/process.
You may need to use temporary tables for your joins. On Derby & DB2 you will need to use the following statement: DECLARE GLOBAL TEMPORARY TABLE. For sqlite3, you may use CREATE TEMP TABLE.

Part 6: Code Integration

Integrate all the code from all the previous programming assignments into one program runSQL.py. Fix any defects from previous assignments.

For the load case, if the config file contains tablename, runSQL behaves like a loader and the sqlfile (second commandline argument) is treated as the csv file.

Otherwise, the operation is detected from the SQL statement in sqlfile.

Please ensure that query output is in pipe delimited format. No trailing whitespace for char/varchar data types should be printed in the query output. Floating point numbers should be printed up to 3 decimal spaces.

Part 7: Write Testcases

Write at least one testcase for each of the following functionality of your final program in Part 6. (There should be at least 4 testcases in total). Testcases ONLY need to run on a single machine (using different port numbers to simulate a cluster).

creating tables using DDLs
loading data into the tables created
querying the distributed tables using SQL on a single table
querying the distributed tables using SQL joining two tables

Each testcase must test a non-trivial feature or code path including exception handling. Repeating the same create table statement test with different table names is considered trivial. You should name your testcase using the format:

test<function_num>-<userid>-<testnum>.{cfg,sql,csv,...}

Suppose your user name is jane and your first testcase attempts to test the create table functionality from Assignment 1, then your testcase would consist of the following files:

test1-jane-1.cfg : config file
test1-jane-1.sql : sql or csv file
test1-jane-1.pre : shell script that sets up the required tables for the test
test1-jane-1.post : shell script that queries the databases after the test
test1-jane-1.post.exp : expected output of previous sql script
test1-jane-1.grep : alternative to exp file. the grep file contains a list of patterns that must be present in the output of the program.
test1-jane-1.txt: a description of what the test is testing.

Each testcase would be executed in the following sequence of shell commands:

bash $ ./test1-jane-1.pre
bash $ ./runSQL.py test1-jane-1.cfg test1-jane-1.sql | sort > test1-jane-1.out
bash $ ./test1-jane-1.post | sort > test1-jane-1.post.out
bash $ diff ./test1-jane-1.post.out ./test1-jane-1.post.exp

When the output to be compared is in test1-jane-1.out (eg. for the SQL queries), then the diff is performed on test1-jane-1.out and test1-jane-1.post.exp

Deliverables

1) [50 pts] A document describing how to use the runSQL program and the program does technically. The document should:

be targeted at an audience of IT-savvy users,
include instructions for all pre-requisite steps to get the program running,
describe all input configuration files, parameters etc.
describe what the program does
describe expected output and error conditions
include examples and illustrations
be at least 750 words long in size 12 fonts not including diagrams.

2) [30 pts] Code, config files & scripts (runSQL.py, parDBd.py and dependencies). Submission of ANTLR grammar files is optional, but you must submit the generated files in order for your code to compile and run.

3) [20 pts] Testcases.

Submission Procedure

1) Submit all files to Laulima->Assignments

2) Schedule an appointment with the TA, so that can you can demo your code & testcases to the TA.

Sample Testcase Files

Example of test1-jane-1.pre (assume password for MySQL is XXXXXX)

#!/bin/bash

mysqladmin -u root -pXXXXXX drop lydb1
mysqladmin -u root -pXXXXXX create lydb1
mysql -u root -pXXXXXX lydb1 < test1-jane-1.pre.sql
mysql -u root -pXXXXXX lydb2 < test1-jane-1.pre.sql

Example of test1-jane-1.pre.sql

drop table XXTNXX;

Example of test1-jane-1.cfg

catalog.driver=com.mysql.jdbc.Driver
catalog.hostname=jdbc:mysql://localhost:3306/lydb1
catalog.username=root
catalog.passwd=XXXXXX

numnodes=2

node1.driver=com.mysql.jdbc.Driver
node1.hostname=jdbc:mysql://localhost:3306/lydb1
node1.username=root
node1.passwd=XXXXXX

node2.driver=com.mysql.jdbc.Driver
node2.hostname=jdbc:mysql://localhost:3306/lydb2
node2.username=root
node2.passwd=XXXXXX

Example of test1-jane-1.sql

create table XXTNXX (col1 int, col2 char(13), col3 decimal);

Example of test1-jane-1.post

#!/bin/bash

mysql -u root -pXXXXXX lydb1 < test1-jane-1.post.1.sql
mysql -u root -pXXXXXX lydb2 < test1-jane-1.post.1.sql
mysql -u root -pXXXXXX lydb1 < test1-jane-1.post.2.sql

Example of test1-jane-1.post.1.sql

select * from XXTNXX;

Example of test1-jane-1.post.2.sql

select nodedriver, nodeurl, nodeuser 
from dtables
where tname='XXTNXX';

You will need to generate an expected output file that you manually check to be correct.