Assignment 2: Naive SQL Processing for Parallel DBMS

Learning Objectives


Part 3: Naive Parallel SQL Processor

Using your code from Part 1 as a template, write a program runSQL that executes a given SQL statement on a cluster of computers, each running an instance of a DBMS. The input to runSQL consists of two filenames (stored in variables clustercfg and sqlfile) passed in as command-line arguments. The file clustercfg contains access information for the catalog DB. The file sqlfile contains the SQL statement, terminated by a semicolon, to be executed. The runSQL program executes the same SQL concurrently, using threads, on the database instance of each computer in the cluster that holds data fragments for the table. One thread should be spawned for each such computer. On success, runSQL should print the retrieved rows to standard output; otherwise it should report failure.
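A minimal sketch of the thread-per-node fan-out described above. The class and method names here are illustrative, and executeOnNode is a stand-in for the real JDBC work (opening a java.sql.Connection to the node URL, running the statement, printing rows):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RunSqlSketch {
    // Run the same SQL against every node, one thread per node.
    // Results keyed by node URL: true = success, false = failure.
    static Map<String, Boolean> runOnCluster(List<String> nodeUrls, String sql)
            throws InterruptedException {
        Map<String, Boolean> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (String url : nodeUrls) {
            Thread t = new Thread(() -> results.put(url, executeOnNode(url, sql)));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();  // wait for every node to finish
        return results;
    }

    // Placeholder for the real per-node JDBC work
    // (DriverManager.getConnection, Statement.executeQuery, print rows).
    static boolean executeOnNode(String url, String sql) {
        return !url.isEmpty();  // pretend any non-empty URL succeeds
    }
}
```

The ConcurrentHashMap absorbs concurrent writes from the worker threads, so no explicit locking is needed to collect the per-node outcomes.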

You may assume that the SQL queries operate only on single tables and do not contain any nested subqueries.

You should consider using the ANTLR parser generator to build a SQL parser that you can use to extract the table name.
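Because the queries are restricted to a single table with no subqueries, even a regex can pull out the table name as a stopgap while you get the ANTLR grammar working. This class is a hypothetical sketch, not part of the required design:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableName {
    // Grab the identifier following FROM, INTO, or UPDATE.
    // Safe only for this assignment's restricted SQL (one table, no subqueries).
    private static final Pattern P = Pattern.compile(
        "\\b(?:FROM|INTO|UPDATE)\\s+([A-Za-z_][A-Za-z0-9_]*)",
        Pattern.CASE_INSENSITIVE);

    static String extract(String sql) {
        Matcher m = P.matcher(sql);
        return m.find() ? m.group(1) : null;
    }
}
```

For example, extract("SELECT * FROM Books;") yields "Books". An ANTLR-generated parser remains the robust route once quoted identifiers or more complex SQL enter the picture.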

You may test your program on a single computer by using different databases to simulate the multiple computers.

Sample contents of a clustercfg file


catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd
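Since the clustercfg format is plain key=value pairs, java.util.Properties can parse it directly; no hand-written parsing is needed. The class and field names below are illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

public class ClusterConfig {
    final String driver, url, user, passwd;

    // Load catalog access information from a clustercfg reader.
    ClusterConfig(Reader in) throws IOException {
        Properties p = new Properties();
        p.load(in);
        driver = p.getProperty("catalog.driver");
        url    = p.getProperty("catalog.hostname"); // a JDBC URL, despite the key name
        user   = p.getProperty("catalog.username");
        passwd = p.getProperty("catalog.passwd");
    }
}
```

Note that the catalog.hostname value is actually a full JDBC URL, so it can be passed to DriverManager.getConnection as-is.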

Sample contents of a sqlfile file

SELECT *
FROM Books;

Sample Output of ./run3.sh ./cluster.cfg ./books.sql

123323232 Database Systems       Ramakrishnan
234323423 Operating Systems      Silberstein
[jdbc:db2://10.0.0.3:50001/mydb2]: ./books.sql success.
[jdbc:db2://10.0.0.3:50001/mydb1]: ./books.sql failed.

Part 4: Parallel Loader

Write a program loadCSV that loads data from a comma-separated values (CSV) file into a distributed table on the cluster. The program takes two command-line arguments, clustercfg and csvfile. The clustercfg file contains access information for the catalog DB, the name of the table to be loaded, and the partitioning information. The csvfile contains the data to be loaded. The catalog should be consulted for access information for the nodes in the cluster, and your program should also update the catalog with the partitioning information. The loader does NOT need to be multi-threaded. You should use a library for parsing the CSV instead of writing your own parser from scratch.

dtables(tname char(32), 
   nodedriver char(64), 
   nodeurl char(128), 
   nodeuser char(16), 
   nodepasswd char(16), 
   partmtd int, 
   nodeid int, 
   partcol char(32), 
   partparam1 char(32),
   partparam2 char(32))

where

The partmtd field should take the following values: 0 for no partitioning (notpartition), 1 for range partitioning, and 2 for hash partitioning.

Note that you will need to convert the partition method, specified as a string in the config file, to the integer partition method stored in dtables: the config-file values notpartition, range, and hash correspond to partmtd 0, 1, and 2 respectively.

If the partition method is 0, i.e., notpartition, then the entire CSV file is inserted into the table at every node.

For range partitioning, a row should be inserted into partition X if its value in partcol falls between the minimum and maximum of the range for X:

partparam1 < partcol <= partparam2.

At the boundary ranges, partparam{1,2} may take the special values -inf and +inf.
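The half-open range check, including the -inf/+inf bounds, can be sketched as follows (class and method names are illustrative; the params arrive as strings because dtables stores them as char columns):

```java
public class RangePartition {
    // Map a bound string to a double; -inf and +inf become IEEE infinities,
    // so the comparison below needs no special cases at the boundaries.
    static double parseBound(String s) {
        if (s.equals("-inf")) return Double.NEGATIVE_INFINITY;
        if (s.equals("+inf")) return Double.POSITIVE_INFINITY;
        return Double.parseDouble(s);
    }

    // True iff partparam1 < value <= partparam2.
    static boolean inRange(double value, String param1, String param2) {
        return parseBound(param1) < value && value <= parseBound(param2);
    }
}
```

With the sample config above, a row with age = 10 satisfies 1 < 10 <= 10 and goes to node 1, not node 2; the exclusive lower bound is what keeps adjacent ranges like (1,10] and (10,20] from overlapping.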

For hash partitioning, a row should be inserted into partition X where

X = ( partcol mod partparam1 ) + 1.

The plus one handles the fact that our partition/node numbers start from 1 instead of 0. You may assume that only numeric columns will be partitioned for this assignment. The number of nodes in the dtables relation and the number of partitions in the config file should match; if not, the program should print an error message and exit. You may assume the CSV file is error free (i.e., every row has the same number of columns, each of the right type).
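The hash-routing formula is a one-liner; the only subtlety worth encoding is that Java's % operator can return a negative remainder for negative keys, which Math.floorMod avoids (class and method names are illustrative):

```java
public class HashPartition {
    // X = (colValue mod numPartitions) + 1, with 1-based node numbers.
    // Math.floorMod keeps the remainder non-negative even for negative keys.
    static int targetNode(long colValue, int numPartitions) {
        return (int) Math.floorMod(colValue, (long) numPartitions) + 1;
    }
}
```

For the sample books.csv with partition.param1=2, key 123323232 hashes to node 1 and 234323423 to node 2.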

Sample contents of a clustercfg file


catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd

tablename=books
partition.method=range
partition.column=age

numnodes=2
partition.node1.param1=1
partition.node1.param2=10

partition.node2.param1=10
partition.node2.param2=20

Another sample of a clustercfg file


catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=jdbc:db2://10.0.0.3:50001/mycatdb
catalog.username=db2inst1
catalog.passwd=mypasswd

tablename=books

partition.method=hash
partition.column=age
partition.param1=2

Sample csv file for books.csv


123323232,"Database Systems","Ramakrishnan,Raghu"
234323423,"Operating Systems","Silberstein, Adam"

Sample Output of ./run4.sh ./cluster.cfg ./books.csv


[jdbc:db2://10.0.0.3:50001/mydb2]: 10 rows inserted.
[jdbc:db2://10.0.0.3:50001/mydb1]: 0 rows inserted.
[jdbc:db2://10.0.0.3:50001/mycatdb]: catalog updated.

Part 4X: Multi-threaded Distributed Loader (20 extra pts)

Convert your sequential distributed loader into a multi-threaded one. Note that you will need to use synchronization or locking mechanisms (see Java synchronization). The following is a suggested design:
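One possible design (an assumption, not the prescribed one): the main thread parses the CSV and routes each row to a per-node BlockingQueue, and one worker thread per node drains its queue and performs the INSERTs. The queue provides the synchronization, and an AtomicInteger tallies rows per node for the final report. The sketch below stubs out the JDBC work:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class LoaderSketch {
    // Sentinel row telling a worker its queue is exhausted.
    static final String POISON = "\u0000EOF";

    // Worker loop for one node: in the real loader each row taken from the
    // queue becomes an INSERT on that node's connection.
    static int drain(BlockingQueue<String> queue, AtomicInteger counter)
            throws InterruptedException {
        String row;
        while (!(row = queue.take()).equals(POISON)) {
            counter.incrementAndGet();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        AtomicInteger inserted = new AtomicInteger();
        Thread worker = new Thread(() -> {
            try { drain(q, inserted); } catch (InterruptedException ignored) {}
        });
        worker.start();
        for (String row : new String[] {"r1", "r2", "r3"}) q.put(row);
        q.put(POISON);   // signal end of input
        worker.join();
        System.out.println(inserted.get() + " rows inserted.");
    }
}
```

The poison-pill sentinel lets the producer shut workers down without interrupting them mid-insert; BlockingQueue.take handles all the waiting and locking internally.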


Test Data

You should test your code against a variety of data. The following DDL and test data from the TPC-H benchmark are provided as a starting point.

CREATE TABLE ORDERS  ( 
O_ORDERKEY       INTEGER NOT NULL,
O_CUSTKEY        INTEGER NOT NULL,
O_ORDERSTATUS    CHAR(1) NOT NULL,
O_TOTALPRICE     DECIMAL(15,2) NOT NULL,
O_ORDERDATE      DATE NOT NULL,
O_ORDERPRIORITY  CHAR(15) NOT NULL,  
O_CLERK          CHAR(15) NOT NULL, 
O_SHIPPRIORITY   INTEGER NOT NULL,
O_COMMENT        VARCHAR(79) NOT NULL);

A 16MB data file for ORDERS is available here.


Deliverables

1) [50 pts] A document describing how to use the runSQL and loadCSV programs and what each program does technically. The document should:

2) [20 pts] Code, config files & scripts (compile3.sh and run3.sh) for Part 3. Submission of ANTLR grammar files is optional, but you must submit the generated files in order for your code to compile and run. Your code will be tested using MySQL databases.

3) [30 pts] Code, config files & scripts (compile4.sh and run4.sh) for Part 4. Your code will be tested using MySQL.

4) [extra 20 pts] Code, config files & scripts (compile.sh and run.sh) for Part 4X. Your code will be tested using MySQL.

Submission Procedure

1) Writing: submit a link to the Google Doc via Laulima->Assignments. Please ensure that you share your Google Doc with the instructor. Every individual should make a submission.

2) Code: You will submit your code by uploading it onto the submission server (you can also clone it from GitHub). Create a directory called hw2 in your home directory and put your code in that directory. You need to ensure that your code compiles and runs on the submission server. If you need any additional libraries, please contact the TA to have them installed. Every group needs to make one submission.

Access the submission server at IP address: 104.196.234.107


Sample compile.sh

#!/bin/bash

jdbclib="/usr/local/lib/mysql-connector-java-5.1.40-bin.jar"
antlrlib="/usr/local/lib/antlr-4.6.jar"

javac -d ./bin -classpath "$antlrlib:$jdbclib:." ./src/*.java

Sample run.sh

#!/bin/bash

jdbclib="/usr/local/lib/mysql-connector-java-5.1.40-bin.jar"
antlrlib="/usr/local/lib/antlr-4.6.jar"

if [ $# -ne 2 ]
then
  echo "Usage: run3.sh <configfile> <sqlfile>"
  exit 1
fi

java -classpath "$jdbclib:$antlrlib:./bin" RunSQL "$1" "$2"