This first assignment guides you through the implementation of the DDL processing component of a parallel SQL processing system. The parallel SQL processing system consists of a cluster of DBMS instances running on different machines (possibly virtual machines). DDLs submitted to the system will need to be translated into corresponding DDLs for each individual DBMS instance in the cluster and executed there. In addition a catalog database stores metadata about what data is stored for each table on each DBMS instance in the cluster.
Design and write a program runDDL that executes a given DDL statement on a cluster of computers, each running an instance of a DBMS. The input to runDDL consists of two filenames (stored in variables clustercfg and ddlfile) passed in as command-line arguments. The file clustercfg contains access information for each computer in the cluster. The file ddlfile contains the DDL, terminated by a semicolon, to be executed. The runDDL program will execute the same DDL on the database instance of each of the computers in the cluster concurrently using threads: one thread should be spawned for each computer in the cluster. The runDDL program should report the success or failure of executing the DDL on each node in the cluster to standard output.
You may test your program on a single computer by using different databases to simulate multiple computers. You should then test your program using a cluster of Docker containers simulating a cluster of computers.
You should first write a sequential, non-threaded program that executes the DDL on each computer in the cluster in a for-loop and optionally convert the program to a parallel, multi-threaded version.
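For concreteness, a minimal sketch of the threaded version might look like the following. It assumes the key=value clustercfg format shown in the example at the end of this handout, and uses sqlite3 as a stand-in DBMS so the sketch can run on one machine; the connection call inside run_ddl_on_node is what you would replace with your actual DBMS driver.

#!/usr/bin/env python3
# runDDL.py -- sketch of the threaded driver, not a complete solution.
import sys
import threading
import sqlite3  # stand-in DBMS so the sketch runs on a single machine

def read_clustercfg(path):
    # Parse key=value lines into a dict, e.g. cfg['node1.hostname'].
    cfg = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and '=' in line:
                key, val = line.split('=', 1)
                cfg[key.strip()] = val.strip()
    return cfg

def run_ddl_on_node(url, ddl, results, idx):
    # Execute the DDL on one node and record success or failure.
    try:
        # Stand-in: treat the URL as a local sqlite3 file path. For the
        # real cluster, open a connection to the DBMS at `url` instead.
        conn = sqlite3.connect(url)
        conn.execute(ddl)
        conn.commit()
        conn.close()
        results[idx] = True
    except Exception:
        results[idx] = False

def main():
    clustercfg, ddlfile = sys.argv[1], sys.argv[2]
    cfg = read_clustercfg(clustercfg)
    with open(ddlfile) as f:
        ddl = f.read().strip()
    n = int(cfg['numnodes'])
    urls = [cfg['node%d.hostname' % i] for i in range(1, n + 1)]
    results = [None] * n
    threads = [threading.Thread(target=run_ddl_on_node,
                                args=(url, ddl, results, i))
               for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for url, ok in zip(urls, results):
        print('[%s]: %s %s.' % (url, ddlfile, 'success' if ok else 'failed'))

if __name__ == '__main__':
    main()

Invoked as in the sample run at the end of this handout, e.g. python3 runDDL.py ./cluster.cfg ./books.sql.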
Modify your program from Part 1 so that it stores metadata
about the DDL being executed in a catalog database that may
or may not be on the local machine. The access information
of the catalog database will be provided in the clustercfg
file as well. The metadata should be stored in a table
dtables(tname char(32),
nodedriver char(64),
nodeurl char(128),
nodeuser char(16),
nodepasswd char(16),
partmtd int,
nodeid int,
partcol char(32),
partparam1 char(32),
partparam2 char(32))
where:
- tname is the name of the table involved in the DDL operation.
- nodeid is the node number associated with this node.
- nodedriver is the driver used to connect to the node in the cluster for this entry (NOT USED IN S18).
- nodeurl is the JDBC URL of the node in the cluster for this entry (SEE CLUSTERCFG EXAMPLE).
- nodeuser and nodepasswd are the userid and password of the DBMS instance at the node in the cluster for this entry (NOT USED IN S18).
- partmtd is the partition method used to partition the data in the table.
- partcol is the column(s) used by the partition method to partition the data in the table.
- partparam1 and partparam2 are parameters associated with the particular partition method.

If this table does not exist in the catalog database, your program will create it. The field tname should be obtained by a simple parse of the DDL for the keyword TABLE that precedes the table name (we will switch to a more sophisticated SQL parser later). The fields partmtd, partcol, partparam1, and partparam2 should be left NULL for this assignment. The table should be updated only on successful execution of a DDL: for a CREATE TABLE DDL it should be populated, and for a DROP TABLE DDL the relevant entries should be deleted. This operation need not be multi-threaded.
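A minimal sketch of the simple TABLE-keyword parse and the catalog update might look like the following (sqlite3 again stands in for the catalog DBMS; the dtables creation statement is taken from the schema above, and the helper names are illustrative, not required):

# Sketch of the catalog update (sqlite3 stands in for the catalog DBMS).
import re
import sqlite3

DTABLES_DDL = ('CREATE TABLE dtables(tname char(32), nodedriver char(64), '
               'nodeurl char(128), nodeuser char(16), nodepasswd char(16), '
               'partmtd int, nodeid int, partcol char(32), '
               'partparam1 char(32), partparam2 char(32))')

def parse_tname(ddl):
    # Simple parse: take the identifier right after the keyword TABLE.
    m = re.search(r'\bTABLE\s+(\w+)', ddl, re.IGNORECASE)
    return m.group(1) if m else None

def update_catalog(caturl, ddl, node_urls, ok_flags):
    # Record (or remove) dtables rows for nodes where the DDL succeeded.
    tname = parse_tname(ddl)
    cat = sqlite3.connect(caturl)   # stand-in for catalog.hostname
    try:
        cat.execute(DTABLES_DDL)    # create dtables if it does not exist
    except sqlite3.OperationalError:
        pass                        # table already exists
    is_drop = ddl.lstrip().upper().startswith('DROP')
    for nodeid, (url, ok) in enumerate(zip(node_urls, ok_flags), start=1):
        if not ok:
            continue                # only successful executions are recorded
        if is_drop:
            cat.execute('DELETE FROM dtables WHERE tname=? AND nodeurl=?',
                        (tname, url))
        else:
            # partmtd/partcol/partparam1/partparam2 stay NULL for now
            cat.execute('INSERT INTO dtables(tname, nodeurl, nodeid) '
                        'VALUES (?, ?, ?)', (tname, url, nodeid))
    cat.commit()
    cat.close()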
Convert your program to a client-server architecture: runDDL.py is the client, and you will write a server daemon parDBd.py for it. The server daemon parDBd.py should take two command-line arguments: an IP address or hostname, and a port number. See the socket programming exercise.
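A skeleton of such a daemon is sketched below. The wire protocol is an assumption (this handout does not fix one): here the client sends one DDL string per connection and the server replies 'success' or 'failed'; the local database name node.db is a hypothetical placeholder for the node's DBMS.

#!/usr/bin/env python3
# parDBd.py -- sketch of the server daemon loop; protocol is an assumption.
import socket
import sys
import sqlite3  # stand-in for the node's local DBMS

def serve(host, port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        with conn:
            ddl = conn.recv(4096).decode()       # one DDL per connection
            try:
                db = sqlite3.connect('node.db')  # hypothetical local database
                db.execute(ddl)
                db.commit()
                db.close()
                conn.sendall(b'success')
            except Exception:
                conn.sendall(b'failed')

if __name__ == '__main__':
    serve(sys.argv[1], int(sys.argv[2]))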
Deliverables:

1) A document describing how to use the runDDL.py and parDBd.py programs and what the programs do technically. The document should:
2) Code, config files & scripts. Since Part 1 and Part 2 are
cumulative, you should submit only one set of code (the
files: the runDDL.py
client program, the parDBd.py
server program and their dependencies), i.e., the
final code for Part 2.
Submit to Laulima->Assignments
clustercfg file:

Note that the driver field is not used in S18, but is kept here for backwards compatibility.
catalog.driver=com.ibm.db2.jcc.DB2Driver
catalog.hostname=10.0.0.3:50001/mycatdb
numnodes=2
node1.driver=com.ibm.db2.jcc.DB2Driver
node1.hostname=10.0.0.3:50001/mydb1
node2.driver=com.ibm.db2.jcc.DB2Driver
node2.hostname=10.0.0.3:50001/mydb2
ddlfile file:
CREATE TABLE BOOKS(isbn char(14), title char(80), price
decimal);
./run.sh ./cluster.cfg ./books.sql
[10.0.0.3:50001/mydb2]: ./books.sql success.
[10.0.0.3:50001/mydb1]: ./books.sql failed.
[10.0.0.3:50001/mycatdb]: catalog updated.
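For this sample run, only the node where the DDL succeeded is recorded in the catalog: with the simple TABLE-keyword parse described above, dtables would gain one row with tname BOOKS, nodeurl 10.0.0.3:50001/mydb2, nodeid 2, and NULL partition fields.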