Goal #1: Develop a benchmark based on a realistic application scenario, and
Goal #2: Compare experimentally at least two parallel/distributed data processing systems using the benchmark you developed.
What is a benchmark? A benchmark consists of a data set (or data generator) and a set of queries (or tasks) that are used to evaluate and compare different solutions (i.e., data processing systems) for performing those queries/tasks on the data set. A set of performance metrics is often also specified to standardize the evaluation criteria. An example of a benchmark is the TPC-H benchmark, from which the orders table was extracted to test your homework assignments. Examples of performance metrics include running time for individual queries, number of queries per second, energy consumption per query, etc.
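For concreteness, here is a minimal sketch of how two such metrics (per-query running time and queries per second) could be measured. This is only an illustration, not part of the assignment: it assumes a hypothetical SQLite database file benchmark.db already loaded with the TPC-H orders table, and a made-up two-query workload.

```python
import sqlite3
import time

# Hypothetical workload: a small list of benchmark queries over the
# TPC-H orders table (both the database file and queries are assumptions).
QUERIES = [
    "SELECT COUNT(*) FROM orders",
    "SELECT o_orderpriority, COUNT(*) FROM orders GROUP BY o_orderpriority",
]

conn = sqlite3.connect("benchmark.db")

total = 0.0
for q in QUERIES:
    start = time.perf_counter()
    conn.execute(q).fetchall()  # fetchall() forces full evaluation of the result
    elapsed = time.perf_counter() - start
    total += elapsed
    print(f"{elapsed:.3f}s  {q}")  # running time for an individual query

# Throughput metric: queries per second over the whole workload.
print(f"{len(QUERIES) / total:.2f} queries/sec")
```

A real benchmark harness would also repeat each query several times and report averages (and warm vs. cold cache numbers), but the same timing pattern applies.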
You may select any parallel/distributed data processing system that is available to you. The Apache Software Foundation has many open-source data processing systems (e.g., Storm, Spark, Hive). You may also use commercial systems under a trial or academic license. One of the parallel/distributed data processing systems can be the one that you implemented as part of the programming assignments.
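To make the comparison fair, you will want to run the same benchmark queries unchanged on each system. As one illustration (an assumption, not a required setup), the following sketch times a benchmark query on Spark SQL, assuming PySpark is installed and the orders table is available as a CSV file with a header row:

```python
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName("benchmark").getOrCreate()

# Hypothetical input: orders.csv is assumed to hold the TPC-H orders table.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

start = time.perf_counter()
# collect() forces execution; Spark is lazy, so timing only the query
# definition would measure nothing.
spark.sql(
    "SELECT o_orderpriority, COUNT(*) FROM orders GROUP BY o_orderpriority"
).collect()
print(f"{time.perf_counter() - start:.3f}s")
```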
Here are the tasks that you will need to do. How you distribute the work among your team members is up to you!
Creating the benchmark:
For the distributed database system you have developed:
For the parallel/distributed data processing system you will be testing against:
[5pts] A project description including (1) title, (2) team members, (3) data set, (4) list of queries in English, and (5) data processing systems to be benchmarked. The project description should be posted in Laulima->Discussions->Projects by Monday, Apr 10, 11:59pm.
[5pts] A draft of the report, which may still be missing content in the results/experiments section, is due Wed, May 3, 2017. The draft should be a Google Doc, and the link should be submitted to Laulima->Assignments.
[35pts] A 10-minute presentation in class
[50pts] A single-spaced, 6-page written report modeled after an experimental comparison paper (e.g., "A Comparison of Approaches to Large-Scale Data Analysis" or "An Empirical Evaluation of Set Similarity Join Techniques") is due Wed, May 10, 2017 and should be submitted to Laulima->Assignments. The 6 pages should contain at least 2000 words and MUST include diagrams and illustrations.
[5pts] The benchmark specification, data, queries, code, and scripts are due Wed, May 10, 2017 and should be submitted to Laulima->Assignments.