Project: Benchmarking Big Data Systems

Goals

Goal #1: Develop a benchmark based on a realistic application scenario, and

Goal #2: Experimentally compare at least two parallel/distributed data processing systems using the benchmark you developed.

What is a benchmark? A benchmark consists of a data set (or data generator) and a set of queries (or tasks) that are used to evaluate and compare different solutions (i.e., data processing systems) for performing those queries/tasks on the data set. A set of performance metrics is often also specified to standardize the evaluation criteria. An example of a benchmark is the TPC-H benchmark, from which the orders table was extracted to test your homework assignments. Examples of performance metrics include running time for individual queries, number of queries per second, and energy consumption per query.
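As a rough illustration of the metrics mentioned above, the following Python sketch times a set of queries and derives the running time per query and the number of queries per second. The `run_query` callable and the query names are hypothetical placeholders; in a real benchmark, `run_query` would submit the query to the system under test.

```python
import time


def measure(run_query, queries):
    """Run each query once and derive two simple performance metrics."""
    per_query_seconds = {}
    start_all = time.perf_counter()
    for q in queries:
        start = time.perf_counter()
        run_query(q)  # hypothetical: send the query to the system under test
        per_query_seconds[q] = time.perf_counter() - start
    total = time.perf_counter() - start_all
    return {
        "per_query_seconds": per_query_seconds,
        "total_seconds": total,
        "queries_per_second": len(queries) / total if total > 0 else float("inf"),
    }


# Stand-in workload: a short sleep simulates query execution time.
result = measure(lambda q: time.sleep(0.01), ["Q1", "Q2", "Q3"])
print(result["per_query_seconds"])
print(result["queries_per_second"])
```

Other metrics, such as energy consumption per query, need measurements from outside the script and are not covered by this sketch.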

You must select at least one parallel/distributed data processing system from the following list:

Storm
Kafka
Neo4j
Spark
Cassandra
ELK
Splunk

Other open-source or commercial data processing systems are possible, but you must check with the instructor first. You may also use commercial systems on a trial or academic license. One of the parallel/distributed data processing systems can be the one that you implemented as part of the programming assignments.

The number of systems benchmarked grows with the size of the project team: it must be greater than or equal to the team size plus one. A 1-person team must benchmark at least 2 systems; a 2-person team must benchmark at least 3 systems.

Your Tasks

Here are the tasks that you will need to do. How you distribute the work among your team members is up to you!

Creating the benchmark:

Choose a data analysis domain or application (see Reading).
Find a data set or develop a data generator.
Develop the data processing queries or tasks.
Choose or develop the performance metrics that would be most relevant to the domain or application.
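If no suitable data set is available, a data generator can be quite small. The Python sketch below writes synthetic rows in CSV format. The column names (`order_id`, `customer_id`, `amount`), the value ranges, and the default seed are illustrative assumptions, not a prescribed schema. Using a fixed seed keeps the generated data identical across runs, so every system is loaded with the same input.

```python
import csv
import io
import random


def generate_rows(n, seed=42, num_customers=100):
    """Yield n synthetic order rows; the same seed yields the same rows."""
    rng = random.Random(seed)
    for order_id in range(1, n + 1):
        customer_id = rng.randint(1, num_customers)
        amount = round(rng.uniform(1.0, 500.0), 2)
        yield (order_id, customer_id, amount)


def write_csv(out, n, seed=42):
    """Write a header line plus n generated rows to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["order_id", "customer_id", "amount"])
    writer.writerows(generate_rows(n, seed))


# Usage: generate five rows into memory and show them.
buf = io.StringIO()
write_csv(buf, 5)
print(buf.getvalue())
```

Passing an open file instead of `io.StringIO()` writes the data to disk; increasing `n` scales the data set.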

For the distributed database system you have developed:

Modify your program so that the benchmark data can be loaded and the queries can run.
Upload, install, and configure your distributed database program on the Google Cloud VM nodes (provided).
Load the benchmark data. This may include tweaking the DDLs, etc.
Write any additional scripts to run and time queries, etc. This may include tweaking the syntax of the queries.
Run the benchmark queries with different numbers of nodes, partitioning methods, etc.
Analyze and graph the results (using a spreadsheet, for example). Do they make sense? Can you explain the results?
Iterate to optimize and tune your code if necessary.
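A script for running and timing queries, as described in the steps above, might look like the following Python sketch. It performs an untimed warm-up run, repeats each query several times, and records the median together with the configuration (number of nodes, partitioning method) as CSV, which can then be opened in a spreadsheet. The `fake_run_query` function, the query text, and the configuration values are hypothetical stand-ins for your own system and settings.

```python
import csv
import io
import statistics
import sys
import time


def time_query(run_query, query, repeats=5, warmups=1):
    """Return the median wall-clock seconds over `repeats` timed runs."""
    for _ in range(warmups):
        run_query(query)  # untimed warm-up run (caches, connections)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def run_benchmark(run_query, queries, nodes, partitioning, out=sys.stdout):
    """Write one CSV row per query: configuration labels plus median time."""
    writer = csv.writer(out)
    writer.writerow(["nodes", "partitioning", "query", "median_seconds"])
    for name, text in queries.items():
        median = time_query(run_query, text)
        writer.writerow([nodes, partitioning, name, f"{median:.6f}"])


def fake_run_query(sql):
    """Hypothetical stand-in for submitting a query to the system under test."""
    time.sleep(0.001)


# Usage: one configuration, one query, results collected in memory.
buf = io.StringIO()
run_benchmark(fake_run_query, {"Q1": "SELECT 1"}, nodes=2, partitioning="hash", out=buf)
print(buf.getvalue())
```

The median is used here because it is less sensitive to a single slow run than the mean; reporting the minimum or the full distribution are equally reasonable choices.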

For the parallel/distributed data processing system you will be testing against:

Read up on the distributed DBMS software that you will be using.
Download, install, and configure the distributed DBMS software on the Google Cloud VM nodes (provided).
Load the benchmark data into the system. This may include tweaking the DDLs, etc.
Write any additional scripts to run and time queries, etc. This may include tweaking the syntax of the queries.
Run the benchmark queries with different numbers of nodes, partitioning methods, etc.
Analyze and graph the results (using a spreadsheet, for example). Do they make sense? Can you explain the results?
Iterate to tune the system if necessary.
Compare the results. Iterate if necessary.
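One way to check whether the results make sense when comparing systems is to compute the speedup of each system relative to its own smallest configuration. The Python sketch below assumes results are available as `(system, nodes, seconds)` tuples; that layout, the system names, and the numbers in the example are made up for illustration and are not measurements.

```python
from collections import defaultdict


def speedups(results):
    """Map system -> {nodes: speedup vs. that system's smallest node count}.

    `results` is a list of (system, nodes, seconds) tuples.
    """
    by_system = defaultdict(dict)
    for system, nodes, seconds in results:
        by_system[system][nodes] = seconds
    out = {}
    for system, times in by_system.items():
        baseline = times[min(times)]  # time at the smallest node count
        out[system] = {n: baseline / t for n, t in sorted(times.items())}
    return out


# Made-up numbers purely for illustration.
example = [
    ("system_a", 1, 8.0), ("system_a", 2, 4.0), ("system_a", 4, 2.5),
    ("system_b", 1, 6.0), ("system_b", 2, 4.0), ("system_b", 4, 3.0),
]
for system, table in speedups(example).items():
    print(system, table)
```

A speedup well below the increase in node count, or one that gets worse as nodes are added, is a result worth explaining in the report, for example in terms of data skew or communication overhead.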

Deliverables

[5pts] A project description including (1) title, (2) team members, (3) data set, (4) list of queries in English, and (5) data processing systems to be benchmarked. The project description should be posted in Laulima->Discussions->Projects by Monday, Apr 2, 11:59pm.
[40pts] A 10-minute presentation in class
[50pts] A single-spaced 6-page written report modeled after an experimental comparison paper (e.g., A Comparison of Approaches to Large-Scale Data Analysis, An Empirical Evaluation of Set Similarity Join Techniques) is due May 10, 2018 and should be submitted to Laulima->Assignments. The 6 pages should contain at least 2000 words and MUST include diagrams and illustrations.
[5pts] The benchmark specification, data, queries, code, and scripts are due May 10, 2018 and should be submitted to Laulima->Assignments.