Project: Benchmarking Big Data Systems

Goals

Goal #1: Develop a benchmark based on a realistic application scenario, and

Goal #2: Experimentally compare at least two parallel/distributed data processing systems using the benchmark you developed.

What is a benchmark? A benchmark consists of a data set (or data generator) and a set of queries (or tasks) used to evaluate and compare different solutions (i.e., data processing systems) for performing those queries/tasks on the data set. A set of performance metrics is often specified as well, to standardize the evaluation criteria. One example is the TPC-H benchmark, from which the orders table was extracted to test your homework assignments. Examples of performance metrics include running time for individual queries, number of queries per second, and energy consumption per query.
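To make the two timing metrics concrete, here is a minimal sketch of a measurement harness in Python. Everything in it is an illustrative assumption rather than a required design: the names `run_benchmark` and `make_sqlite_executor` are hypothetical, the two-column `orders` table is a tiny synthetic stand-in for the TPC-H table, and SQLite is used only so the sketch runs on its own (it is not a parallel/distributed system and would not count toward the project).

```python
import sqlite3
import statistics
import time


def run_benchmark(execute, queries, repetitions=3):
    """Time every query `repetitions` times.

    `execute` is any callable that runs one query string against the
    system under test. Returns (per-query stats, queries per second).
    """
    results = {}
    total_time = 0.0
    total_runs = 0
    for name, sql in queries.items():
        timings = []
        for _ in range(repetitions):
            start = time.perf_counter()
            execute(sql)
            timings.append(time.perf_counter() - start)
        # Metric 1: running time for an individual query.
        results[name] = {
            "median_s": statistics.median(timings),
            "min_s": min(timings),
        }
        total_time += sum(timings)
        total_runs += repetitions
    # Metric 2: number of queries per second over the whole run.
    throughput = total_runs / total_time if total_time > 0 else float("inf")
    return results, throughput


def make_sqlite_executor(rows=1000):
    """Build a toy in-memory system with a synthetic orders table (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (o_orderkey INTEGER, o_totalprice REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, i * 1.5) for i in range(rows)],
    )
    conn.commit()

    def execute(sql):
        # Fetch all rows so the full cost of the query is included in the timing.
        return conn.execute(sql).fetchall()

    return execute


if __name__ == "__main__":
    queries = {
        "count_orders": "SELECT COUNT(*) FROM orders",
        "total_price": "SELECT SUM(o_totalprice) FROM orders",
    }
    stats, qps = run_benchmark(make_sqlite_executor(), queries)
    for name, s in stats.items():
        print(f"{name}: median {s['median_s']:.6f} s, min {s['min_s']:.6f} s")
    print(f"throughput: {qps:.1f} queries/s")
```

Passing the system in as an `execute` callable keeps the timing loop identical across systems, so only the connection code changes from one system to the next. Reporting the median over several repetitions reduces the influence of one-off effects such as cold caches.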

You must select at least one parallel/distributed data processing system from the following list:

Other open-source or commercial data processing systems are possible, but you must check with the instructor first. You may also use commercial systems on a trial or academic license. One of the parallel/distributed data processing systems can be the one that you implemented as part of the programming assignments.

The number of systems benchmarked grows with the size of the project team: it must be greater than or equal to the number of team members plus one. A 1-person team must benchmark at least 2 systems; a 2-person team must benchmark at least 3 systems.

Your Tasks

Here are the tasks that you will need to complete. How you distribute the work among your team members is up to you!

Creating the benchmark:

For the distributed database system you have developed:

For the parallel/distributed data processing system you will be testing against:

Deliverables

  1. [5pts] A project description including (1) title, (2) team members, (3) data set, (4) list of queries in English, and (5) data processing systems to be benchmarked. The project description should be posted in Laulima->Discussions->Projects by Monday, Apr 2, 11:59pm.

  2. [40pts] A 10-minute presentation in class

  3. [50pts] A single-spaced, 6-page written report modeled after an experimental comparison paper (e.g., "A Comparison of Approaches to Large-Scale Data Analysis" or "An Empirical Evaluation of Set Similarity Join Techniques"). The report is due May 10, 2018 and should be submitted to Laulima->Assignments. The 6 pages should contain at least 2000 words and MUST include diagrams and illustrations.

  4. [5pts] The benchmark specification, data, queries, code, and scripts are due May 10, 2018 and should be submitted to Laulima->Assignments.