Goal #1: Develop a benchmark based on a realistic application scenario, and
Goal #2: Compare experimentally at least two parallel/distributed data processing systems using the benchmark you developed.
What is a benchmark? A benchmark consists of a data set (or data generator) and a set of queries (or tasks) that are used to evaluate and compare different solutions (i.e., data processing systems) for performing those queries/tasks on the data set. A set of performance metrics is often also specified to standardize the evaluation criteria. An example of a benchmark is the TPC-H benchmark, from which the orders table was extracted to test your homework assignments. Examples of performance metrics include running time for individual queries, number of queries per second, energy consumption per query, etc.
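For concreteness, here is a minimal sketch of how two such metrics (per-query running time and queries per second) could be measured. This is only an illustration, not part of the assignment: it assumes a hypothetical SQLite database file benchmark.db already loaded with the TPC-H orders table, and a made-up two-query workload.

```python
import sqlite3
import time

# Hypothetical workload: a small list of benchmark queries over the
# TPC-H orders table (both the database file and queries are assumptions).
QUERIES = [
    "SELECT COUNT(*) FROM orders",
    "SELECT o_orderpriority, COUNT(*) FROM orders GROUP BY o_orderpriority",
]

conn = sqlite3.connect("benchmark.db")

total = 0.0
for q in QUERIES:
    start = time.perf_counter()
    conn.execute(q).fetchall()  # fetchall() forces full evaluation of the result
    elapsed = time.perf_counter() - start
    total += elapsed
    print(f"{elapsed:.3f}s  {q}")  # running time for an individual query

# Throughput metric: queries per second over the whole workload.
print(f"{len(QUERIES) / total:.2f} queries/sec")
```

A real benchmark harness would also repeat each query several times and report averages (and warm vs. cold cache numbers), but the same timing pattern applies.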
You may select any parallel/distributed data processing system that is available to you. The Apache Software Foundation has many open-source data processing systems (e.g., Storm, Spark, Hive). You may also use commercial systems under a trial or academic license. One of the parallel/distributed data processing systems can be the one that you implemented as part of the programming assignments.
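To make the comparison fair, you will want to run the same benchmark queries unchanged on each system. As one illustration (an assumption, not a required setup), the following sketch times a benchmark query on Spark SQL, assuming PySpark is installed and the orders table is available as a CSV file with a header row:

```python
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName("benchmark").getOrCreate()

# Hypothetical input: orders.csv is assumed to hold the TPC-H orders table.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

start = time.perf_counter()
# collect() forces execution; Spark is lazy, so timing only the query
# definition would measure nothing.
spark.sql(
    "SELECT o_orderpriority, COUNT(*) FROM orders GROUP BY o_orderpriority"
).collect()
print(f"{time.perf_counter() - start:.3f}s")
```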
Here are the tasks that you will need to do. How you distribute the work among your team members is up to you!
Creating the benchmark:
For the distributed database system you have developed:
For the parallel/distributed data processing system you will be testing against:
[5pts] A project description including (1) title, (2) team members, (3) data set, (4) list of queries in English, and (5) data processing systems to be benchmarked. The project description should be posted in Laulima->Discussions->Projects by Monday, Apr 10, 11:59pm.
[5pts] A draft of the report, which may still be missing content in the results/experiments section, is due Wed, May 3, 2017. The draft should be a Google Doc, and the link should be submitted to Laulima->Assignments.
[35pts] A 10-minute presentation in class
[50pts] A single-spaced, 6-page written report modeled after an experimental comparison paper (e.g., "A Comparison of Approaches to Large-Scale Data Analysis" or "An Empirical Evaluation of Set Similarity Join Techniques") is due Wed, May 10, 2017 and should be submitted to Laulima->Assignments. The 6 pages should contain at least 2000 words and MUST include diagrams and illustrations.
[5pts] The benchmark specification, data, queries, code, and scripts are due Wed, May 10, 2017 and should be submitted to Laulima->Assignments.