ICS 421 - Spring 2010 - Project


Updated Apr 15: posted queries


You may work in a team of three or four students. You are encouraged to engage in general discussions with other teams regarding the project, but specific details of a solution, including the solution itself, must always be the team's own work.

Course Project (100 pts)

The goal of the project is to compare two distributed database systems on a given benchmark data set and queries. At least one of the systems must be the one that you implemented as part of the programming assignments. Here are the tasks that you will need to do; how you distribute the work among your team members is up to you!

Distributed DBMS Software For Comparison

Deadlines

Deliverables

Submission Procedure

Submit:

antlrworks.jar is installed in /home/db2inst1/lib/antlrworks.jar on the submission machine

Pan-STARRS1 Benchmark

Brief Description:

The project will use a benchmark based on the Pan-STARRS project. In brief, the Pan-STARRS project scans the sky for astronomical objects that might pose a threat to Earth. Each scan of the sky produces detections of astronomical objects at different positions. By "stacking up" different scans of the same position/region of the sky, moving objects may be detected.

Data

The benchmark consists of the following tables with associated cardinalities:

    41983848 rows in Object
  1014349449 rows in Detection
   209919240 rows in StackDetection
    83646580 rows in StackApFlx
    83646580 rows in StackModelFit
     3458632 rows in Photoz
        2903 rows in FrameMeta
           5 rows in Filter
           2 rows in PhotozRecipe
Loaded into an SQLite database, the data produces a file of 164,978,269,184 bytes (about 165 GB) before indexing on objectID. The DDL for all the tables can be found in crtables.sql, with a DB2-optimized version in crtables.db2.

The data resides on ec2-67-202-8-208.compute-1.amazonaws.com in the /mnt/astronomy/ directory. Consult the wiki in laulima for more information.

The nneighbor (nearest neighbor) table has not been populated yet and is pending the actual neighbor determinations. If we limit the associations to the N nearest neighbors, this table will contain N times as many rows as there are objects. It could be smaller should we decide to limit it to the N nearest neighbors within some fixed distance.

Since the current query workload only queries the Object and Detection tables, you only need to partition and load those two tables.

Partitioning

We will partition the large tables on the objectID. The objectID is constructed from the position of the object in the sky (see image below).

The current PS1 database has adopted an approach that slices the sky into zones in declination (our default zone height is 0.008333 deg). The sky is partitioned into 16 slices with the declination widths adjusted to give approximately the same number of sources in each slice. All the large tables (objects, detections, stackdetection, stackapflx, stackmodelfit, etc.) are partitioned in this way. The partition key is the objectID which is an attribute in each of these tables. This approach is used to keep sources that are close together on the sky close together on the computer disks.
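The "declination widths adjusted to give approximately the same number of sources in each slice" idea above is an equal-frequency split. A minimal sketch in Python, assuming you already have the partition-key values sorted (the function name and input are hypothetical, not part of the benchmark):

```python
def slice_boundaries(sorted_keys, n_slices=16):
    """Return n_slices - 1 boundary values that split sorted_keys into
    slices of roughly equal row counts (equal-frequency partitioning)."""
    n = len(sorted_keys)
    # boundary i is the key value at the i-th equal-frequency quantile
    return [sorted_keys[(i * n) // n_slices] for i in range(1, n_slices)]
```

Each slice then holds the keys between two consecutive boundaries, so dense regions of the sky get narrower slices.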

The calculation of the objectID given a source's reference position (right ascension and declination in degrees) is shown in the reproduction below of one of our PowerPoint slides. Given the object's declination, we compute the zone id (and its fractional part); the integer part of the zone id comprises the first 5 digits of the objectID. The next 9 digits of the objectID are computed by multiplying the right ascension by 1,000,000 and taking the integer part of the result. The final 4 digits of the objectID are taken from the fractional part of the full zone id.

Thus, the objid encodes the zoneid in its most significant digits, followed by the right ascension and, through the zone fraction, the declination. Examination of object ids computed in this way for real sources shows that the least significant digit of the objid is essentially uniformly distributed over the digits 0-9 and could be used as the basis for a random hashing of the sources.
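The digit layout described above can be sketched in Python. The (dec + 90) zone offset and the truncation rules are assumptions, since the slide itself is not reproduced here:

```python
ZONE_HEIGHT = 0.008333  # degrees, the default zone height given above

def object_id(ra, dec):
    """Sketch of the 18-digit objectID layout: 5 zone digits,
    9 right-ascension digits, 4 zone-fraction digits.
    Assumes zone id = (dec + 90) / ZONE_HEIGHT so zones are non-negative."""
    zone = (dec + 90.0) / ZONE_HEIGHT
    zone_int = int(zone)                          # first 5 digits
    ra_part = int(ra * 1_000_000)                 # next 9 digits
    zone_frac = int((zone - zone_int) * 10_000)   # final 4 digits
    return zone_int * 10**13 + ra_part * 10**4 + zone_frac
```

For example, a source at ra = 180.0, dec = 0.0 falls in zone int((0 + 90) / 0.008333) = 10800, and the middle 9 digits are 180000000.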

If one wants to apply a form of spatial partitioning, a simple approach would be to partition on the zoneid. This attribute can be found either in the Object table or by simply taking the first 5 digits of the objid.
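Both partitioning options above reduce to simple arithmetic on the objid. A sketch, assuming the fixed 18-digit layout (5 zone digits, 9 right-ascension digits, 4 zone-fraction digits); the function names are illustrative:

```python
def zone_of(objid):
    """Spatial option: the zoneid is the most significant 5 digits,
    obtained by dropping the lower 13 digits."""
    return objid // 10**13

def node_for(objid, n_nodes):
    """Hash option: partition on the (roughly uniform) least
    significant digit of the objid."""
    return (objid % 10) % n_nodes
```

Range partitioning would then assign contiguous zoneid intervals to nodes, while the hash option spreads neighboring sources across nodes.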

The current data set has the following characteristics:

For the purpose of this project, you may partition the data along

Queries

List of SQL queries