Next: A General Comparison between Up: The Test Data Previous: Introduction

The Basic Data Set

In section 7.3.2, we already used several sets of real temporal data in order to get an idea of realistic sizes of IP-tables. One of these sets was login information about accesses to a supercomputer system at the Edinburgh Parallel Computing Centre (EPCC). The set comprises accesses over a period of approximately five years. We used this set as the basis for creating the test data for the experiments in this chapter. The initial set had the form shown in figure 10.1 and contained over 125000 entries. Lines that did not contain any suitable information, such as those marked with a * in figure 10.1, were deleted. This process left us with a set of 121728 entries. These were translated into two temporal relations, to which we will refer as R and Q. In both relations, the start and end points of the timestamps are integer values. The relations' lifespans are the same, ranging from 0 to 10079. The intention behind this range is that it corresponds to a week-long period in terms of minutes: $7 \cdot 24 \cdot 60 = 10080$. R and Q differ in how their timestamps were derived from the original data and, consequently, in their profiles.


  
Figure 10.1: An extract of the original login information.

The procedures that created the two collections of interval timestamps, R and Q, produced different profiles. However, the procedure for Q also mapped the original range of its timestamps onto a different one, which increases the lengths of the intervals. In fact, the average length of an interval in R is 118.5 (minutes) so far, whereas the average in Q is larger. With respect to the join performance, this difference could mask any effect that is caused by the different profiles. In order to avoid this, we applied an additional procedure change_lengths() to bring the average interval lengths of R and Q in line, namely to a value of 300 (minutes). The source code for change_lengths() is given in appendix C. Essentially, it randomly picks intervals and adds or deletes chronons from them until the desired average length is achieved. We will also use this procedure for controlling the experiments in section 10.5.
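The idea behind change_lengths() can be illustrated with a minimal Python sketch. It is not the code of appendix C. The function name change_lengths_sketch, the closed-interval convention (an interval [ts, te] has te - ts + 1 chronons), the choice of which endpoint to move, and the seed parameter are all assumptions made for this illustration.

```python
import random


def change_lengths_sketch(intervals, target_avg, lifespan_end=10079, seed=0):
    """Randomly grow or shrink intervals until their average length
    reaches target_avg (rounded to a whole number of chronons in total).

    Assumptions (not taken from the original source code):
    - intervals are closed integer pairs (ts, te) with 0 <= ts <= te <= lifespan_end
    - the length of (ts, te) is te - ts + 1 chronons
    """
    rng = random.Random(seed)
    result = [list(iv) for iv in intervals]
    n = len(result)
    if n == 0:
        return []
    max_len = lifespan_end + 1
    if not 1 <= target_avg <= max_len:
        raise ValueError("target_avg must lie between 1 and the lifespan length")

    target_total = round(target_avg * n)
    total = sum(te - ts + 1 for ts, te in result)

    while total != target_total:
        iv = rng.choice(result)          # pick an interval at random
        if total < target_total:
            # add one chronon, staying inside the lifespan
            if iv[1] < lifespan_end:
                iv[1] += 1
                total += 1
            elif iv[0] > 0:
                iv[0] -= 1
                total += 1
        elif iv[1] > iv[0]:
            # delete one chronon, keeping at least one per interval
            iv[1] -= 1
            total -= 1
    return [tuple(iv) for iv in result]
```

A call such as change_lengths_sketch(timestamps, 300) would correspond to the adjustment described above. Each iteration changes the total length by at most one chronon, so the number of iterations grows with the gap between the current and the desired total length.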

The final profiles for R and Q are shown in figures 10.2 and 10.3, respectively. $i_{\scriptscriptstyle R}(t)$ has seven peaks corresponding to the daytime hours of the seven weekdays Monday, Tuesday, ..., Sunday. As one can expect, there are fewer accesses during Saturdays and Sundays: the two rightmost peaks are significantly lower than the previous ones. In contrast, $i_{\scriptscriptstyle Q}(t)$ describes the accesses during a day (if we ignore the values of the time axis for a moment): as one can expect, there is a sharp rise during the morning, with a little valley during lunch time. In the afternoon there is a second peak, followed by a sharp fall towards the evening. As we see from this interpretation of the profiles, a large number of factors contribute to their shapes. This underlines the high statistical complexity that we can expect in many scenarios.
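To make the notion of a profile concrete, the following Python sketch computes such a curve from a set of interval timestamps. This section does not restate the definition of the profile, so the sketch assumes that the value at time t is the number of tuples whose (closed, integer) interval covers t. The function name interval_profile is hypothetical.

```python
def interval_profile(intervals, lifespan_end=10079):
    """Return a list p with p[t] = number of intervals covering chronon t.

    Assumption: intervals are closed integer pairs (ts, te) with
    0 <= ts <= te <= lifespan_end.
    """
    # Difference array: +1 where an interval starts, -1 just after it ends.
    diff = [0] * (lifespan_end + 2)
    for ts, te in intervals:
        diff[ts] += 1
        diff[te + 1] -= 1

    # A running sum over the difference array yields the profile.
    profile = []
    running = 0
    for t in range(lifespan_end + 1):
        running += diff[t]
        profile.append(running)
    return profile
```

Under this assumption, plotting interval_profile applied to the timestamps of R against t would produce a curve of the kind described for figure 10.2, with one peak per day of the week.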

Table 10.1 summarises the main characteristics of R and Q. We will use R and Q as the basis for the experiments; some of the parameters, however, will be varied, such as the average interval lengths (section 10.5) or |R| and |Q| (section 10.6). We note that the parameters shown in table 10.1 approximately match those that were used for the uniform data experiments in section 8.5.2. Similarly, we assume the architectural parameters listed in table 10.2, which correspond to those in table 8.9. Both a parallel architecture and a single-processor architecture will be used in the experiments.




  
Figure 10.2: The periodic profile $i_{\scriptscriptstyle R}(t)$ of R.




  
Figure 10.3: The non-periodic profile $i_{\scriptscriptstyle Q}(t)$ of Q.


 
Table 10.1: The characteristics of the base relations R and Q.

Parameter                             R               Q
size (in tuples)                      |R| = 121728    |Q| = 121728
profile                               periodic        non-periodic
lifespan (in minutes)                 |L(R)| = 10080  |L(Q)| = 10080
average interval length (in minutes)  300             300
tuple size (in bytes)                 |r| = 500       |q| = 500


 
Table 10.2: The parameters describing the architecture that is used in the experiments.

Description                                                      Value
processor speed                                                  200 MIPS
free main memory per node                                        32 MB
disk I/O bandwidth per node                                      20 MB/sec
communication bandwidth                                          40 MB/sec
memory bandwidth per node                                        400 MB/sec
number of CPU instructions for processing a tuple in each step   1000
number of CPU instructions for hashing a tuple                   1000
number of CPU instructions for initiating a data transfer        500
number of CPU instructions for initiating a disk I/O             500
page size b                                                      4 kB
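
To give a feeling for the magnitudes implied by tables 10.1 and 10.2, the following Python sketch does some back-of-the-envelope arithmetic for one relation on a single node. It is not the cost model used in the experiments. It assumes that 1 kB = 1024 bytes and 1 MB = 1024 kB, that tuples do not span page boundaries, and that a scan touches every tuple exactly once. The function name scan_estimate is hypothetical.

```python
import math


def scan_estimate(tuples=121728, tuple_bytes=500, page_bytes=4 * 1024,
                  disk_bytes_per_sec=20 * 1024 * 1024,
                  mips=200, instr_per_tuple=1000):
    """Rough figures for reading and processing one relation once.

    Assumptions: binary units, tuples never split across pages,
    one sequential pass, no overlap of I/O and CPU work.
    """
    tuples_per_page = page_bytes // tuple_bytes       # whole tuples per page
    pages = math.ceil(tuples / tuples_per_page)       # pages holding the relation
    io_seconds = pages * page_bytes / disk_bytes_per_sec
    cpu_seconds = tuples * instr_per_tuple / (mips * 1_000_000)
    return {
        "tuples_per_page": tuples_per_page,
        "pages": pages,
        "io_seconds": io_seconds,
        "cpu_seconds": cpu_seconds,
    }
```

With the default values, each 4 kB page holds 8 tuples of 500 bytes, so a relation of 121728 tuples occupies 15216 pages. Under the stated assumptions, a single sequential read takes roughly three seconds and processing each tuple once takes well under a second.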



Thomas Zurek