In section 7.3.2, we already used several sets of real temporal data to get an idea of realistic sizes of IP-tables. One of these sets was login information about accesses to a supercomputer system at the Edinburgh Parallel Computing Centre (EPCC). The set comprises accesses over a period of approximately five years. We used this set as the basis for creating test data for the experiments in this chapter. The initial set had the form shown in figure 10.1 and contained over 125000 entries. Lines that did not contain any suitable information, such as those marked with a * in figure 10.1, were deleted. This left us with a set of 121728 entries, which were translated into two temporal relations that we will refer to as R and Q. In both relations, the start- and endpoints of the timestamps are integer values. The relations' lifespans are the same, ranging from 0 to 10079. This range corresponds to a week-long period measured in minutes: $7 \cdot 24 \cdot 60 = 10080$. The differences between R and Q are the following:
The timestamp intervals of R were created from the login information in the following way: of each line, only the weekday and the start and end times were used. Times on Mondays were converted to numbers $0 \ldots 1439$, times on Tuesdays to $1440 \ldots 2879$, times on Wednesdays to $2880 \ldots 4319$, etc. For example, the line

    root ttyp3 yanis.epcc.ed.a Sun Oct 27 11:45 - 11:46 (00:00)

results in the interval $[6 \cdot 1440 + 705,\; 6 \cdot 1440 + 706] = [9345, 9346]$.
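
The weekday mapping described above can be sketched in a few lines of Python. This is only an illustration, not the Perl script from appendix B.1: the function names `minute_of_week` and `r_interval` are hypothetical, and the sketch assumes that Monday is day 0 and that a login starts and ends on the same weekday (how the original script treated logins that cross midnight is not stated here).

```python
# Hypothetical helper names; this is NOT the Perl script from appendix B.1.
MINUTES_PER_DAY = 24 * 60
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # Monday is day 0


def minute_of_week(weekday, clock):
    """Map a weekday abbreviation and an HH:MM time to a minute in 0..10079."""
    hours, minutes = map(int, clock.split(":"))
    return WEEKDAYS.index(weekday) * MINUTES_PER_DAY + hours * 60 + minutes


def r_interval(weekday, start, end):
    """Build the (start, end) timestamp for one login line.

    Assumption: start and end fall on the same weekday.
    """
    return minute_of_week(weekday, start), minute_of_week(weekday, end)


print(r_interval("Sun", "11:45", "11:46"))  # (9345, 9346)
```
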
The source code for the Perl script that converts login data as in figure 10.1 can be found in appendix B.1.

The timestamp intervals of Q were created from the login information in the following way: of each line, only the start and end times were used. These times were converted to minutes of the day, i.e. mapped to the range $0 \ldots 1439$.
In a second step, these times were mapped to the range $0 \ldots 10079$ by multiplying them by 7 and adding a random number between 0 and 6. The random number prevents all interval start- and endpoints from being multiples of 7. As an example, we consider again the line

    root ttyp3 yanis.epcc.ed.a Sun Oct 27 11:45 - 11:46 (00:00)

which results in the interval $[7 \cdot 705 + \mathrm{rand}(0 \ldots 6),\; 7 \cdot 706 + \mathrm{rand}(0 \ldots 6)]$, where rand(0 ... 6) randomly chooses a number from the set $\{0, 1, \ldots, 6\}$. We note that the result is not deterministic because of the random numbers. The source code for the Perl script that converted login data as in figure 10.1 can be found in appendix B.2.
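
As a rough illustration of the two-step mapping for Q, the following Python sketch converts a time to a minute of the day and then stretches it to the week-long range. It is not the Perl script from appendix B.2; the names `minute_of_day`, `stretch` and `q_interval` are hypothetical, and the seeded generator is only there to make a run repeatable.

```python
import random


def minute_of_day(clock):
    """Map an HH:MM time to a minute of the day in 0..1439."""
    hours, minutes = map(int, clock.split(":"))
    return hours * 60 + minutes


def stretch(minute, rng):
    """Map 0..1439 to 0..10079: multiply by 7 and add a random offset from 0..6."""
    return 7 * minute + rng.randint(0, 6)


def q_interval(start, end, rng):
    """Build the (start, end) timestamp for one login line; the weekday is ignored."""
    return stretch(minute_of_day(start), rng), stretch(minute_of_day(end), rng)


rng = random.Random(42)  # fixed seed, purely for repeatability of this sketch
lo, hi = q_interval("11:45", "11:46", rng)
print(lo, hi)  # lo lies in 4935..4941, hi in 4942..4948
```

Because the sketch draws the two offsets independently, a login that starts and ends in the same minute could end up with an endpoint smaller than its start point. How the original script dealt with that case is not stated here, so the sketch leaves it open.
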
The procedures that created the two collections of interval timestamps, R and Q, achieved different profiles. However, the procedure for Q mapped the original range of $0 \ldots 1439$ to one of $0 \ldots 10079$. This also increases the lengths of the intervals. In fact, the average length of an interval in R is 118.5 minutes so far, whereas the intervals in Q are longer on average. With respect to the join performance, this difference could overshadow any effect that is caused by the different profiles. To avoid this, we applied an additional procedure change_lengths() to bring the average interval lengths of R and Q in line, namely to a value of 300 minutes. The source code for change_lengths() is given in appendix C. Essentially, it randomly picks intervals and adds or deletes chronons from them until the desired average length is achieved. We will also use this procedure for controlling the experiments in section 10.5.
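
The idea behind change_lengths() can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code from appendix C: it assumes that an interval is a (start, end) pair whose length is end minus start, that only the endpoint is moved, that one chronon is added or removed per step, and that intervals must stay within the lifespan 0 to 10079. The parameters `lifespan_end` and `seed` are hypothetical.

```python
import random


def change_lengths(intervals, target_avg, lifespan_end=10079, seed=0):
    """Randomly grow or shrink intervals by one chronon at a time until the
    average length (end - start) equals target_avg.

    Assumptions of this sketch: only endpoints move, start points are kept.
    """
    rng = random.Random(seed)
    result = [list(iv) for iv in intervals]
    target_total = round(target_avg * len(result))
    # Largest total length reachable without leaving the lifespan.
    max_total = sum(lifespan_end - start for start, _ in result)
    if not 0 <= target_total <= max_total:
        raise ValueError("target average is not reachable within the lifespan")
    total = sum(end - start for start, end in result)
    while total != target_total:
        iv = rng.choice(result)  # pick a random interval
        if total < target_total and iv[1] < lifespan_end:
            iv[1] += 1  # add one chronon
            total += 1
        elif total > target_total and iv[1] > iv[0]:
            iv[1] -= 1  # delete one chronon
            total -= 1
    return [tuple(iv) for iv in result]


sample = [(0, 10), (100, 130), (5000, 5020)]
adjusted = change_lengths(sample, target_avg=300)
print(sum(end - start for start, end in adjusted) / len(adjusted))  # 300.0
```

Changing one chronon per step keeps the sketch close to the description above; it is simple rather than fast.
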
The final profiles for R and Q are shown in figures 10.2 and 10.3,
respectively. The profile of R has seven peaks corresponding to the
daytime hours of the seven weekdays Monday, Tuesday, ..., Sunday. As
one can expect, there are fewer accesses on Saturdays and Sundays: the
two rightmost peaks are significantly lower than the previous ones. In
contrast, the profile of Q describes the accesses during a day (if we
ignore the values of the time axis for a moment): as one can expect,
there is a sharp rise during the morning, with a little valley during
lunch time. In the afternoon there is a second peak, followed by a
sharp fall towards the evening. As we see from this interpretation of
the profiles, a large number of factors contribute to their shapes.
This underlines the high statistical complexity that we can expect in
many scenarios.
Table 10.1 summarises the main characteristics of R and Q. We will use R and Q as the base for the experiments; some of the parameters, however, will be varied, such as the average interval lengths (section 10.5) or |R| and |Q| (section 10.6). We note that the parameters shown in table 10.1 approximately match those that were used for the uniform data experiments in section 8.5.2. Similarly, we assume the architectural parameters listed in table 10.2, which correspond to those in table 8.9. A parallel architecture and a single-processor architecture will be used in the experiments.

| Parameter | R | Q |
|---|---|---|
| size (in tuples) | \|R\| = 121728 | \|Q\| = 121728 |
| profile | periodic | non-periodic |
| lifespan (in minutes) | \|L(R)\| = 10080 | \|L(Q)\| = 10080 |
| average interval length (in minutes) | 300 | 300 |
| tuple size (in bytes) | \|r\| = 500 | \|q\| = 500 |


| Parameter | Description | Value |
|---|---|---|
| | processor speed in MIPS | 200 MIPS |
| | free main memory per node | 32 MB |
| | disk I/O bandwidth per node | 20 MB/sec |
| | communication bandwidth | 40 MB/sec |
| | memory bandwidth per node | 400 MB/sec |
| | number of CPU instructions for processing a tuple in each step | 1000 |
| | number of CPU instructions for hashing a tuple | 1000 |
| | number of CPU instructions for initiating a data transfer | 500 |
| | number of CPU instructions for initiating a disk I/O | 500 |
| b | page size | 4 kB |
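
To get a feeling for what these parameters imply for a relation of the size of R or Q, the following back-of-envelope calculation may help. It is only a sketch and not part of the cost model used in the experiments: it assumes binary units (1 kB = 1024 bytes, 1 MB = 1024 kB), that tuples do not span pages, and all constant names are hypothetical.

```python
TUPLES = 121728                     # |R| = |Q|
TUPLE_BYTES = 500                   # |r| = |q|
PAGE_BYTES = 4 * 1024               # assumes 4 kB = 4096 bytes
MEMORY_BYTES = 32 * 1024 ** 2       # assumes 32 MB = 32 * 2**20 bytes
DISK_BYTES_PER_SEC = 20 * 1024 ** 2  # assumes 20 MB/sec in binary units

relation_bytes = TUPLES * TUPLE_BYTES
tuples_per_page = PAGE_BYTES // TUPLE_BYTES   # tuples are assumed not to span pages
pages = -(-TUPLES // tuples_per_page)         # ceiling division
scan_seconds = relation_bytes / DISK_BYTES_PER_SEC
fits_in_memory = relation_bytes <= MEMORY_BYTES

print(relation_bytes)    # 60864000
print(tuples_per_page)   # 8
print(pages)             # 15216
print(round(scan_seconds, 1))
print(fits_in_memory)    # False
```

Under these assumptions, a single relation is larger than the free main memory of one node.
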