The Basic Data Set


In section 7.3.2, we already used several sets of real temporal data to get an idea of realistic sizes of IP-tables. One of these sets was login information about accesses to a supercomputer system at the Edinburgh Parallel Computing Centre (EPCC). The set comprises accesses over a period of approximately five years. We used this set as the basis for creating the test data for the experiments in this chapter. The initial set had the form shown in figure 10.1 and contained over 125000 entries. Lines that did not contain any suitable information, such as those marked with a * in figure 10.1, were deleted. This process left us with a set of 121728 entries. These were translated into two temporal relations, which we will refer to as R and Q. In both relations, the start and end points of the timestamps are integer values. The relations' lifespans are the same, ranging from 0 to 10079. This range corresponds to a week-long period measured in minutes: $7 \cdot 24 \cdot 60 = 10080$. The differences between R and Q are the following:

R has what we call a periodic profile. This means that the function $i_{\scriptscriptstyle R}(t)$, which gives the number of tuples whose timestamp intersects with time t, consists of a pattern that is periodically repeated. In the case of the login data, one can assume that the login behaviour of the users repeats itself every day, with weekends showing a reduced number of accesses. We can expect a similar profile in many other scenarios, such as the distribution and lengths of phone calls over a period of several days, or holiday bookings (assuming yearly repeated patterns in the latter case). Figure 10.2 shows the periodic profile of R. It is discussed in more detail below.
The timestamp intervals of R were created from the login information in the following way: of each line, only the weekday and the start and end times were used. Times on Mondays were converted to numbers $0, \dots, 1439$, times on Tuesdays to $1440, \dots, 2879$, times on Wednesdays to $2880, \dots, 4319$, etc. For example, the line
```
root   ttyp3   yanis.epcc.ed.a  Sun Oct 27 11:45 - 11:46  (00:00)
```
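results in the interval $[6 \cdot 1440 + 11 \cdot 60 + 45,\; 6 \cdot 1440 + 11 \cdot 60 + 46] = [9345, 9346]$.
The source code for the Perl script that converts login data of the form shown in figure 10.1 can be found in appendix B.1.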
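The weekday-based conversion described above can be sketched as follows. This is not the Perl script from appendix B.1, only a minimal Python illustration; the function name `week_interval`, the regular expression and the choice of Monday as day 0 are assumptions made for this example. Sessions that cross midnight are not handled, since the text does not say how the original script treats them.

```python
import re

MINUTES_PER_DAY = 24 * 60
# Assumption: the week starts on Monday (day 0) and ends on Sunday (day 6).
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Matches the date/time part of a login line, e.g. "Sun Oct 27 11:45 - 11:46".
LOGIN_RE = re.compile(
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+\w{3}\s+\d{1,2}\s+"
    r"(\d{2}):(\d{2})\s+-\s+(\d{2}):(\d{2})"
)

def week_interval(line):
    """Return (start, end) in minutes of the week, or None for unusable lines."""
    match = LOGIN_RE.search(line)
    if match is None:
        return None  # e.g. lines without a start and end time
    offset = WEEKDAYS.index(match.group(1)) * MINUTES_PER_DAY
    start = offset + 60 * int(match.group(2)) + int(match.group(3))
    end = offset + 60 * int(match.group(4)) + int(match.group(5))
    return start, end
```

Applied to the example line, `week_interval` returns `(9345, 9346)`, because Sunday contributes an offset of 6 · 1440 = 8640 minutes.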
Q has what we call a non-periodic profile. This means that the function $i_{\scriptscriptstyle Q}(t)$, which gives the number of tuples whose timestamp intersects with time t, does not show patterns that are periodically repeated. In the case of the login data, one can assume that the login behaviour of the users during one day is non-periodic: from early morning onwards, there is a gradually growing number of users logging into the system. In the afternoon, this number starts to decrease, with only a few users being logged in during the night. A similar scenario is again the distribution and lengths of phone calls during a single day. Figure 10.3 shows the non-periodic profile of Q. It is discussed in more detail below.
The timestamp intervals of Q were created from the login information in the following way: of each line, only the start and end times were used. These times were converted to minutes of the day, i.e. a time hh:mm was mapped to $60 \cdot hh + mm$ in the range $0, \dots, 1439$.
In a second step, these times were mapped to the range $0, \dots, 10079$ by multiplying them by 7 and adding a random number between 0 and 6. The random number prevents all interval start and end points from being multiples of 7. As an example, we consider again the line
```
root   ttyp3   yanis.epcc.ed.a  Sun Oct 27 11:45 - 11:46  (00:00)
```
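which results in the interval $[7 \cdot 705 + \mathrm{rand}(0 \dots 6),\; 7 \cdot 706 + \mathrm{rand}(0 \dots 6)]$,
where 705 and 706 are the minutes of the day for 11:45 and 11:46, and rand(0 ... 6) randomly chooses a number from the set $\{0, 1, \dots, 6\}$. We note that the result is not deterministic because of the random numbers. The source code for the Perl script that converted login data of the form shown in figure 10.1 can be found in appendix B.2.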
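The two-step mapping used for Q can be sketched in the same way. Again, this is not the script from appendix B.2 but a hedged Python illustration; the names `minute_of_day`, `stretch_to_week` and `q_interval` are hypothetical. The sketch assumes that the two random offsets are drawn independently. Swapping the end points when a zero-length session ends up with start > end is a choice made for this example only.

```python
import random

def minute_of_day(hhmm):
    """Convert a clock time 'hh:mm' to a minute of the day (0 ... 1439)."""
    hours, minutes = hhmm.split(":")
    return 60 * int(hours) + int(minutes)

def stretch_to_week(minute, rng):
    """Map a minute of the day to 0 ... 10079: multiply by 7, add rand(0 ... 6)."""
    return 7 * minute + rng.randint(0, 6)  # randint includes both bounds

def q_interval(start_hhmm, end_hhmm, rng=None):
    """Return a (start, end) timestamp for Q from two clock times."""
    rng = rng or random.Random()
    start_minute = minute_of_day(start_hhmm)
    end_minute = minute_of_day(end_hhmm)
    start = stretch_to_week(start_minute, rng)
    end = stretch_to_week(end_minute, rng)
    # Assumption: if both clock times are equal, independent offsets may
    # produce start > end; this sketch swaps them to keep the interval valid.
    if start_minute == end_minute and end < start:
        start, end = end, start
    return start, end
```

For the example line, `q_interval("11:45", "11:46")` yields a start point between 4935 and 4941 and an end point between 4942 and 4948. The largest value that can occur is 7 · 1439 + 6 = 10079, which matches the upper end of the lifespan.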

**Figure:** An extract of the original login information.

The procedures that created the two collections of interval timestamps, R and Q, achieved different profiles. However, the procedure for Q mapped the original range of $0, \dots, 1439$ to one of $0, \dots, 10079$. This also increases the lengths of the intervals. In fact, the average length of an interval in R is 118.5 (minutes) so far, whereas the average in Q is larger. With respect to the join performance, this difference could mask any effect that is caused by the different profiles. In order to avoid this, we applied an additional procedure change_lengths() to bring the average interval lengths of R and Q in line, namely to a value of 300 (minutes). The source code for change_lengths() is given in appendix C. Essentially, it randomly picks intervals and adds or deletes chronons from them until the desired average length is achieved. We will also use this procedure for controlling the experiments in section 10.5.
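The idea behind change_lengths() can be illustrated with a small Python sketch. It is not the code from appendix C. Several details are assumptions made for this example: the name `change_lengths_sketch`, the definition of an interval's length as end minus start, and the decision to add or remove chronons only at the end point, one chronon per step.

```python
import random

def change_lengths_sketch(intervals, target_avg, lifespan_end=10079, rng=None):
    """Grow or shrink randomly picked intervals until the average length
    equals target_avg. `intervals` is a list of [start, end] lists and is
    modified in place. Length is taken as end - start (an assumption)."""
    rng = rng or random.Random()
    total = sum(end - start for start, end in intervals)
    target_total = round(target_avg * len(intervals))
    max_total = sum(lifespan_end - start for start, _ in intervals)
    if not 0 <= target_total <= max_total:
        raise ValueError("target average cannot be reached within the lifespan")
    while total != target_total:
        interval = rng.choice(intervals)
        if total < target_total and interval[1] < lifespan_end:
            interval[1] += 1  # add one chronon at the end point
            total += 1
        elif total > target_total and interval[1] > interval[0]:
            interval[1] -= 1  # delete one chronon at the end point
            total -= 1
    return intervals
```

Changing one chronon per step keeps the sketch short but is slow for relations of the size used here; a practical version would adjust intervals in larger steps.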

The final profiles for R and Q are shown in figures 10.2 and 10.3, respectively. $i_{\scriptscriptstyle R}(t)$ has seven peaks corresponding to the daytime hours of the seven weekdays Monday, Tuesday, ..., Sunday. As one can expect, there are fewer accesses on Saturdays and Sundays: the two rightmost peaks are significantly lower than the previous ones. In contrast, $i_{\scriptscriptstyle Q}(t)$ describes the accesses during a day (if we ignore the values of the time axis for a moment): as one can expect, there is a sharp rise during the morning, with a small dip around lunch time. In the afternoon there is a second peak, followed by a sharp fall towards the evening. As we see from this interpretation of the profiles, a large number of factors contribute to their shapes. This underlines the high statistical complexity that we can expect in many scenarios.
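A profile function of this kind, i.e. the number of tuples whose timestamp intersects with time t, can be computed for all t in one pass over the intervals. The following Python sketch uses a difference array; the function name `profile` is hypothetical, and the sketch assumes closed intervals, so that a tuple still counts at its end point.

```python
def profile(intervals, lifespan=10080):
    """Return a list p where p[t] is the number of intervals that contain t.
    Intervals are (start, end) pairs with 0 <= start <= end < lifespan and
    are treated as closed (an assumption of this sketch)."""
    diff = [0] * (lifespan + 1)
    for start, end in intervals:
        diff[start] += 1    # the interval becomes active at start
        diff[end + 1] -= 1  # and is no longer active after end
    counts = []
    running = 0
    for t in range(lifespan):
        running += diff[t]
        counts.append(running)
    return counts
```

For instance, `profile([(0, 2), (1, 3), (3, 3)], lifespan=5)` returns `[1, 2, 2, 2, 0]`.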

Table 10.1 summarises the main characteristics of R and Q. We will use R and Q as the base for the experiments; some of the parameters, however, will be varied, such as the average interval lengths (section 10.5) or |R| and |Q| (section 10.6). We note that the parameters shown in table 10.1 approximately match those used for the uniform data experiments in section 8.5.2. Similarly, we assume the architectural parameters listed in table 10.2, which correspond to those in table 8.9. Both a parallel architecture and a single-processor architecture will be used in the experiments.

**Figure:** The periodic profile $i_{\scriptscriptstyle R}(t)$ of R.

**Figure:** The non-periodic profile $i_{\scriptscriptstyle Q}(t)$ of Q.

**Table:** The characteristics of the base relations R and Q.
Parameter	R	Q
size (in tuples)	\|R\| = 121728	\|Q\| = 121728
profile	periodic	non-periodic
lifespan (in minutes)	\|L(R)\| = 10080	\|L(Q)\| = 10080
average interval length (in minutes)	300	300
tuple size (in bytes)	\|r\| = 500	\|q\| = 500

**Table:** The parameters describing the architecture that is used in the experiments.
Parameter	Description	Value
	processor speed in MIPS	200 MIPS
	free main memory per node	32 MB
	disk I/O bandwidth per node	20 MB/sec
	communication bandwidth	40 MB/sec
	memory bandwidth per node	400 MB/sec
	number of CPU instructions for processing a tuple in each step	1000
	number of CPU instructions for hashing a tuple	1000
	number of CPU instructions for initiating a data transfer	500
	number of CPU instructions for initiating a disk I/O	500
b	page size	4 kB
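To give a feeling for the magnitudes implied by tables 10.1 and 10.2, the following Python lines derive a few quantities from the listed values. This is only a back-of-the-envelope sketch and not part of the cost model used in the experiments. It assumes that 4 kB means 4096 bytes, that 1 MB means 10^6 bytes, and that tuples do not span pages.

```python
TUPLES = 121728          # |R| = |Q|, from table 10.1
TUPLE_BYTES = 500        # tuple size, from table 10.1
PAGE_BYTES = 4 * 1024    # page size; assumption: 4 kB = 4096 bytes
DISK_BYTES_PER_SEC = 20 * 10**6  # disk I/O bandwidth; assumption: 1 MB = 10^6 bytes

tuples_per_page = PAGE_BYTES // TUPLE_BYTES        # 8 whole tuples fit on a page
pages = -(-TUPLES // tuples_per_page)              # ceiling division: 15216 pages
relation_bytes = TUPLES * TUPLE_BYTES              # 60864000 bytes of tuple data
scan_seconds = relation_bytes / DISK_BYTES_PER_SEC # roughly 3 seconds on one node
```

Under these assumptions, one relation holds about 61 MB of tuple data, which exceeds the 32 MB of free main memory per node.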


Thomas Zurek