The Size of an IP-Table

In this section, we look at the realistic size of an IP-table and how it compares to sizes of data samples that are used in a typical data sampling approach. The ratio of the sizes of an IP-table I(R) and its corresponding temporal relation R can be calculated by

(25)

with being the size of an entry in the IP-table and referring to the size of a tuple of R.

The ratio (7.3) has to be compared to ratios achieved when sampling data. To that end, we want to look at one such example, i.e. the approach taken in [Soo et al., 1994]. It uses the Kolmogorov test statistic [Conover, 1980] which is frequently employed in data sampling approaches for query optimisation, e.g. in [DeWitt et al., 1991]. The Kolmogorov test is non-parametric which means that it does not make any assumptions about the underlying distributions of the tuples. Soo et al. conclude that one has to draw a sample whose size is determined by

(26)

with errorsize being the number of buffer pages that are provided for keeping an overflow of tuples in the buffer. This overflow can be caused by the error difference between the data characteristics of the sample and that of the entire data. Therefore one has to provide a certain buffer space to cope with such an overflow situation. Soo et al. optimise errorsize in order to minimise the accumulated costs of sampling and joining the two relations. However, the algorithm determinePartIntervals provided for that in [Soo et al., 1994] is erroneous as it always reaches the extreme case of drawing the entire relation as a sample. For that reason and in order to get an idea of an actual sample size, we assume the ratio

The ratio |V(R)|:|R| shows how many new elements are contributed to V(R) on average by a tuple r's interval. A ratio of 0.5 indicates that two intervals contribute one new timepoint to V(R), in the case of 1.0 it is one interval adding one timepoint on average and the worst case is 2.0 with each interval introducing two new timepoints (its start- and endpoint) to the plot. Therefore |V(R)|:|R| is an indicator for observing whether there are many tuples in R that share interval start- and endpoints - in this case the ratio is low - or whether most intervals have start- and endpoints that do not appear in other intervals within R - in this case the ratio is high, reaching 2.0 in the worst case when each interval has a start- and an endpoint that does not appear either as a start- or an endpoint in any other interval. Some applications will impose a low ratio, e.g. in the case of a temporal relation holding air pollution figures that are obtained through periodic measurements. Here, many intervals share the timepoints of the measurements as their start- or endpoints. In other situations, such as a temporal relation storing start- and endtimes of telephone calls or computer accesses, we can expect the start- and endpoints of tuple intervals to be arbitrarily distributed over the timeline, therefore possibly causing a higher ratio than in periodic or other regular applications. In section 7.3.2, examples of various real-world temporal relations are analysed and the respective values of the |V(R)|:|R| ratio are given.

Table 7.1 shows typical values for (7.3) depending on the and the ratio |V(R)|:|R|. For most combinations we get values that are at least as good as those achieved by data samples. But recall that an IP-table provides precise information whereas the data sample approach achieves these figures only at the expense of introducing error margins which vary immensely with the sample size.

**Table:** IP-table sizes as percentages of the original relation.
tuplesize	8c\|
in bytes	0.25	0.50	0.75	1.00	1.25	1.50	1.75	2.00
100	3.5%	7.0%	10.5%	14.0%	17.5%	21.0%	24.5%	28.0%
200	1.8%	3.5%	5.3%	7.0%	8.8%	10.5%	12.3%	14.0%
300	1.2%	2.3%	3.5%	4.7%	5.8%	7.0%	8.2%	9.3%
400	0.9%	1.8%	2.6%	3.5%	4.4%	5.3%	6.1%	7.0%
500	0.7%	1.4%	2.1%	2.8%	3.5%	4.2%	4.9%	5.6%
600	0.6%	1.2%	1.8%	2.3%	2.9%	3.5%	4.1%	4.7%
700	0.5%	1.0%	1.5%	2.0%	2.5%	3.0%	3.5%	4.0%
800	0.4%	0.9%	1.3%	1.8%	2.2%	2.6%	3.1%	3.5%
900	0.4%	0.8%	1.2%	1.6%	1.9%	2.3%	2.7%	3.1%
1000	0.4%	0.7%	1.1%	1.4%	1.8%	2.1%	2.5%	2.8%