In this section, we look at the realistic size of an IP-table and how it compares to sizes of data samples that are used in a typical data sampling approach. The ratio of the sizes of an IP-table I(R) and its corresponding temporal relation R can be calculated by
(25) |
The ratio (7.3) has to be compared to ratios achieved when sampling data. To that end, we want to look at one such example, i.e. the approach taken in [Soo et al., 1994]. It uses the Kolmogorov test statistic [Conover, 1980] which is frequently employed in data sampling approaches for query optimisation, e.g. in [DeWitt et al., 1991]. The Kolmogorov test is non-parametric which means that it does not make any assumptions about the underlying distributions of the tuples. Soo et al. conclude that one has to draw a sample whose size is determined by
(26) |
The ratio |V(R)|:|R| shows how many new elements are contributed to V(R) on average by a tuple r's interval. A ratio of 0.5 indicates that two intervals contribute one new timepoint to V(R), in the case of 1.0 it is one interval adding one timepoint on average and the worst case is 2.0 with each interval introducing two new timepoints (its start- and endpoint) to the plot. Therefore |V(R)|:|R| is an indicator for observing whether there are many tuples in R that share interval start- and endpoints - in this case the ratio is low - or whether most intervals have start- and endpoints that do not appear in other intervals within R - in this case the ratio is high, reaching 2.0 in the worst case when each interval has a start- and an endpoint that does not appear either as a start- or an endpoint in any other interval. Some applications will impose a low ratio, e.g. in the case of a temporal relation holding air pollution figures that are obtained through periodic measurements. Here, many intervals share the timepoints of the measurements as their start- or endpoints. In other situations, such as a temporal relation storing start- and endtimes of telephone calls or computer accesses, we can expect the start- and endpoints of tuple intervals to be arbitrarily distributed over the timeline, therefore possibly causing a higher ratio than in periodic or other regular applications. In section 7.3.2, examples of various real-world temporal relations are analysed and the respective values of the |V(R)|:|R| ratio are given.
Table 7.1 shows typical values for (7.3) depending on the and the ratio |V(R)|:|R|. For most combinations we get values that are at least as good as those achieved by data samples. But recall that an IP-table provides precise information whereas the data sample approach achieves these figures only at the expense of introducing error margins which vary immensely with the sample size.
tuplesize | 8c| | |||||||
---|---|---|---|---|---|---|---|---|
in bytes | 0.25 | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 |
100 | 3.5% | 7.0% | 10.5% | 14.0% | 17.5% | 21.0% | 24.5% | 28.0% |
200 | 1.8% | 3.5% | 5.3% | 7.0% | 8.8% | 10.5% | 12.3% | 14.0% |
300 | 1.2% | 2.3% | 3.5% | 4.7% | 5.8% | 7.0% | 8.2% | 9.3% |
400 | 0.9% | 1.8% | 2.6% | 3.5% | 4.4% | 5.3% | 6.1% | 7.0% |
500 | 0.7% | 1.4% | 2.1% | 2.8% | 3.5% | 4.2% | 4.9% | 5.6% |
600 | 0.6% | 1.2% | 1.8% | 2.3% | 2.9% | 3.5% | 4.1% | 4.7% |
700 | 0.5% | 1.0% | 1.5% | 2.0% | 2.5% | 3.0% | 3.5% | 4.0% |
800 | 0.4% | 0.9% | 1.3% | 1.8% | 2.2% | 2.6% | 3.1% | 3.5% |
900 | 0.4% | 0.8% | 1.2% | 1.6% | 1.9% | 2.3% | 2.7% | 3.1% |
1000 | 0.4% | 0.7% | 1.1% | 1.4% | 1.8% | 2.1% | 2.5% | 2.8% |