Next: Performance Model Up: IP-Tables Previous: Merging Complete and Incomplete

Histograms and IP-Tables

Query optimisers in a DBMS employ statistical profiles of the data. Such statistical profiles are complex objects that contain quantitative descriptors. One form of descriptor is a histogram. They form part of the family of non-parametric methods to estimate the frequency distribution of attribute values [Mannino et al., 1988].

In order to elaborate similarities between IP-tables and histograms we want to introduce types of histograms by an example. Let us consider the distribution of values of an attribute `age' in figure 7.20: the first column contains the attribute values, the second the frequency of that value as it appears in the `age' attribute of some relation, and the third column contains the cumulative of the frequencies. Instead of storing the entire and precise frequency distribution (as shown in columns 1 and 2 of figure 7.20) it is reasonable to store the frequencies for ranges of values. The result of this process is called a histogram. Essentially, there are three types of histograms that have been proposed in the literature:

Equal-width histograms
These histograms use equally sized value ranges, i.e. value buckets of equal widths. Figure 7.21 shows an equal-width histogram for the example of figure 7.20. It uses ranges that comprise five of the original attribute values each.
Equal-height histograms
These histograms use ranges such that each frequency is the same. In the example of figure 7.20 there are 100 values. If we assume that we want to create four buckets then each bucket should contain 25 values. We can imagine that (theoretically) this is done by sorting the relation over the age attribute. Then four buckets are created by putting the first 25 tuples into the first bucket, the next 25 into the second and so forth. Thus the first bucket would have ages 20 to 28, the second would only have ages of 28, the third ages between 29 and 34 and the final one ages from 34 to 39 (see figure 7.22). For advantages of equal-height over equal-width histograms please refer to [Piatetsky-Shapiro and Connell, 1984] or [Mannino et al., 1988].
Variable-width histograms
Several researchers suggested that widths are set so that the values within each bucket are approximately uniformly distributed. This improves the accuracy of the selectivity estimations. Figure 7.23 shows a variable-width histogram for the example of figure 7.20.

Now we want to compare histograms and IP-table. To that end, we can consider $s_{\scriptscriptstyle R}(t)$ as a frequency distribution of the startpoints of timestamp intervals occuring in relation R. $o_{\scriptscriptstyle R}(t)$ can be regarded as an overlap frequency distribution. In that sense, a complete IP-table corresponds to two frequency distributions (that of $s_{\scriptscriptstyle R}(t)$ and that of $o_{\scriptscriptstyle R}(t)$ ), similar to the one shown in figure 7.20 for the atomic age attribute. The condensation process for IP-tables compares to the creation of value ranges - these correspond to the collapsed timepoints - of equal widths as the same number of timepoints are collapsed into one each time. Accordingly, we sum up the frequencies $s_{\scriptscriptstyle R}(t)$ for the individual ranges. However, we cannot treat the $o_{\scriptscriptstyle R}(t)$ values as frequencies in this case. As we have seen in the discussion of condensation (section 7.3.3) we keep the $o_{\scriptscriptstyle R}(t)$ value for the maximum t value of each range.

In summary this means that there are obvious similarities between IP-tables and histograms. In fact, one can consider to apply some of the methods for condensing or compressing frequency distributions into histograms to (complete) IP-tables too. However, many of the compressing methods for histograms were designed having the usage of histograms for selectivity estimation in mind. The type of variable-width histogram that we described above is an obvious example for that. However, the main purpose of IP-tables - at least in the context of this work - is partitioning rather than selectivity estimation. Therefore one needs to consider carefully whether the compressing methods that are beneficial in the case of selectivity estimation are equally favourable in the case of partitioning. A second important issue is that ranges (or buckets) for histograms are created in order to compress one single frequency distribution. In the case of IP-tables one has to consider two frequency distribution which in itself have to be treated differently as outlined in the previous paragraph.

The relationship between IP-tables and histograms and the possible application of histogram techniques is certainly an interesting topic for future research. Analysing this relationship here would lead away from our main goal which is to investigate the suitability of IP-tables for efficiently supporting the optimisation of partitioned temporal join processing.

Recent years have seen an extension of histograms beyond atomic data types. One example is this use of histograms in the context of processing multimedia data, such as in [Gong et al., 1996] or [Ng and Tam, 1997]. Other developments focus on specific purposes for which histograms are used. This has led to a variety of histogram types which goes beyond the three basic types that we have outlined above. [Poosala et al., 1996] provide a taxonomy for previously and recently proposed histograms. Furthermore there are papers that look at several aspects of histogram processing, such as error propagation [Ioannidis and Christodoulakis, 1993] or histogram maintenance [Gibbons et al., 1997]. As we already mentioned above, we can expect several results of this research being valuable in the context of IP-tables too.

**Figure:** An example of an attribute value frequency distribution.

**Figure:** An equal-width histogram for the distribution of figure 7.20.

**Figure:** An equal-height histogram for the distribution of figure 7.20.

**Figure:** A variable-width histogram for the distribution of figure 7.20.

Next: Performance Model Up: IP-Tables Previous: Merging Complete and Incomplete

Thomas Zurek