
Summary

In this chapter, we have shown an analytical way of calculating temporal join result sizes or, equivalently, temporal join selectivities. To our knowledge, only one paper has discussed selectivity estimation for temporal joins [Segev et al., 1993]. Its approach requires that the statistical process that creates the timestamps either is well understood or follows certain standard probability distributions, such as the Poisson distribution for interval startpoints or the Erlang-n distribution for interval lengths. The first case is quite rare: consider the distribution and lengths of telephone calls, which depend on many statistical processes that are influenced by holidays, pricing, marketing or TV campaigns, and even the weather. It is difficult to incorporate all these effects into a thorough statistical model for a query optimiser. In the second case, the assumed distributions can be wrong for the same reasons.

In contrast, our technique is based on the information stored in IP-tables. For a set of elementary temporal joins, exact result sizes can be computed (section 11.3.1). For temporal joins that arise from a composition of the elementary join conditions, we gave formulas (11.3) and (11.4). These allow the result sizes of composite temporal joins to be derived from those of the elementary joins involved (section 11.3.2). Finally, we also provided a way to calculate the result sizes of partial temporal joins that occur in parallel join processing (section 11.3.3).
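As a rough illustration of why an exact result size can be obtained from endpoint information alone, without any statistical model, the following sketch counts the pairs of an overlap join by subtracting the disjoint pairs from the cross product. It is not the IP-table computation of section 11.3.1, whose formulas are not reproduced here; the function names `overlap_join_size` and `overlap_join_selectivity`, the tuple representation, and the closed-interval convention are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right


def overlap_join_size(r, s):
    """Exact number of pairs (a, b), a in r and b in s, whose closed
    intervals (start, end) share at least one point.

    Hypothetical helper for illustration only; it uses sorted endpoint
    lists, not the IP-tables of this chapter.
    """
    s_starts = sorted(start for start, _ in s)
    s_ends = sorted(end for _, end in s)
    disjoint = 0
    for start, end in r:
        # s-intervals that end strictly before this r-interval starts
        disjoint += bisect_left(s_ends, start)
        # s-intervals that start strictly after this r-interval ends
        disjoint += len(s_starts) - bisect_right(s_starts, end)
    # Every pair is either overlapping or disjoint in exactly one of the
    # two ways counted above.
    return len(r) * len(s) - disjoint


def overlap_join_selectivity(r, s):
    """Result size divided by the size of the cross product."""
    if not r or not s:
        return 0.0
    return overlap_join_size(r, s) / (len(r) * len(s))


if __name__ == "__main__":
    r = [(1, 5), (6, 9)]
    s = [(2, 3), (5, 6), (10, 12)]
    print(overlap_join_size(r, s))
    print(overlap_join_selectivity(r, s))
```

The count is exact for any input, whatever process generated the timestamps, which is the property the analytical approach relies on.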

The advantages of our analytical approach over statistical ones are:

Most results are exact rather than estimates.
The calculations take into account that timestamps are often the result of a variety of interfering statistical processes.
They are also sensitive to the fact that these statistical processes can change over time (see the telephone call example). This property, for example, allowed us to derive the result sizes of partial joins in parallel processing.
The approach can be applied to all types of temporal data, regardless of the underlying semantics and their implications for statistical modeling. One does not have to analyse the nature of the temporal data or the underlying statistical processes in order to estimate result sizes, but can work on a purely analytical basis.


Thomas Zurek