The principal contribution of this thesis is the elaboration and
description of a novel way in which partitioned temporal join
processing can be optimised. All parts of the optimisation process
can be made very efficient by using IP-tables. If the algorithms,
e.g. those presented in chapter 9, had to be implemented
on top of a data sampling approach, they would be very inefficient, as most
of them would need to scan the data sample several times. The
theoretical and experimental results provide a basis for enhancing the
optimisation module of a database management system to
cope with partitioned temporal joins.
Apart from this major contribution, a number of further
important results were obtained when investigating various
aspects of this work. They are the following:
- We designed a temporal hash algorithm that (a) avoids the
duplicates overhead, (b) reduces
replication by avoiding unnecessary tuple comparisons and (c)
increases the opportunities for main memory join processing (see
section 4.4.3).
- We showed that the interval partitioning (IP) problem can be
solved in polynomial time. Furthermore, it is related to graph partitioning,
which is a well-investigated problem.
- The IP-table emerged from the analysis of the IP problem. It
proved to be a very versatile metadata structure that describes the
characteristics of the timestamp intervals
found in one or more temporal relations; a sketch of such a structure is
given after this list. We used IP-tables mainly for
interval partitioning purposes but showed that there is a wider scope
for them: they can also be used for estimating the selectivity of temporal
join conditions.
- The IP-table based selectivity estimation has proved to be
more generally applicable than statistical methods that have been
proposed in the past. The latter require a thorough understanding
of the, possibly complex, statistical processes that underlie the
temporal application. Our analytical approach does not.
- The condensation of IP-tables is a very efficient
way of reducing the sizes of the IP-tables, thereby accelerating the
optimisation process. The experimental results on the impact of
condensation on the quality of partitioning were very encouraging.
Condensation factors between 10 and 20 are feasible and hardly
penalise the processing performance. This range of condensation factor
values reduces an IP-table's size to below 0.5% of the corresponding
relation's size (see table 7.1). This is far
below the sizes of data samples that constitute an alternative to an
IP-table based partitioning approach.
- Naive, uniform partitioning results in very poor processing
performance for parallel temporal joins. Costs can be up to three times
higher than with more sophisticated techniques, such as the underflow
and minimum-overlaps strategies. See figure 10.7 or
table 10.6, for example.
- On the single-processor machine, uniform partitioning proved
to be a viable option as long as a large number m of breakpoints
was chosen (see results in section 10.3).
- For parallel temporal join processing, the experimental results
show that a good load balance, even at the
expense of an increased number of replicated tuples, is far more
important than minimising replication. This can be seen from the fact
that the primary underflow strategy produced better performance than
the primary minimum-overlaps strategy in most situations on the
parallel architecture. This result is contrary to our initial
assumption. However, it is encouraging in the sense that
partitioning over an interval attribute is not that severely penalised
and thus can be an alternative to partitioning over an atomic
attribute, e.g. if the latter's values are heavily skewed and would therefore cause a severe load imbalance.
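To make the structure of an IP-table and its condensation concrete, the
sketch below builds both for a toy relation. It assumes that an IP-table
records, for every candidate breakpoint, how many timestamp intervals start
at, end at and strictly span that point; the row layout, the names IPEntry,
build_ip_table and condense, and the way span counts are merged during
condensation are illustrative assumptions rather than the exact definitions
used in this thesis.

from collections import namedtuple

# Hypothetical IP-table row: one entry per candidate breakpoint p, recording
# how many timestamp intervals start at, end at or strictly span p.
IPEntry = namedtuple("IPEntry", ["point", "starts", "ends", "spans"])

def build_ip_table(intervals):
    """Build a toy IP-table from a list of (start, end) timestamp intervals."""
    points = sorted({p for s, e in intervals for p in (s, e)})
    return [IPEntry(p,
                    sum(1 for s, e in intervals if s == p),
                    sum(1 for s, e in intervals if e == p),
                    sum(1 for s, e in intervals if s < p < e))
            for p in points]

def condense(table, factor):
    """Merge every `factor` consecutive rows into one coarser row.
    Start and end counts are summed; the span count of the merged row is
    taken from its first member, one possible approximation that trades a
    little partitioning precision for a much smaller table."""
    condensed = []
    for i in range(0, len(table), factor):
        group = table[i:i + factor]
        condensed.append(IPEntry(group[0].point,
                                 sum(r.starts for r in group),
                                 sum(r.ends for r in group),
                                 group[0].spans))
    return condensed

if __name__ == "__main__":
    ip = build_ip_table([(1, 5), (2, 8), (4, 9), (6, 7)])
    print(condense(ip, 2))   # condensation factor 2 halves the table

In this toy example a condensation factor of 2 halves the number of rows;
the factors of 10 to 20 mentioned above shrink an IP-table accordingly.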
However, there are several issues arising from this work that require
careful consideration and possibly further research in the future.
One of these issues is the efficiency of the maintenance of the
IP-tables. In section 7.4, we were concerned with showing
how IP-tables can be updated; the emphasis was therefore on
feasibility rather than efficiency, and the algorithms in that
section do not claim to be the most efficient ones. In fact, one could
imagine temporal database applications and query situations in which
the overhead that is imposed by IP-table updates might become so
significant that it outweighs the benefits of the IP-tables. It is
still unclear, for example, whether an operational database with
frequent updates to its (temporal) tables would significantly suffer
from the IP-table overhead. More quantitative analysis is required,
either to rule out this possibility or to confirm that such situations
can occur. In the latter case, one might want
to find indicators that identify such problematic situations.
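To illustrate why this overhead could matter, the following deliberately
naive update routine operates on the hypothetical IPEntry representation
used in the sketch above; it is not one of the algorithms of
section 7.4. For simplicity it assumes that the new tuple's start and end
points already appear as breakpoints; even then, every breakpoint strictly
inside the new interval has to be touched, so under frequent insertions the
update cost grows with the number of rows that the new intervals cover.

from bisect import bisect_left
from collections import namedtuple

# Same hypothetical row layout as in the earlier sketch.
IPEntry = namedtuple("IPEntry", ["point", "starts", "ends", "spans"])

def insert_interval(table, start, end):
    """Naive IP-table update after a tuple with timestamp interval
    [start, end] is inserted into the relation.  Assumes both endpoints
    already occur as breakpoints of the (sorted) table; otherwise new
    rows would have to be spliced in first."""
    points = [r.point for r in table]
    i = bisect_left(points, start)
    j = bisect_left(points, end)
    table[i] = table[i]._replace(starts=table[i].starts + 1)
    table[j] = table[j]._replace(ends=table[j].ends + 1)
    for k in range(i + 1, j):
        # the new interval strictly spans every breakpoint between its endpoints
        table[k] = table[k]._replace(spans=table[k].spans + 1)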
Our analytical cost model is a further issue which needs some
validation. In the past, similar approaches have proved to be
valuable for qualitative analysis, e.g. in [Hua et al., 1991]. However,
one cannot say the same about the quantitative aspect. Modern
hardware, especially parallel machines, employs many complex
performance-enhancing mechanisms (such as caching
or special devices that accelerate broadcasts and other typical
communication patterns over the interconnect) which we could hardly
incorporate into our cost model if we wanted to keep it reasonably
general (to allow conclusions to be derived for a wide range of platforms)
and reasonably simple so that it could be used efficiently in a query
optimiser. In other words: we can be reasonably confident that if our
cost model shows that strategy X performs better than strategy Y,
then this effect can be observed on a wide range of implementations.
However, we still need some validation for the absolute numbers: if a
strategy is predicted to incur a particular cost by our cost model, it
remains to be seen how realistic that prediction is. This could
be confirmed by implementing the strategies and the join algorithm on
real hardware or at least by simulating them using one of the
available simulation tools.
Finally, there is another issue that has to be considered carefully:
the cost of the optimisation itself. We gave results of elapsed
times for deriving the costs imposed by the various strategies. These
elapsed times were obtained on a specific machine. For an optimiser
it could be beneficial to have a cost model for the optimisation
process of section 6.1 itself. In that way, it could decide
whether it is worthwhile to consider expensive partitioning
techniques, such as those of the minimum-overlaps family, or whether
simple and fast ones are sufficient, in the interest of saving
optimisation costs.