
Summary


We now give a short summary of the main issues that have been discussed in this thesis.

In chapter 2, we motivated the importance of interval data, especially in the context of temporal databases. Interval data does not work well with many traditional performance-enhancing methods, such as indexes, or performance-critical algorithms, such as those traditionally used for the join operation. The latter were the focus of chapter 3, which surveyed the many join algorithms that have been designed and proposed in the past. Many of these are tuned to perform well under equi-join conditions, as these are the most frequent ones in conventional database processing. In chapter 4, we analysed how these algorithms can be adapted to process temporal joins. Those join algorithms that are not based on explicit and symmetric partitioning required hardly any changes. Hash and parallel join processing have gained increasing significance with the advent of parallel database systems in recent years, especially in the context of data warehousing and data mining. However, these techniques are based on explicitly and symmetrically partitioning the relations that participate in a join. Partitioning interval data differs from partitioning atomic data in that it results in non-disjoint relation fragments, because some intervals overlap the partition's breakpoints. This can cause three types of overhead.
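The replication effect described above can be illustrated with a small sketch. The function below is a hypothetical simplification (the interval representation and fragment bounds are illustrative, not the thesis's definitions): each interval is placed into every fragment whose range it overlaps, so an interval straddling a breakpoint appears in more than one fragment.

```python
def partition(intervals, breakpoints):
    """Assign each (start, end) interval to all fragments it overlaps.

    Fragment i covers [bounds[i], bounds[i+1]), where bounds is the
    breakpoint list extended with -inf/+inf sentinels. Fragments are
    non-disjoint: straddling intervals are replicated.
    """
    bounds = [float("-inf")] + list(breakpoints) + [float("inf")]
    fragments = [[] for _ in range(len(bounds) - 1)]
    for (start, end) in intervals:
        for i in range(len(fragments)):
            lo, hi = bounds[i], bounds[i + 1]
            if start < hi and end >= lo:  # interval overlaps fragment range
                fragments[i].append((start, end))
    return fragments

frags = partition([(1, 3), (2, 8), (9, 10)], breakpoints=[5])
# (2, 8) straddles the breakpoint 5 and is replicated into both fragments
```

Note that the three input intervals yield four fragment entries in total; this replication is precisely the source of the overhead discussed above.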

In a first step, we adapted the hash and parallel join technique so that it avoids the duplicates overhead and reduces the other two. We suspected that there was significant potential for further improving a partitioned temporal join's performance by carefully choosing the partition and thereby reducing the number of replicated tuples. In chapter 5, we looked at this issue from a theoretical point of view. The interval partitioning (IP) problem was defined and its complexity was analysed. The result was that an optimal solution can be computed in polynomial time. We could also relate IP to the sequential graph partitioning (SGP) problem. This confirmed the initial result and allows us, in the future, to take advantage of the many theoretical and practical results that have been obtained in the context of graph partitioning. The analytical part of the thesis was concluded in chapter 6, where the results were merged into the design of a process for the optimisation of partitioned temporal joins which consists of four stages (see figure 6.1); these stages were elaborated in the following chapters.

In chapter 7, IP-tables were introduced as a new type of metadata structure. The information stored in an IP-table allows us to determine many parameters that influence processing performance. We looked at several issues concerning the usage of IP-tables, such as their sizes and measures by which these can be reduced, the process of merging two or more IP-tables into one, and finally how the information in the IP-tables can be maintained.
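To convey the flavour of such a structure, the following sketch builds a toy IP-table for a relation's intervals. The field names (`starting`, `ending`, `spanning`) are illustrative assumptions, not the thesis's exact definition; the key idea is that, for each candidate breakpoint, the table records how many intervals would be affected by choosing it.

```python
def build_ip_table(intervals, points):
    """For each candidate breakpoint, count the intervals that start at it,
    end at it, or span it (spanning intervals would be replicated if the
    point were chosen as a breakpoint). Field names are hypothetical."""
    table = []
    for p in points:
        starting = sum(1 for (s, e) in intervals if s == p)
        ending   = sum(1 for (s, e) in intervals if e == p)
        spanning = sum(1 for (s, e) in intervals if s < p < e)
        table.append({"point": p, "starting": starting,
                      "ending": ending, "spanning": spanning})
    return table

t = build_ip_table([(1, 4), (2, 6), (4, 9)], points=[4])
# at point 4: one interval starts there, one ends there, one spans it
```

Such counts can be gathered in a single scan of the relation and maintained incrementally as tuples are inserted or deleted.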

In chapter 8, we created a performance model for temporal join processing. This was divided into three steps. First, we created a model of the hardware architectures on which parallel and sequential database systems might run. It had to be general enough to embrace the large number of differing architectures that have been proposed and introduced. Second, we described how a temporal join is processed on top of this architectural model. Finally, we were able to derive a detailed cost model for processing a partitioned temporal join in parallel or sequentially. It takes advantage of the data provided by the IP-tables of the participating relations.
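A toy version of such a cost model might look as follows. This is an illustrative simplification, not the thesis's detailed model: the cost constants are hypothetical placeholders, and the per-fragment cost formula (scan plus nested-loop comparison) is merely one plausible choice. The structural point it shows is that, with one fragment per processor, the elapsed time of a parallel join is dominated by the most heavily loaded processor.

```python
IO_COST  = 1.0   # assumed cost of reading/writing one tuple
CMP_COST = 0.01  # assumed cost of one tuple comparison

def join_cost(fragment_sizes_r, fragment_sizes_s):
    """fragment_sizes_*[i] = number of tuples (including replicas) in
    fragment i. Returns the elapsed cost: fragments run in parallel,
    so the slowest fragment determines the total."""
    per_fragment = [
        IO_COST * (r + s) + CMP_COST * r * s  # scan + nested-loop compare
        for r, s in zip(fragment_sizes_r, fragment_sizes_s)
    ]
    return max(per_fragment)

cost = join_cost([10, 10], [10, 20])  # second fragment dominates
```

Because replicated tuples inflate the fragment sizes, a skewed or replication-heavy partition directly worsens the maximum, which is why the choice of breakpoints matters.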

In chapter 9, three families of partitioning strategies were described: the uniform, underflow and minimum-overlaps strategies. All of them can be efficiently implemented on the basis of IP-tables. Each strategy emphasises a certain goal, such as simplicity or reducing one or more performance-critical parameters. There are many possible variations of these strategies. We presented one such variation, black-out preprocessing, which preprocesses the IP-tables involved in the partitioning process and eliminates potentially bad breakpoint candidates.
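As an illustration of the minimum-overlaps idea, the sketch below selects breakpoints from an IP-table-style list of (point, spanning count) pairs. The greedy selection is a hypothetical simplification of the strategy family, intended only to show how the IP-table's overlap counts drive the choice.

```python
def min_overlaps_breakpoints(ip_table, k):
    """ip_table: list of (point, spanning_count) pairs.
    Pick the k candidate points spanned by the fewest intervals,
    i.e. those that would cause the least tuple replication."""
    ranked = sorted(ip_table, key=lambda entry: entry[1])  # fewest overlaps first
    chosen = [point for point, _ in ranked[:k]]
    return sorted(chosen)

bps = min_overlaps_breakpoints([(2, 5), (4, 1), (7, 0), (9, 3)], k=2)
# picks points 7 (0 spanning intervals) and 4 (1) -> [4, 7]
```

A black-out style preprocessing step would simply remove candidates whose spanning counts exceed some threshold before this selection runs.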

In chapter 10, a thorough evaluation of the optimisation process was provided, using both a parallel and a single-processor architecture. The experiments revealed the advantages and disadvantages of the partitioning strategies and also provided useful guidance on choosing suitable values for the strategies' input parameters.

Finally, in chapter 11, we showed that IP-tables are not only suited to partitioning purposes but have a much wider scope. They can be used to calculate exactly or, at least, to estimate the sizes of temporal join results. Such information is required by many query optimisation modules when making optimisation decisions. The main advantage of the IP-table-based approach is that it does not require deep insight into the possibly complex statistical properties of the underlying temporal application. Such insight is necessary for the selectivity estimation methods that have previously been proposed.
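The quantity being estimated here is the number of interval pairs that intersect. The sketch below computes that number directly by enumeration; the thesis derives such counts from IP-tables rather than from the raw tuples, but the direct computation (a hypothetical reference implementation, using closed intervals) conveys what the result-size figure means.

```python
def overlap_join_size(r_intervals, s_intervals):
    """Count pairs (r, s) whose closed intervals intersect.
    Two intervals [rs, re] and [ss, se] overlap iff rs <= se and ss <= re."""
    count = 0
    for (rs, re) in r_intervals:
        for (ss, se) in s_intervals:
            if rs <= se and ss <= re:
                count += 1
    return count

n = overlap_join_size([(1, 3), (5, 9)], [(2, 6)])
# (1,3) overlaps (2,6), and (5,9) overlaps (2,6) -> 2 result tuples
```

An optimiser only needs this count, not the pairs themselves, which is why aggregate structures such as IP-tables suffice for the estimation.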



Thomas Zurek