
Optimisation Process

In this chapter, we build a bridge between

(a)
the analytical part of this thesis, which is formed by chapters 2 to 5 and which introduces, motivates, defines and analyses the problem of processing partitioned temporal joins, and
(b)
the synthetic part, which is formed by the following chapters and which is oriented towards a practically applicable and efficient solution for partitioned temporal join processing.
To that end, this chapter summarises the main results of the analysis of part (a) and uses them to design an approach to optimising temporal joins that is based on explicit partitioning. The elaboration of this approach is presented in the following chapters.

First, we want to focus on the main conclusions that we can draw from what has been discussed so far. In chapter 2, the importance and significance of temporal databases for many applications was motivated. One obstacle to the incorporation of temporal features into commercial products is the poor performance of operations involving temporal data. One performance-critical operator is the temporal join. In chapter 3, we looked at algorithms that are traditionally used for joins involving an equality join condition. This is the most frequent situation in conventional join processing, and most algorithmic techniques have been tuned to perform well in these cases. In this context, explicit partitioning of the data has frequently proved to give the best performance results.

In chapter 4, we analysed whether and how the techniques that are used for processing equi-joins can be applied to processing temporal joins. In most cases, this transfer was straightforward. Techniques that are based on explicit partitioning, however, turned out to be tricky: although they can still be expected to be amongst the most efficient, they impose a significant overhead as tuples have to be replicated between the relation fragments. The rate of tuple replication depends (i) on the characteristics of the temporal data and (ii) on the choice of the partition that is used for creating the fragments. While we cannot do anything about (i), we have seen that the choice (ii) of an appropriate partition is a delicate one.

In chapter 5, we looked at this choice in more detail and analysed the complexity of the problem of finding an optimal partition. Optimal means that the partition should minimise the total number of tuple replications while creating fragments that do not exceed a certain maximum size. It was shown that this problem has a polynomial solution: there is an algorithm IP-opt with a run-time complexity of $O(N^2)$, where N is the number of different start and end points occurring in the relation(s) that are to be partitioned. In practical terms, IP-opt is likely to be too inefficient as N is usually very large. This gives rise to the need for heuristic partitioning strategies which are more efficient, at the expense of creating only semi-optimal, rather than optimal, partitions.
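To make the optimisation objective concrete, the following sketch (in Python, purely for illustration; it is not the IP-opt algorithm of chapter 5) counts the fragment loads and tuple replications that a given candidate partition would cause. The interval representation, the breakpoint encoding of the partition and the function names are assumptions made for this example.

  from bisect import bisect_left

  def fragment_loads(intervals, breakpoints):
      # A partition is encoded by its sorted interior breakpoints
      # p_1 < ... < p_{n-1}; they induce the n ranges
      # (-inf, p_1], (p_1, p_2], ..., (p_{n-1}, +inf).
      # A tuple with interval [s, e] is placed into every range it overlaps.
      loads = [0] * (len(breakpoints) + 1)
      for s, e in intervals:
          first = bisect_left(breakpoints, s)   # range containing s
          last = bisect_left(breakpoints, e)    # range containing e
          for k in range(first, last + 1):
              loads[k] += 1
      return loads

  def replications(intervals, breakpoints):
      # Extra tuple copies caused by the partition: every assignment
      # beyond a tuple's first one is a replication.
      return sum(fragment_loads(intervals, breakpoints)) - len(intervals)

  def admissible(intervals, breakpoints, max_size):
      # The partition must not let any fragment exceed the maximum size.
      return max(fragment_loads(intervals, breakpoints)) <= max_size

An optimal partition in the sense of chapter 5 is then one that minimises replications() among all partitions for which admissible() holds.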

Such heuristic partitioning strategies form part of a wider optimisation approach. The idea is that a query optimiser can choose the cheapest partition among those produced by various partitioning strategies. In order to determine which partition is the cheapest, we require a cost model of the respective temporal join processing technique. This cost model has to consider the chosen partition, the characteristics of the temporal data and the parameters of the system on which the join is processed.
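A minimal sketch of this idea, assuming hypothetical interfaces for the partitioning strategies and the cost model (these names do not appear elsewhere in the thesis), could look as follows:

  def choose_partition(strategies, cost_model, data_info, system_params):
      # Run every partitioning strategy, cost its partition with the
      # cost model and keep the cheapest one.
      best_partition, best_cost = None, float("inf")
      for strategy in strategies:
          partition = strategy(data_info, system_params)            # stage 2
          cost = cost_model(partition, data_info, system_params)    # stage 3
          if cost < best_cost:                                      # stage 4
              best_partition, best_cost = partition, cost
      return best_partition, best_cost

Here data_info stands for whatever the analysis of the temporal relations produces (the relations themselves, a data sample or catalog meta-information); the four stages referred to in the comments are described below.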

Figure 6.1 summarises the approach that we propose. It shows the dataflow in the optimisation process for a partitioned temporal join between two temporal relations R and Q. Data is represented as rectangles, computation as ovals. The entire process consists of four stages corresponding to the four grey boxes in figure 6.1:
1.
Firstly, the temporal relations have to be analysed to acquire information about the structure and characteristics of the temporal data. In its simplest form, this information can be represented by the temporal relations themselves, but this is not very practical. Alternatively, a data sample can be drawn from the relations. This sample must be large enough to properly represent the characteristics of the data from which it was drawn; the necessary size of such a sample can be determined using the Kolmogorov test statistic [Conover, 1980] (a small illustration of the required sample size is sketched after this list). We will return to this issue later. A data sampling approach has been used in the context of band joins in [DeWitt et al., 1991] and for temporal joins in [Soo et al., 1994].

A further possibility is to use meta-information on the data which might be stored in the database catalog. We follow this approach and define IP-tables for this purpose; chapter 7 discusses them in more detail.

2.
Based on the information acquired in stage 1 and the system parameters, several strategies can be applied to find suitable partitions for the temporal data. Figure 6.1 assumes that there are three strategies to choose from; in practice there will be more. In chapter 9, we will design and propose several such heuristic partitioning strategies.

3.
Using the partitions and the information on the temporal data, performance-determining parameters, such as the loads of the fragments $R_k$ and $Q_k$, can be approximated (e.g. when using data samples; a load-estimation sketch follows this list) or calculated exactly (e.g. when using complete information on $s_{\scriptscriptstyle R}(t)$, $s_{\scriptscriptstyle Q}(t)$, $o_{\scriptscriptstyle R}(t)$ and $o_{\scriptscriptstyle Q}(t)$). These parameters are then fed into a cost model which derives the processing costs of the partitioned temporal join based on the respective partition and the current system parameters.

A crucial part of this stage is the performance model for the respective partitioned temporal join processing technique. In reality, there will be many such techniques and, consequently, just as many cost models. Frequently, these techniques will be adapted to a target (sequential or parallel) hardware platform. Hence, there is no single, generally usable cost model for partitioned temporal join processing but several. In chapter 8, we will model the performance of a sequential and a parallel technique and try to be as general as possible with respect to assumptions about the underlying hardware. These two cost models will also be used for evaluation purposes in chapter 10.

4.
Finally, an optimisation decision can be taken and the cheapest partition is chosen.  
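As an illustration of the sample-size question raised in stage 1: for the one-sample Kolmogorov statistic, the critical value for a sample of size n is asymptotically approximately $c(\alpha)/\sqrt{n}$, so a sample whose empirical distribution is to deviate from the true distribution of the interval start and end points by at most $\varepsilon$ (with confidence $1-\alpha$) needs roughly $(c(\alpha)/\varepsilon)^2$ tuples. The following sketch uses the standard asymptotic coefficients and should be read as an approximation, not as the exact procedure of [Conover, 1980].

  import math

  # Asymptotic coefficients c(alpha) of the one-sample Kolmogorov statistic;
  # the critical value for sample size n is approximately c(alpha) / sqrt(n).
  KOLMOGOROV_COEFF = {0.10: 1.22, 0.05: 1.36, 0.01: 1.63}

  def required_sample_size(epsilon, alpha=0.05):
      # Smallest n such that, with confidence 1 - alpha, the empirical
      # distribution of the sampled start/end points deviates from the
      # true distribution by at most epsilon.
      c = KOLMOGOROV_COEFF[alpha]
      return math.ceil((c / epsilon) ** 2)

  # e.g. required_sample_size(0.01) needs about 18,500 tuples,
  # independently of the size of the relation -- which is what makes
  # sampling attractive for stage 1.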
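For stage 3, the fragment loads can be approximated from a data sample by counting, per fragment, the sample tuples whose intervals overlap the fragment's range and scaling up to the full relation. A small sketch, reusing the hypothetical fragment_loads function introduced earlier in this chapter:

  def estimated_loads(sample_intervals, breakpoints, relation_size):
      # Scale the fragment loads observed in the sample up to the size
      # of the full relation; the result approximates |R_k| (or |Q_k|).
      scale = relation_size / len(sample_intervals)
      return [scale * load
              for load in fragment_loads(sample_intervals, breakpoints)]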

When looking at the dataflow in figure 6.1, we note that the optimisation process itself is highly parallel: each partitioning strategy initiates an independent thread. Only the final stage is a point of synchronisation, at which the results of the threads are analysed and the optimisation decision is taken.
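A sketch of this parallel structure, again under the hypothetical interfaces used above (Python threads chosen only for illustration):

  from concurrent.futures import ThreadPoolExecutor

  def choose_partition_parallel(strategies, cost_model, data_info, system_params):
      # Each strategy is evaluated in its own thread; the final min() is
      # the single synchronisation point of the optimisation process.
      def evaluate(strategy):
          partition = strategy(data_info, system_params)
          return cost_model(partition, data_info, system_params), partition

      with ThreadPoolExecutor() as pool:
          results = list(pool.map(evaluate, strategies))
      best_cost, best_partition = min(results, key=lambda r: r[0])
      return best_partition, best_cost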

In the following chapters, we will use the following simplified version of figure 6.1 to guide the reader through the optimisation approach by highlighting the stage that is currently being dealt with.

Figure: Structure of the optimisation process.


