Simple Temporal Hash Join

Next: Improved Temporal Hash Join Up: Explicit-Partitioning Join Algorithms Previous: Overview

Simple Temporal Hash Join

The symmetric partitioning approach, as outlined by equation (3.6)

$\begin{displaymath} R \Join_{\scriptscriptstyle C}Q \;=\; R_1 \Join_{\scriptscri... ..._1 \,\cup\, \cdots \,\cup\, R_m \Join_{\scriptscriptstyle C}Q_m\end{displaymath}$

can be adapted for processing temporal intersection joins. We assume that the partial joins are to be kept independent thus allowing concurrent processing. Then, the fragments cannot be created by hashing the tuples over their interval timestamps : assume that there was a hash function h such that intersecting intervals are assigned to the same fragment number. Assume also that there are at least two different fragments per relation to be created. Let [x_s,x_e] and [y_s,y_e] be two non-intersecting intervals, i.e. $x_s \le x_e < y_s \le y_e$ , which are assigned to two different fragments i and j by h, i.e.

$\begin{displaymath} h(x_s,x_e) = i \;\;\;\;\not=\;\;\;\; j = h(y_s,y_e)\end{displaymath}$

Now consider the interval [x_e,y_s] which should fall into fragment i because it intersects with [x_s,x_e] and also into fragment j because it intersects with [y_s,y_e]. Thus h would have to assign two different values, i and j, for [x_e,y_s] which contradicts its notion of being a function.

Nevertheless, we can employ range partitioning as a variation of hash partitioning. Here, tuples are partitioned over their interval timestamp in the following way: the time line is divided into m disjoint ranges which are numbered according to their order on the time line. If a relation, say R, is partitioned then a tuple $r \in R$ is put into R_k if its timestamp interval intersects with the k-th range. We note that a tuple can be put into several fragments because its timestamp interval might intersect with more than one time range.

In general, it is impossible to create disjoint fragments for interval data if a temporal relation R has a long `chain' of intervals, i.e. [r₁.t_s,r₁.t_e], ..., [r_n.t_s,r_n.t_e] such that subsequent intervals intersect, i.e.

$\begin{displaymath}[r_i.t_s,r_i.t_e] \cap [r_{i+1}.t_s,r_{i+1}.t_e] \not= \emptyset\end{displaymath}$

Then it is impossible to put these intervals into different fragments without assigning at least one of them to two different fragments: if [r₁.t_s,r₁.t_e] goes to fragment k then [r₂.t_s,r₂.t_e] has to go to fragment k, too, because it intersects with [r₁.t_s,r₁.t_e]. The same applies to [r₃.t_s,r₃.t_e] because it intersects with [r₂.t_s,r₂.t_e], and so forth. However, our intention is to partition the data in order to reduce the processing effort and to improve resource utilisation (disk I/O, main memory). It is therefore out of the question to put all the n intervals into the same fragment if n is very large; in general, it is likely that n is close or equal to |R|. In this case, the `chain' has to be broken into two or more parts which forces one or more intervals to be put into more than one fragment.

If tuple replication is tolerated as a necessary evil, one can apply the symmetric partitioning technique as described in section 3.5.2. Figure 4.8 shows the search strategy of the technique when the partial joins are processed as nested-loops. It also shows two major problems related to the processing overhead and the duplicates overhead :

Tuple replication causes repetitions of possible tuple comparisons. The net effect of this is the same as if the search space is increased. The fragments themselves are larger. In figure 4.8, 132 tuple comparisons are performed. When comparing this number with the 182 of the nested-loop join (figure 4.2) we recognise that the partitioning effort is not as successful as one might have hoped.
Some of the successful tuple comparisons are duplicated too. These cause duplicate tuples in the join result. In figure 4.8, these comparisons are marked in black. They occur, say, in the k-th partial join $R_k \Join_{\scriptscriptstyle C}Q_k$ , and are actually unnecessary because they are also performed by another partial join $R_i \Join_{\scriptscriptstyle C}Q_i$ with i<k. In many situations, duplicates have to be removed from the join result by expensive operations. This increases the overall overhead furthermore.

In the following section, we will tackle these problems by refining the simple approach presented here.

**Figure:** Search strategy for the simple partitioned temporal join (partial joins as nested-loops).
$\begin{figure} \epsfxsize=0.9\textwidth \centerline{ \epsffile{/home/tz/work/thesis/fig/tj-simple.ps}} \centerline{ } \end{figure}$

Next: Improved Temporal Hash Join Up: Explicit-Partitioning Join Algorithms Previous: Overview

Thomas Zurek