Improved Temporal Hash Join

Next: Partitioned Temporal Join for Up: Explicit-Partitioning Join Algorithms Previous: Simple Temporal Hash Join

Improved Temporal Hash Join

Above, we identified successful but replicated tuple comparisons as a major problem of the simple approach. A first measure could be to extend the tuple comparisons by checking if the respective comparison occurs somewhere else. This avoids duplicates in the result but does not avoid the actual unnecessary tuple comparisons. Thus, it still produces a processing overhead. Consequently, replicated tuple comparisons should be avoided all together.

For that purpose, we observe the following: consider two tuples $r \in R$ and $q \in Q$ with respective timestamps [r.t_s,r.t_e] and [q.t_s,q.t_e]. Assume that the timestamps intersect and therefore that r and q are compared with each other at some stage of the computation. If r.t_s falls in the i-th and q.t_s in the j-th range of the time domain then they are compared in the partial join $R_{\max\{i,j\}} \Join_{\scriptscriptstyle C}Q_{\max\{i,j\}}$ for the first time. Possibly, they are compared again in subsequent partial joins.

In order to avoid this, a fragment, say R_k, can be divided into two disjoint subfragments, R'_k and R''_k. R'_k holds the tuples whose timestamp startpoint falls into the k-th range - these are called the primary tuples - and R''_k holds tuples that are already in some fragment R_j with j<k - these are called the replicated tuples , i.e. A partial join then looks like this

However, the join $R''_k \Join_{\scriptscriptstyle C}Q''_k$ comprises exactly those unnecessary tuple comparisons that we have identified above; figure 4.9 illustrates this for the partial join $R_3 \Join_{\scriptscriptstyle C}Q_3$ of the example of figure 4.8. Processing can therefore be restricted to the first three joins in (4.2). We note that this restriction applies only when the entire join $R \Join_{\scriptscriptstyle C}Q$ is computed; for getting the result of $R_k \Join_{\scriptscriptstyle C}Q_k$ we still require $R''_k \Join_{\scriptscriptstyle C}Q''_k$ because

$\begin{displaymath} R'_k \Join_{\scriptscriptstyle C}Q'_k \,\cup\, R'_k \Join_{\... ...C}Q'_k \;\; \subsetneq \;\; R_k \Join_{\scriptscriptstyle C}Q_k\end{displaymath}$

if $R''_k \not= \emptyset$ and $Q''_k \not= \emptyset$ . We note that $R''_k \overset <4096\gt\gt\mbox{\scriptsize int}{{\Join}Q''_k \;=\; R''_k \Join_{\text{true}} Q''_k$ due to the definition of R''_k and Q''_k. In total, $R \Join_{\scriptscriptstyle C}Q$ can be decomposed as shown in figure 4.10. This leads to the search strategy shown in figure 4.11.

**Figure:** Illustration of equation (4.2) for $R_3 \Join_{\scriptscriptstyle C}Q_3$ in figure 4.8.
$\begin{figure} \epsfxsize=0.9\textwidth \centerline{ \epsffile{/home/tz/work/thesis/fig/tj-r3.ps}} \centerline{ } \end{figure}$

**Figure:** Improved partitioning for computing a temporal intersection join.
$\begin{figure} \begin{center} \fbox {$R \Join_{\scriptscriptstyle C}Q$} \\ \vs... ..._{\scriptscriptstyle C}Q'_m$\end{center}\end{minipage}} \end{center}\end{figure}$

**Figure:** Search strategy for the improved partitioned temporal join.
$\begin{figure} \epsfxsize=0.9\textwidth \centerline{ \epsffile{/home/tz/work/thesis/fig/tj-imp-1.ps}} \centerline{ } \end{figure}$

A second optimisation is based on the following observation: if the remaining three joins in (4.2) are processed sequentially in one of the orders

1.: $R'_k \Join Q''_k \;,\; R'_k \Join Q'_k \;,\; R''_k \Join Q'_k$ or
2.: $R''_k \Join Q'_k \;,\; R'_k \Join Q'_k \;,\; R'_k \Join Q''_k$

and R'_k and Q'_k fit, respectively, into the local main memory of the processing node, then we can avoid unnecessary accesses to secondary storage. We can, however, enforce this situation because the sizes of the R'_k and Q'_k are much easier to predict and to control than those of the R''_k and Q''_k (and consequently those of R_k and Q_k): each tuple of R and Q appears exactly in one R'_k and Q'_k but might appear in several R''_k or Q''_k, respectively. The assignment of a tuple to a certain R'_k (Q'_k resp.) only depends on the value of the startpoint of the timestamp . This, however, is nothing else than partitioning atomic values as in the case of an equi-join. In the simplest case, the number of fragments m can be chosen high enough to fit the R'_k and Q'_k into main memory. More sophisticated techniques might be necessary if the startpoint values are heavily skewed , see e.g. [Hua and Lee, 1991] or [DeWitt et al., 1992].

Next: Partitioned Temporal Join for Up: Explicit-Partitioning Join Algorithms Previous: Simple Temporal Hash Join

Thomas Zurek