Parallel and Other Partitioned Joins

Next: Summary Up: Size and Selectivity Calculations Previous: Composite Joins

Parallel and Other Partitioned Joins

In this thesis, we have been focusing on temporal joins that are processed by symmetrically partitioning the participating relations in order to create a number of smaller and independent joins. This is based on the algebraic expression

$\begin{displaymath} R \Join_{\scriptscriptstyle C}Q \;=\; R_1 \Join_{\scriptscri... ..._1 \,\cup\, \cdots \,\cup\, R_m \Join_{\scriptscriptstyle C}Q_m\end{displaymath}$ (67)

where equally indexed fragments R_k and Q_k are created in such a way that they hold tuples whose timestamps can possibly join with each other ( $k=1,\dots,m$ ). In the case of the temporal intersection join , for example, this means that an R_k holds those tuples whose timestamps intersect with a certain range of the timeline; the corresponding Q_k is created in the same way.

With respect to join selectivity, this creates a new problem: a fragment R_k represents a number of tuples of R that have been selected according to certain rules. Therefore, an R_k does not necessarily have the same statistical properties as R. Thus, if we want to estimate the join selectivity of a partial join $R_k \Join_{\scriptscriptstyle C}Q_k$ we cannot use statistical approaches such as in [Segev et al., 1993] because they assume the same statistical properties for all fragments. To see why this is not necessarily the case, consider again the phone calls example mentioned in sections 2.1, 7.3.1 and 9.1.4. A fragment R_k can, for example, hold phone calls made during Christmas time or during a period with a promotional offer of cheap rates or during any other type of period which causes a different consumer behaviour. This means that R_k is likely to have widely different properties than many other R_j with or R itself. However, one has to keep in mind that the partial joins in (11.6) are processed in parallel. Therefore it is the most expensive one among these that determines the overall performance. Thus, if there is at least one partial join whose result size ``gets out of hand'' for the reasons mentioned above then this translates into an immediate performance penalty. In order to avoid such a situation one requires a method that can cope with statistical properties that vary over time.

Our analytical approach can tackle this problem. Consider, for example, a partial temporal intersection join where R_k and Q_k hold tuples from R and Q, respectively, whose timestamps intersect with a time period that runs from time x to time y. Let

i.e. j_x and j_y are the indices of the start- or endpoints within the set that are closest to x and y, respectively, but inside the range between x and y, i.e. .

At the beginning, at time x, we have those tuples whose timestamps start before time x. There are and of these in R_k and Q_k respectively. All their timestamps intersect. Thus all their combinations are in the result of . Furthermore and similar to the derivation of equation (11.5), there are those tuples in R_k whose timestamps start between x and y. These join with those tuples in Q_k that intersect the respective startpoint. Therefore the result size of the partial intersection join is given by

Next: Summary Up: Size and Selectivity Calculations Previous: Composite Joins

Thomas Zurek