The principal contribution of this thesis is the elaboration and
description of a novel way in which partitioned temporal join
processing can be optimised. All parts of the optimisation process
can be made very efficient by using IP-tables. If the algorithms,
e.g. those presented in chapter 9, had to be implemented
on top of a data sampling approach, they would be very inefficient, as most
of them would need to scan the data sample several times. The
theoretical and experimental results provide a basis for enhancing the
optimisation module of a database management system to
cope with partitioned temporal joins.
Apart from this major contribution, a number of further
important results were obtained when investigating various
aspects of this work. They are the following:
- We designed a temporal hash algorithm that (a) avoids the
duplicates overhead, (b) reduces
replication by avoiding unnecessary tuple comparisons and (c)
increases the opportunities for main memory join processing (see
section 4.4.3).
- We showed that the interval partitioning (IP) problem can be
solved in polynomial time. Furthermore, it is related to graph partitioning,
which is a well-investigated problem.
- The IP-table emerged from the analysis of the IP problem. It
proved to be a very versatile metadata structure that describes the
characteristics of the timestamp intervals
found in one or more temporal relations; a sketch of such a structure is
given after this list. We used IP-tables mainly for
interval partitioning purposes but showed that there is a wider scope
for them: they can also be used for estimating the selectivity of temporal
join conditions.
- The IP-table based selectivity estimation has proved to be
more generally applicable than statistical methods that have been
proposed in the past. The latter require a thorough understanding
of the, possibly complex, statistical processes that underlie the
temporal application. Our analytical approach does not.
- The condensation of IP-tables is a very efficient
way of reducing the sizes of the IP-tables, thereby accelerating the
optimisation process. The experimental results on the impact of
condensation on the quality of partitioning were very encouraging.
Condensation factors between 10 and 20 are feasible and hardly
penalise the processing performance. This range of condensation factor
values reduces an IP-table's size to below 0.5% of the corresponding
relation's size (see table 7.1). This is far
below the sizes of data samples that constitute an alternative to an
IP-table based partitioning approach.
- Naive, uniform partitioning results in very poor processing
performance for parallel temporal joins. Costs can be up to three times
higher than with more sophisticated techniques, such as the underflow
and minimum-overlaps strategies. See figure 10.7 or
table 10.6, for example.
- On the single-processor machine, uniform partitioning proved
to be a viable option as long as a large number m of breakpoints
was chosen (see results in section 10.3).
- For parallel temporal join processing, the experimental results
show that a good load balance, even at the
expense of an increased number of replicated tuples, is far more
important than minimising replication. This can be seen from the fact
that the primary underflow strategy produced better performance than
the primary minimum-overlaps strategy in most situations on the
parallel architecture. This result is contrary to our initial
assumption. However, it is encouraging in the sense that
partitioning over an interval attribute is not that severely penalised
and thus can be an alternative to partitioning over an atomic
attribute, e.g. if the latter's values are heavily skewed and would therefore cause a severe load imbalance.
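To make the structure of an IP-table and its condensation concrete, the
sketch below builds both for a toy relation. It assumes that an IP-table
records, for every candidate breakpoint, how many timestamp intervals start
at, end at and strictly span that point; the row layout, the names IPEntry,
build_ip_table and condense, and the way span counts are merged during
condensation are illustrative assumptions rather than the exact definitions
used in this thesis.

from collections import namedtuple

# Hypothetical IP-table row: one entry per candidate breakpoint p, recording
# how many timestamp intervals start at, end at or strictly span p.
IPEntry = namedtuple("IPEntry", ["point", "starts", "ends", "spans"])

def build_ip_table(intervals):
    """Build a toy IP-table from a list of (start, end) timestamp intervals."""
    points = sorted({p for s, e in intervals for p in (s, e)})
    return [IPEntry(p,
                    sum(1 for s, e in intervals if s == p),
                    sum(1 for s, e in intervals if e == p),
                    sum(1 for s, e in intervals if s < p < e))
            for p in points]

def condense(table, factor):
    """Merge every `factor` consecutive rows into one coarser row.
    Start and end counts are summed; the span count of the merged row is
    taken from its first member, one possible approximation that trades a
    little partitioning precision for a much smaller table."""
    condensed = []
    for i in range(0, len(table), factor):
        group = table[i:i + factor]
        condensed.append(IPEntry(group[0].point,
                                 sum(r.starts for r in group),
                                 sum(r.ends for r in group),
                                 group[0].spans))
    return condensed

if __name__ == "__main__":
    ip = build_ip_table([(1, 5), (2, 8), (4, 9), (6, 7)])
    print(condense(ip, 2))   # condensation factor 2 halves the table

In this toy example a condensation factor of 2 halves the number of rows;
the factors of 10 to 20 mentioned above shrink an IP-table accordingly.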
However, there are several issues arising from this work that require
careful consideration and possibly further research in the future.
One of these issues is the efficiency of the maintenance of the
IP-tables. In section 7.4, we were concerned with showing
how IP-tables can be updated; the emphasis was therefore on
feasibility rather than efficiency, and the algorithms in that
section do not claim to be the most efficient ones. In fact, one could
imagine temporal database applications and query situations in which
the overhead that is imposed by IP-table updates might become so
significant that it outweighs the benefits of the IP-tables. It is
still unclear, for example, whether an operational database with
frequent updates to its (temporal) tables would significantly suffer
from the IP-table overhead. More quantitative analysis is required,
either to rule out this possibility or to confirm that such situations
can occur. In the latter case, one might want
to find indicators that identify such problematic situations.
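To illustrate why this overhead could matter, the following deliberately
naive update routine operates on the hypothetical IPEntry representation
used in the sketch above; it is not one of the algorithms of
section 7.4. For simplicity it assumes that the new tuple's start and end
points already appear as breakpoints; even then, every breakpoint strictly
inside the new interval has to be touched, so under frequent insertions the
update cost grows with the number of rows that the new intervals cover.

from bisect import bisect_left
from collections import namedtuple

# Same hypothetical row layout as in the earlier sketch.
IPEntry = namedtuple("IPEntry", ["point", "starts", "ends", "spans"])

def insert_interval(table, start, end):
    """Naive IP-table update after a tuple with timestamp interval
    [start, end] is inserted into the relation.  Assumes both endpoints
    already occur as breakpoints of the (sorted) table; otherwise new
    rows would have to be spliced in first."""
    points = [r.point for r in table]
    i = bisect_left(points, start)
    j = bisect_left(points, end)
    table[i] = table[i]._replace(starts=table[i].starts + 1)
    table[j] = table[j]._replace(ends=table[j].ends + 1)
    for k in range(i + 1, j):
        # the new interval strictly spans every breakpoint between its endpoints
        table[k] = table[k]._replace(spans=table[k].spans + 1)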
Our analytical cost model is a further issue which needs some
validation. In the past, similar approaches have proved to be
valuable for qualitative analysis, e.g. in [Hua et al., 1991]. However,
one cannot say the same about the quantitative aspect. Modern
hardware, especially parallel machines, employs many complex
performance-enhancing mechanisms (such as caching
or special devices that accelerate broadcasts and other typical
communication patterns over the interconnect) which we could hardly
incorporate into our cost model if we wanted to keep it reasonably
general (to allow conclusions to be derived for a wide range of platforms)
and reasonably simple so that it could be used efficiently in a query
optimiser. In other words: we can be reasonably confident that if our
cost model shows that strategy X performs better than strategy Y,
then this effect can be observed on a wide range of implementations.
However, we still need some validation for the absolute numbers: if a
strategy is predicted to incur a particular cost by our cost model, it
remains to be seen how realistic that prediction is. This could
be confirmed by implementing the strategies and the join algorithm on
real hardware or at least by simulating them using one of the
available simulation tools.
Finally, there is another issue that has to be considered carefully:
the cost of the optimisation itself. We gave results of elapsed
times for deriving the costs imposed by the various strategies. These
elapsed times were obtained on a specific machine. For an optimiser
it could be beneficial to have a cost model for the optimisation
process of section 6.1 itself. In that way, it could decide
whether it is worthwhile to consider expensive partitioning
techniques, such as those of the minimum-overlaps family, or whether
simple and fast ones are sufficient, in the interest of saving
optimisation costs.