Next: Preliminaries Up: The Interval Partitioning Problem Previous: The Interval Partitioning Problem

Introduction

In this chapter, we analyse the problem of replication when partitioning a collection of intervals. As we have seen in section 4.4, this problem is relevant for the temporal join algorithms that are based on explicit partitioning of the data over the timestamp attribute.

The replication of tuples can have an impact on the performance of the join algorithms in a variety of ways. The exact consequences for the join costs, however, can only be determined by considering characteristics that are algorithm- or hardware-specific:

Sequential join algorithms are affected differently than parallel ones.
Similarly, it will always be a matter of the underlying hardware, e.g. in the case of a parallel join:
- Is it a shared-memory architecture that might not require the tuples to be replicated physically but only logically or is it a shared-nothing machine which will need the tuples to be physically replicated?
- Are there any specific hardware components that improve certain communication patterns (e.g. fast broadcasts) that will be particularly useful for the respective algorithm?

This means that a partition that minimises the costs for one particular join algorithm running on one particular hardware might not do so well when applied to a different algorithm, possibly running on a different hardware. In this chapter, we want to avoid such situations and therefore we adopt a more general view. A partition is to be considered as optimal if it keeps fragments' sizes below a maximum value, while minimising the number of intervals that overlap the breakpoints. This does not necessarily minimise the join processing costs. Nevertheless, it will be beneficial for every algorithm running on any piece of homogeneous hardware to have such an optimal partition .

The problem of finding such optimal partitions for interval data is called the interval partitioning problem (IP) . This is a rather tricky problem as it not only requires to choose appropriate breakpoints but also makes it delicate to determine the number of breakpoints. In fact, we have encountered a dilemma: in order to reduce the number of overlapping intervals one has to reduce the number of breakpoints but in order to increase the number m of partial joins in (3.6) - e.g. for increasing the degree of parallelism - one has to increase the number of breakpoints. Thus there are two contrary effects associated with interval partitioning; note that the first one does not exist in the case of an equi-join. This is a manifestation of the min-max dilemma [Zhou, 1993]. We therefore need an additional input parameter. A useful constraint is to form fragments with less than a certain number of intervals. Such a number, for example, could be implied by the amount of memory or the disk space that is available.

In this chapter, we will formally investigate the complexity of IP, i.e. we will check whether a solution for IP can be found in polynomial time and - if so - we would like to have an algorithm that can compute this solution. The remainder of this chapter is structured like this: Section 5.2 introduces the notation that we are going to use throughout the chapter and the remaining parts of the thesis. In section 5.3, IP is defined formally. Section 5.4 looks at the search space of the problem. The main result will be that optimal partitions can be found within the set of interval endpoints. In section 5.5, we describe the algorithm IP-opt that computes solutions for IP in polynomial time. Finally, section 5.6 shows an alternative approach by relating IP to a graph-theoretic problem, namely the problem of sequential graph partitioning (SGP). This leads to a similar result as manifested by IP-opt. However, the connection between IP and graph partitioning (GP) might prove to be quite fruitful as the many complexity and algorithmic results about GP might be applied to variations of IP.

Next: Preliminaries Up: The Interval Partitioning Problem Previous: The Interval Partitioning Problem

Thomas Zurek