Stage 1: Repartitioning

Next: Stage 2: Joining Up: Cost Model Previous: The Basic Issues

Stage 1: Repartitioning

The stage - as described in section 8.3.3 - comprises disk accesses, communication, CPU time and memory accesses as the major cost factors. These costs arise for the repartitioning of all participating relations. We restrict ourselves to deriving the cost of the substages for one relation R; the other relations are treated similarly:

(a)

Loading fragments of R from disk This substage does not involve any communication or memory accesses. Only disk accesses and the CPU costs for initiating these accesses have to be modelled. We note that we will not distinguish between random and sequential disk I/O because we can expect an equal mix of random and sequential accesses when comparing the costs caused by the various partitioning strategies. This goes along the argument that was mentioned at the end of section 8.1, i.e. that we are interested in a relative comparison rather than absolute figures. However, we stress that a distinction between random and sequential accesses would be more realistic and has therefore been frequently made in the literature, e.g. in [Soo et al., 1994].

We assume a uniform distribution of R over the nodes, i.e. the fragments are equally sized. They are stored on the disks of the nodes. Therefore each node i has to move

tuples each of size |r| to its main memory. The disk I/O bandwidth is which leads to

(44)

as the disk access costs whereby a portion of

is loaded by each processor. We assume that tuples are moved blockwise from disk to the processors. Thus a disk I/O has to be initiated only once per block (page). If b is the size of such a page then

is the number of pages to be moved. The time spent on initiating one page movement is

where is the number of microprocessor instructions necessary; is the number of instructions per second being performed by the processor. Thus

(45)

is the CPU time spent in substage 1 (a).

As mentioned in the previous section, we assume that disk I/O, communication, CPU and memory access phases have a perfect overlap. Therefore it is the maximum of the individual times that is finally relevant. The total time spent on substage 1 (a) is therefore the maximum of (8.10) and (8.11):

(46)

(b)

Redistribution of the data via the network, including inter-node replication

Substage 1 (b) describes the distribution of the data between the nodes. It hashes tuples to the hash buffers and initiates an inter-node replication via the interconnect. Thus it comprises communication, CPU and memory costs.

Each of the nodes has to deal with

tuples. These are distributed over the interconnect to other nodes depending on their respective timestamp . We assume that -th of the tuples are not sent over the interconnect but remain at the node, i.e. a share of

of the tuples is actually involved in inter-node communication. The data comprises not only the primary but also the replicated tuples. We use the parameter to denote the average number of times that a tuple of R is replicated on the basis of an underlying partition P:

However, communication via the interconnect is only necessary for inter-node replication, i.e. replication across node boundaries. A node boundary appears every -th fragment R_k in the beginning, later every -th fragment. On average this happens every -th fragment if or every fragments otherwise. This translates to node boundaries being encountered every fragments on average. Thus, on average, a tuple is replicated over node boundaries

(47)

times. In other words: whereas provides the average number of fragments R_k in which a tuple has to be present, gives the average number of nodes over which these R_k are spread. Hence each node sends

bytes. As each of the node sends this amount the total communication costs are

(48)

where refers to the communication bandwidth. To initiate the communication for a tuple transfer each processor is supposed to perform instructions, thus it has to spend

seconds per transfer. Similarly, the CPU costs for hashing a tuple to a buffer are

if is the average number of CPU instructions for the hashing. For all r hashed by a single processor there arise CPU costs of

The total CPU costs are therefore

(49)

Finally, on each node there are

tuples to be moved to buffers in main memory. This causes memory access costs of

per tuples with referring to the bandwidth for main memory accesses. Thus

(50)

are the total costs for memory accesses per node. The total time spent on stage 1 (b) is the maximum of (8.15), (8.16) and (8.17):

(51)

(c)

Intra-node replication via main memory

Stage 1 (c) replicates tuples within a node. This can be done via main memory. Originally, a node i had to cope with

tuples per node. Each tuple is replicated times on average. If exceeds (the average number of fragments per node) then most tuples are replicated over all processors of a node; otherwise just times. Writing one tuple to memory creates costs of

if is the memory bandwidth in bytes per second. Thus the memory access costs for this stage are

(52)

CPU costs for these memory accesses can be neglected as they comprise by far less instructions as computing expressions (e.g. as for hashing) or processing two tuples (see ). Thus (8.19) states the total costs for stage 1 (c):

(53)

(d)

Writing new fragments of R to disk

Stage 1 (d) writes the new fragments R'₁, R''₁, R'₂, R''₂, ..., R'_m, R''_m to disk. A node i has to move

tuples to its disks which results in I/O costs of

However, as this is performed concurrently it is only the costs of the node that takes longest to perform the task that are relevant. Thus the I/O costs for this stage are

(54)

By analogy to stage 1 (a), this I/O has to be initiated by the CPUs with each page movement requiring instructions. A processor j moves

tuples to the disks of its node. Again, for the overall costs only the processor that has the highest workload is relevant. This leads to overall CPU costs of

(55)

The total costs for stage 1 (d) arise from the maximum of (8.21) and (8.22):

(56)

The cost components of stage 1 are summarised in tables 8.1, 8.2, 8.3 and 8.4. Relation Q has to be repartitioned in the same way using the respective parameters |Q|, |q|, etc. With equations (8.12), (8.18), (8.20) and (8.23) we can derive the total costs of stage 1 as

with C_1(a)(R) indicating that parameters of relation R should be used for computing the costs of stage 1 (a). Similarly, it has to be distinguished between R and Q in the other substages.

**Table:** Cost components for stage 1 (a).
2\|c\|Stage 1 (a)
Disk I/O
Communication
CPU
Memory

**Table:** Cost components for stage 1 (b).
2\|c\|Stage 1 (b)
Disk I/O
Communication
CPU

Memory

**Table:** Cost components for stage 1 (c).
2\|c\|Stage 1 (c)
Disk I/O
Communication
CPU
Memory

**Table:** Cost components for stage 1 (d).
2\|c\|Stage 1 (d)
Disk I/O
Communication
CPU
Memory

Next: Stage 2: Joining Up: Cost Model Previous: The Basic Issues

Thomas Zurek