Next: Stage 2: Joining
Up: Cost Model
Previous: The Basic Issues
The stage - as described in section 8.3.3 -
comprises disk accesses, communication, CPU time and memory accesses
as the major cost factors. These costs arise for the repartitioning
of all participating relations. We restrict ourselves to deriving the
cost of the substages for one relation R; the other relations are
treated similarly:
- (a)
- Loading fragments of R from disk
This substage does not involve any communication or memory accesses.
Only disk accesses and the CPU costs for initiating these accesses
have to be modelled. We note that we will not distinguish between
random and sequential disk I/O because we can expect an equal mix of
random and sequential accesses when comparing the costs caused by the
various partitioning strategies. This goes along the argument that
was mentioned at the end of section 8.1, i.e. that we are
interested in a relative comparison rather than absolute figures.
However, we stress that a distinction between random and sequential
accesses would be more realistic and has therefore been frequently
made in the literature, e.g. in [Soo et al., 1994].
We assume a uniform distribution of R over the nodes, i.e. the
fragments are equally sized. They are stored on the disks of
the nodes. Therefore each node i has to move
tuples each of size |r| to its main memory. The disk
I/O bandwidth is which leads to
as the disk access costs whereby a portion of
is loaded by each processor. We assume that tuples are moved
blockwise from disk to the
processors. Thus a disk I/O has to be initiated only once per block (page).
If b is the size of such a page then
is the number of pages to be moved. The time spent on initiating
one page movement is
where is the number of microprocessor
instructions necessary; is the number of
instructions per second being performed by the processor. Thus
is the CPU time spent in substage 1 (a).
As mentioned in the previous section, we assume that disk I/O,
communication, CPU and memory access phases have a perfect overlap.
Therefore it is the maximum of the individual times that is finally
relevant. The total time spent on substage 1 (a) is therefore the
maximum of (8.10) and (8.11):
- (b)
- Redistribution of the data via the network, including
inter-node replication
Substage 1 (b) describes the distribution of the data between the
nodes. It hashes tuples to the hash buffers and initiates an
inter-node replication via the interconnect. Thus it comprises
communication, CPU and memory costs.
Each of the nodes has to deal with
tuples. These are distributed over the interconnect to other nodes
depending on their respective timestamp . We assume
that -th of the tuples are not sent over the interconnect but
remain at the node, i.e. a share of
of the tuples is actually involved in inter-node communication. The
data comprises not only the primary but also the replicated tuples.
We use the parameter to denote the
average number of times that a tuple of R is replicated on the basis
of an underlying partition P:
However, communication via the interconnect is only necessary for
inter-node replication, i.e. replication across node boundaries. A
node boundary appears every -th fragment Rk in the
beginning, later every -th fragment. On average this happens
every -th fragment if or every fragments
otherwise. This translates to node boundaries being encountered every
fragments on average. Thus, on average, a tuple is
replicated over node boundaries
times. In other words: whereas provides the average number
of fragments Rk in which a tuple has to be present, gives the average number of nodes over which these Rk are spread.
Hence each node sends
bytes. As each of the node sends this amount the total communication
costs are
where refers to the communication bandwidth. To
initiate the communication for a tuple transfer each processor is
supposed to perform instructions, thus it has
to spend
seconds per transfer. Similarly, the CPU costs for hashing a tuple to
a buffer are
if is the average number of CPU instructions
for the hashing. For all r hashed by a single processor there arise
CPU costs of
The total CPU costs are therefore
Finally, on each node there are
tuples to be moved to buffers in main memory. This causes memory
access costs of
per tuples with referring to the bandwidth for main
memory accesses. Thus
are the total costs for memory accesses per node. The total time spent
on stage 1 (b) is the maximum of (8.15),
(8.16) and (8.17):
- (c)
- Intra-node replication via main memory
Stage 1 (c) replicates tuples within a node. This can be done
via main memory. Originally, a node i had to cope with
tuples per node. Each tuple is replicated times on average.
If exceeds (the average number of fragments per
node) then most tuples are replicated over all processors of a node;
otherwise just times. Writing one tuple to memory creates
costs of
if is the memory bandwidth in bytes per second.
Thus the memory access costs for this stage are
CPU costs for these memory accesses can be neglected as they comprise
by far less instructions as computing expressions (e.g. as for hashing) or processing two tuples (see ). Thus
(8.19) states the total costs for stage 1 (c):
- (d)
- Writing new fragments of R to disk
Stage 1 (d) writes the new fragments R'1, R''1, R'2, R''2,
..., R'm, R''m to disk. A node i has to move
tuples to its disks which results in I/O costs of
However, as this is performed concurrently it is only the costs of
the node that takes longest to perform the task that are relevant. Thus
the I/O costs for this stage are
By analogy to stage 1 (a), this I/O has to be initiated by the CPUs
with each page movement requiring instructions. A processor j
moves
tuples to the disks of its node. Again, for the overall costs only the
processor that has the highest workload is relevant. This leads to
overall CPU costs of
The total costs for stage 1 (d) arise from the maximum of
(8.21) and (8.22):
The cost components of stage 1 are summarised in
tables 8.1, 8.2, 8.3
and 8.4. Relation Q has to be repartitioned in the
same way using the respective parameters |Q|, |q|, etc.
With equations (8.12), (8.18), (8.20) and
(8.23) we can derive the total costs of stage 1 as
with C1(a)(R) indicating that parameters of
relation R should be used for computing the costs of stage 1 (a).
Similarly, it has to be distinguished between R and Q in the other
substages.
Table:
Cost components for stage 1 (a).
2|c|Stage 1 (a) |
|
Disk I/O |
|
Communication |
|
CPU |
|
Memory |
|
Table:
Cost components for stage 1 (b).
2|c|Stage 1 (b) |
|
Disk I/O |
|
Communication |
|
CPU |
|
|
|
Memory |
|
Table:
Cost components for stage 1 (c).
2|c|Stage 1 (c) |
|
Disk I/O |
|
Communication |
|
CPU |
|
Memory |
|
Table:
Cost components for stage 1 (d).
2|c|Stage 1 (d) |
|
Disk I/O |
|
Communication |
|
CPU |
|
Memory |
|
Next: Stage 2: Joining
Up: Cost Model
Previous: The Basic Issues
Thomas Zurek