Data-parallel programming: Can we have high-level languages *and*
high-performance?
Jan F. Prins
Nested arrays and nested data-parallelism combine to
provide a uniquely expressive mechnism for the specification
of irregular parallel computations. The flattening transformation
first described by Blelloch in 1990 provides a technique
to realize all nested parallelism to within a constant factor
of that specified without increasing total work and with
perfect load balance in execution. The price for all these
advantages is that the absolute performance and performance
predictability fall precipitously in the presence of non-uniform
memory access costs, and this appears, in some sense,
to be inherent.
Yet uniform memory access (UMA) on real machines can be guaranteed
through a bulk-reference condition that is easily satisfied
by flattened parallel programs. Current commercial machines
(such as the Cray and NEC parallel vector processors) provide
UMA for vector operations. Using these machines and the
flattening technique, some important irregular applications,
such as the solution of unstructured sparse linear systems,
may achieve the highest absolute performance currently
available.