Load Balancing Strategies to Implement Non-uniform Data Parallel Computations

Salvatore Orlando

The efficient implementation of data parallel loops on distributed memory multicomputers is a hot topic of research. To this end, data parallel languages such as HPF generally exploit static data layout and static scheduling of iterations. Unfortunately, when iteration execution costs vary considerably and are unpredictable, some processors may be assigned more work than others. Workload imbalance can be mitigated by cyclically distributing data and associated computations. Though this strategy often solves the load balancing problem, it may degrade the exploitation of data locality.
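The trade-off above can be illustrated with a minimal sketch (ours, not from the talk): under a block distribution each processor owns one contiguous chunk of iterations, while under a cyclic distribution iterations are dealt out round-robin. With non-uniform iteration costs, the block layout concentrates the expensive iterations on a few processors, whereas the cyclic layout spreads them out. The helper names below are illustrative only.

```python
# Sketch contrasting block and cyclic distribution of loop iterations
# across processors under non-uniform iteration costs.

def block_owner(i, n_iters, n_procs):
    """Owner of iteration i under a block distribution (contiguous chunks)."""
    block = -(-n_iters // n_procs)  # ceiling division: chunk size
    return i // block

def cyclic_owner(i, n_iters, n_procs):
    """Owner of iteration i under a cyclic (round-robin) distribution."""
    return i % n_procs

def per_proc_load(owner, costs, n_procs):
    """Total execution cost assigned to each processor by `owner`."""
    load = [0] * n_procs
    for i, c in enumerate(costs):
        load[owner(i, len(costs), n_procs)] += c
    return load

# Non-uniform costs: the second half of the iteration space is 4x heavier.
n, p = 16, 4
costs = [1] * (n // 2) + [4] * (n // 2)
print(per_proc_load(block_owner, costs, p))   # heavy half lands on two processors
print(per_proc_load(cyclic_owner, costs, p))  # heavy iterations spread evenly
```

Here the block layout yields loads of [4, 4, 16, 16] while the cyclic one yields a balanced [10, 10, 10, 10]; the price of the cyclic layout, as noted above, is that neighbouring iterations (and their stencil data) no longer reside on the same processor.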

In this talk we present SUPPLE (SUPport for Parallel Loop Execution), an innovative run-time support for parallel loops with regular stencil data references and non-uniform iteration costs. SUPPLE relies upon a static block data distribution to exploit locality, and combines static and dynamic policies for scheduling non-uniform iterations. It adopts, as far as possible, a static scheduling policy derived from the owner computes rule, and moves data and iterations among processors only if a load imbalance actually occurs. SUPPLE always tries to overlap communications with useful computations by reordering loop iterations and prefetching remote ones in the case of workload imbalance.
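The hybrid policy can be sketched in simplified form (this is our illustration, not SUPPLE's actual implementation, and it ignores communication and prefetching): each processor first works through the iterations it owns under the static block distribution, and only when a processor runs out of local work does it take a queued iteration from the most loaded processor.

```python
from collections import deque

def hybrid_schedule(costs, n_procs):
    """Simulate hybrid static/dynamic scheduling of non-uniform iterations.

    Iterations start in per-processor queues under a block distribution
    (static, owner-computes phase); an idle processor then migrates one
    iteration at a time from the most loaded queue (dynamic phase).
    Returns the finish time of each processor.
    """
    n = len(costs)
    block = -(-n // n_procs)  # ceiling division: chunk size
    queues = [deque(range(p * block, min((p + 1) * block, n)))
              for p in range(n_procs)]
    time = [0] * n_procs
    while any(queues):
        # the least-loaded processor runs next; ties go to the lowest index
        p = min(range(n_procs), key=lambda q: time[q])
        if not queues[p]:
            # dynamic phase: migrate one iteration from the most loaded queue
            donor = max(range(n_procs), key=lambda q: len(queues[q]))
            queues[p].append(queues[donor].pop())
        i = queues[p].popleft()
        time[p] += costs[i]
    return time

# Same non-uniform workload as before: second half of iterations 4x heavier.
costs = [1] * 8 + [4] * 8
print(hybrid_schedule(costs, 4))
```

On this workload the purely static block schedule finishes at time 16 (two processors carry all the heavy iterations), while the hybrid schedule finishes at time 12: migration only kicks in once the lightly loaded processors go idle, so most iterations still run on their owners, preserving the locality of the static layout.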

The SUPPLE approach has been validated by many experiments conducted by running a multi-dimensional flame simulation kernel on a 64-node Cray T3D. We have fed the benchmark code with several synthetic input data sets built on the basis of a load imbalance model. The results are very encouraging: for example, for some data sets, the total overheads due to communications and the residual load imbalance range between only 0.88% and 1.69% of the optimal time. These results are better than those obtained with a CRAFT Fortran (an HPF-like language) implementation of the benchmark.