The Performance of LAM 6.0 and MPICH 1.0.12
on a Workstation Cluster

Nick Nevin
Ohio Supercomputer Center Technical Report OSC-TR-1996-4
Columbus, Ohio

This report presents measurements of the performance of the LAM 6.0 and MPICH 1.0.12 implementations of the Message Passing Interface (MPI).

A PostScript version of this report and other related papers are available from the LAM ftp site.

Test Conditions

The tests were run on a cluster of eight DEC 3000/300 workstations connected by an FDDI network and running OSF/1 V3.2.

All benchmarks and both libraries were compiled with the standard DEC C compiler with -O optimization.

The MPICH MPI library was configured for the ch_p4 device. The default configuration was used apart from setting -O level compiler optimization and setting -nodevdebug in order to exclude debugging overhead.

The LAM MPI library was configured in the standard way with -O level compiler optimization.

All LAM tests used the -c2c, -nger and -O switches to mpirun. The first selects client-to-client mode in which the LAM library bypasses the daemon and clients communicate directly. The second turns off the Guaranteed Envelope Resources feature of LAM. The third informs the LAM/MPI library that the cluster is homogeneous and hence turns off data conversion.

No special run-time switches were used when running MPICH tests. The MPICH library detects upon initialization that the cluster is homogeneous.

The LAM and MPICH libraries differ in how they set up communication channels between MPI processes. In client-to-client mode LAM sets up a fully connected network at initialization time whereas MPICH makes connections on a demand driven basis. To ensure that connection setup time was not included in the tests, all the benchmark programs perform some communications before the timing phase in order to force the establishment of all the necessary connections.
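As a minimal sketch of such a warm-up phase in the two-process case (not taken from the benchmark source; the buffer size and tag are arbitrary):

    /* Warm-up exchange (sketch): forces connection establishment between
       ranks 0 and 1 before the timed loop begins.  Not the benchmark source. */
    char warm[1];
    MPI_Status status;

    if (rank == 0) {
        MPI_Send(warm, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(warm, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        MPI_Recv(warm, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(warm, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

For the eight-process collective tests the same idea applies, except that the warm-up communications must establish all the connections needed by the timed phase.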

The Benchmarks

A suite of six benchmark programs was used. In these programs all the MPI communications measured use datatype MPI_BYTE which ensures that neither message data conversion nor packing is done.

The ping and ping-pong tests measure blocking point-to-point communication performance using MPI_Send and MPI_Recv. Both tests are run in an MPI_COMM_WORLD of size two, with each process on a separate node.

The barrier, broadcast, gather and alltoall tests measure the performance of the corresponding MPI collective communication functions. These tests are run in an MPI_COMM_WORLD of size eight, with one process per node.

Timings were done with MPI_Wtime, which in both libraries is implemented on top of the UNIX gettimeofday system call. Since the granularity of gettimeofday is relatively coarse, timings are obtained by surrounding a loop of communications with calls to MPI_Wtime and dividing the difference between the two times by the number of iterations performed. We call this final measure of elapsed time an observation.

For each benchmark and for each data size considered we run an experiment in which 20 observations are measured as described above. The final data-point is then the mean of these twenty observations.
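In outline, the measurement of a single data-point looks roughly like the following. ITERATIONS and OBSERVATIONS are illustrative names and values, not those used in the actual benchmark programs.

    /* Measurement scheme (sketch).  The body of the inner loop is one
       communication, e.g. a send/receive pair or a collective call. */
    #define ITERATIONS   100   /* communications per observation (assumed) */
    #define OBSERVATIONS 20    /* observations per experiment */

    double t0, t1, obs, mean, sum = 0.0;
    int i, j;

    for (j = 0; j < OBSERVATIONS; j++) {
        t0 = MPI_Wtime();
        for (i = 0; i < ITERATIONS; i++) {
            /* one communication */
        }
        t1 = MPI_Wtime();
        obs = (t1 - t0) / ITERATIONS;   /* one observation */
        sum += obs;
    }
    mean = sum / OBSERVATIONS;          /* the data-point plotted in the graphs */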

In this report we present in graphical form the mean times over the 20 observations for each experiment. All times are given in seconds.

The raw data includes the mean, standard deviation, minimum and maximum of the 20 observations.

Ping

In this test one process is run on each of two nodes from the cluster. Process rank 0 loops calling MPI_Send with destination rank 1. Process rank 1 loops calling MPI_Recv with source rank 0.
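Putting the pieces together, a complete ping program might look roughly as follows. This is an illustrative sketch only, not the benchmark source; NBYTES and ITERATIONS are assumed values, only a single observation is taken, and the program must be run with exactly two processes.

    /* ping.c -- illustrative sketch, not the benchmark source used here. */
    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES     1024   /* message size (assumed value) */
    #define ITERATIONS 100    /* messages per observation (assumed value) */

    int main(int argc, char *argv[])
    {
        char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* warm-up exchange to force connection establishment */
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(buf, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }

        /* one observation: rank 0 sends, rank 1 receives */
        t0 = MPI_Wtime();
        if (rank == 0) {
            for (i = 0; i < ITERATIONS; i++)
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else {
            for (i = 0; i < ITERATIONS; i++)
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%f seconds per message\n", (t1 - t0) / ITERATIONS);

        MPI_Finalize();
        return 0;
    }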

In client-to-client mode, LAM 6.0 by default uses a short-message protocol for messages up to 8192 bytes in length and switches over to a long-message protocol for longer messages. By default, MPICH changes protocol at 16384 bytes. The effect of the LAM protocol switch can be seen quite clearly here and in the ping-pong benchmark. The maximum length of a short message can be changed in both implementations at compile time by setting the appropriate constant.

Ping-Pong

This test is similar to the ping test except that here both processes send and receive. Process rank 0 loops calling MPI_Send with destination rank 1 followed by MPI_Recv with source rank 1. Process rank 1 loops calling MPI_Recv with source rank 0 followed by MPI_Send with destination rank 0.
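Relative to the ping sketch above, only the timed loop changes; each iteration now completes a full round trip:

    /* Ping-pong inner loop (sketch): one round trip per iteration. */
    if (rank == 0) {
        for (i = 0; i < ITERATIONS; i++) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        }
    } else {
        for (i = 0; i < ITERATIONS; i++) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }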

Barrier

In this test one process is run on each of the eight nodes in the cluster. Each process loops calling MPI_Barrier. The time reported is for process rank 0.

                 LAM          MPICH
    mean         0.005185     0.007268
    st.dev       0.000957     0.000189

Broadcast

This benchmark is designed along the lines of the methodology described in [1]. One MPI process is run on each of the eight nodes in the cluster. For each non-root process in the broadcast, we time at the root a loop of broadcasts followed by the receive of a zero-length message from that non-root process; the non-root process performs its loop of broadcasts followed immediately by the send of a zero-length message to the root. The maximum of these times over all the non-root processes then gives an estimate of the maximum time taken by any process participating in the broadcast. This maximum time is what is shown in the graph, plotted against the byte count, which refers to the size of the data sent by the root to each process. The timings for the individual leaf processes can be found in the complete data listing.
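A sketch of one such measurement, for a single non-root rank p, is shown below; all eight processes participate in the broadcasts, and the zero-length acknowledgement is exchanged only between the root and rank p. The variable names and loop structure are one reading of the scheme, not the benchmark source.

    /* Broadcast measurement for one non-root rank p (sketch only). */
    char ack;                               /* zero-length acknowledgement */

    t0 = MPI_Wtime();
    if (rank == 0) {                        /* root */
        for (i = 0; i < ITERATIONS; i++)
            MPI_Bcast(buf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);
        MPI_Recv(&ack, 0, MPI_BYTE, p, 0, MPI_COMM_WORLD, &status);
    } else {
        for (i = 0; i < ITERATIONS; i++)
            MPI_Bcast(buf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == p)
            MPI_Send(&ack, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }
    t1 = MPI_Wtime();
    /* (t1 - t0) / ITERATIONS at the root estimates the completion time of
       rank p; the maximum over all non-root ranks is what the graph shows */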

Note that the MPICH implementation does no communication for a data size of zero. LAM does not check for this special case and the root sends zero length messages.

Gather

In this test one process is run on each of the eight nodes in the cluster. Each process loops calling MPI_Gather with root 0. The time reported is for process rank 0 and the byte count refers to the size of the data sent by each process to the root.
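The timed loop at each process might look like the following sketch, where sendbuf holds nbytes bytes and recvbuf at the root has room for eight times that; the names are illustrative only.

    /* Gather inner loop (sketch): each process sends nbytes bytes to root 0;
       recvbuf is significant only at the root. */
    for (i = 0; i < ITERATIONS; i++)
        MPI_Gather(sendbuf, nbytes, MPI_BYTE,
                   recvbuf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);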

Alltoall

In this test one process is run on each of the eight nodes in the cluster. Each process loops calling MPI_Alltoall. The time reported is for process rank 0 and the byte count refers to the size of the data sent by each process to each other process.

References

  1. Nupairoj, Natawat and Lionel M. Ni, "Benchmarking of Multicast Communication Services," Technical Report MSU-CPS-ACS-103, Michigan State University, April 1, 1995.