This report presents performance measurements of the LAM 6.0 and MPICH 1.0.12 implementations of the Message Passing Interface (MPI).
A PostScript version of this report and other related papers are available from the LAM ftp site.
All benchmarks and both libraries were compiled with the standard DEC C compiler with -O optimization.
The MPICH MPI library was configured for the ch_p4 device. The default configuration was used apart from setting -O level compiler optimization and setting -nodevdebug in order to exclude debugging overhead.
The LAM MPI library was configured in the standard way with -O level compiler optimization.
All LAM tests used the -c2c, -nger and -O switches to mpirun. The first selects client-to-client mode in which the LAM library bypasses the daemon and clients communicate directly. The second turns off the Guaranteed Envelope Resources feature of LAM. The third informs the LAM/MPI library that the cluster is homogeneous and hence turns off data conversion.
No special run-time switches were used when running MPICH tests. The MPICH library detects upon initialization that the cluster is homogeneous.
The LAM and MPICH libraries differ in how they set up communication channels between MPI processes. In client-to-client mode LAM sets up a fully connected network at initialization time, whereas MPICH makes connections on a demand-driven basis. To ensure that connection setup time was not included in the tests, all the benchmark programs perform some communications before the timing phase in order to force the establishment of all the necessary connections.
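The effect of this warm-up step can be illustrated with a toy model. The sketch below is not code from either library: `LazyChannels` is a hypothetical stand-in for a demand-driven transport in which the first message to a peer triggers a connection setup, simulated here by a counter. Sending one message to every peer before the timed phase leaves no setup work inside it.

```python
class LazyChannels:
    """Toy model of demand-driven connection setup (hypothetical, not MPICH code)."""

    def __init__(self, rank, size):
        self.rank = rank
        self.size = size
        self.connected = set()   # peers with an established channel
        self.setups = 0          # number of connection setups performed so far

    def send(self, peer, payload):
        # A connection is made the first time a given peer is contacted.
        if peer not in self.connected:
            self.connected.add(peer)
            self.setups += 1
        return len(payload)      # pretend the payload was delivered


def warm_up(channels):
    """Contact every other rank once so that later sends need no setup."""
    for peer in range(channels.size):
        if peer != channels.rank:
            channels.send(peer, b"")


channels = LazyChannels(rank=0, size=8)
warm_up(channels)
setups_before_timing = channels.setups

# Timed phase: no new connections should be created here.
for _ in range(100):
    channels.send(1, b"x" * 1024)
setups_during_timing = channels.setups - setups_before_timing
```

With eight ranks, the warm-up performs seven setups and the timed loop performs none, which is the property the benchmarks rely on.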
The ping and ping-pong tests measure non-blocking point-to-point communication performance. Both these tests are run in an MPI_COMM_WORLD of size two, each process on a separate node.
The barrier, broadcast, gather and alltoall tests measure the performance of the corresponding MPI collective communication functions. These tests are run in an MPI_COMM_WORLD of size eight, one process per node.
Timings were done with MPI_Wtime, which in both libraries is implemented on top of the UNIX gettimeofday system call. Since the granularity of gettimeofday is not very fine, timings are obtained by surrounding a loop of communications with calls to MPI_Wtime and dividing the difference between the two times by the number of iterations performed. We call this final measure of elapsed time an observation.
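The way one observation is obtained can be sketched as follows. This is a minimal illustration and not the benchmark source: `observe` and `fake_exchange` are hypothetical names, `time.perf_counter` stands in for MPI_Wtime, and the placeholder operation stands in for a real communication call.

```python
import time


def observe(operation, iterations, clock=time.perf_counter):
    """Return the mean elapsed time per call of `operation` over one timed loop."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = clock()                  # stands in for the first MPI_Wtime call
    for _ in range(iterations):
        operation()
    stop = clock()                   # stands in for the second MPI_Wtime call
    return (stop - start) / iterations


def fake_exchange():
    # Placeholder for one communication, for example a send and its matching receive.
    sum(range(100))


observation = observe(fake_exchange, iterations=1000)
```

Timing the whole loop rather than each call means the clock's coarse granularity is spread over many iterations instead of dominating a single short measurement.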
For each benchmark and for each data size considered we run an experiment in which 20 observations are measured as described above. The final data-point is then the mean of these 20 observations.
In this report we present in graphical form the mean times over the 20 observations for each experiment. All times are given in seconds.
The raw data includes the mean, standard deviation, minimum and maximum of the 20 observations.
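The four summary figures kept for each experiment could be computed as in the sketch below. The function name `summarize` is hypothetical, the sample values are made up for illustration and are not measurements, and the use of the sample standard deviation is an assumption, since the report does not say which estimator was used.

```python
import statistics


def summarize(observations):
    """Summarize one experiment's observations (times in seconds)."""
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    return {
        "mean": statistics.mean(observations),
        # Assumption: sample (n - 1) standard deviation.
        "st.dev": statistics.stdev(observations),
        "min": min(observations),
        "max": max(observations),
    }


# Made-up observations for illustration only; these are not measured values.
example = [0.010, 0.012, 0.011, 0.013]
summary = summarize(example)
```

In the experiments described here the list would hold 20 observations rather than four.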
LAM 6.0 client-to-client mode by default uses a short message protocol on messages up to 8192 bytes in length. It switches over to a long message protocol for longer messages. By default MPICH changes protocol at 16384 bytes. The effect of the LAM protocol can be seen quite clearly here and in the ping-pong benchmark. The maximum length of a short message can be changed in both implementations at compile time by setting the appropriate constant.
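The default switch-over points can be written as a simple size test. The sketch below only models the choice of protocol by message length and does not reproduce either library's code: the function names and the "short" and "long" labels are hypothetical, and the treatment of a message of exactly 16384 bytes under MPICH is an assumption, because the text says only that the protocol changes at that size.

```python
# Default switch-over points quoted in the text, in bytes.
LAM_SHORT_MAX = 8192
MPICH_SWITCH = 16384


def lam_protocol(nbytes, short_max=LAM_SHORT_MAX):
    """Short protocol for messages up to `short_max` bytes, long protocol otherwise."""
    return "short" if nbytes <= short_max else "long"


def mpich_protocol(nbytes, switch=MPICH_SWITCH):
    # Assumption: a message of exactly `switch` bytes still uses the first
    # protocol; the text does not say which side of the boundary it falls on.
    return "short" if nbytes <= switch else "long"


sizes = [0, 8192, 8193, 16384, 16385]
table = [(n, lam_protocol(n), mpich_protocol(n)) for n in sizes]
```

Between 8193 and 16384 bytes the two libraries use different protocols under this model, which is the size range where a difference in the timing curves would be expected to appear.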
MPICH   mean: 0.007268   st.dev: 0.000189
Note that the MPICH implementation does no communication for a data size of zero. LAM does not check for this special case, and the root sends zero-length messages.