Getting Started with MPI on LAM

LAM is a simple yet powerful environment for running and monitoring MPI applications on clusters. The few essential steps of a LAM session are covered below.

Booting LAM

The user creates a file listing the participating machines in the cluster.

% cat lamhosts
# a 2-node LAM
tbag.osc.edu
alex.osc.edu

Each machine will be given a node identifier (nodeid), starting with 0 for the first listed machine, 1 for the second, etc.

The recon tool verifies that the cluster is bootable.

% recon -v lamhosts
recon: testing n0 (tbag.osc.edu)
recon: testing n1 (alex.osc.edu)

The lamboot tool actually starts LAM on the specified cluster.

% lamboot -v lamhosts

LAM 6.0 - Ohio Supercomputer Center

hboot n0 (tbag.osc.edu)...
hboot n1 (alex.osc.edu)...

lamboot returns to the UNIX shell prompt. LAM does not force a canned environment or a "LAM shell" on the user.

The tping command builds confidence that the cluster and LAM are running. The N argument designates all nodes, and -c1 limits the test to a single ping.

% tping -c1 N
  1 byte from 2 nodes: 0.009 secs

Compiling MPI Programs

Refer to "MPI: It's Easy to Get Started" to see a simple MPI program. The hcc (hf77) command is a wrapper for the C (Fortran 77) compiler that links the LAM support libraries. The MPI library itself must be linked explicitly.

% hcc -o foo foo.c -lmpi
% hf77 -o foo foo.f -lmpi
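
As a concrete example, a minimal MPI program in the spirit of the one in "MPI: It's Easy to Get Started" might look like the following sketch (not the exact program from that document):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                    /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                            /* shut down MPI */
    return 0;
}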

Executing MPI Programs

An MPI application is started by one invocation of the mpirun command. An SPMD application can be started directly on the mpirun command line by naming the nodes (here n0-1, nodes 0 through 1) and the program.

% mpirun -v n0-1 foo
2445 foo running on n0 (o)
361 foo running on n1

An application with multiple programs must be described in an application schema, a file that lists each program and its target node(s).

% cat appfile
# 1 master, 2 slaves
master n0
slave n0-1

% mpirun -v appfile
3292 master running on n0 (o)
3296 slave running on n0 (o)
412 slave running on n1
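
As an illustration, hypothetical master and slave programs consistent with this schema might look like the following sketch, where the master (rank 0) collects one integer from each slave:

/* master.c -- hypothetical sketch: rank 0 collects one integer per slave */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, i, result;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* 3 processes with the appfile above */
    for (i = 1; i < size; i++) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("master: received %d from rank %d\n", result, status.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}

/* slave.c -- hypothetical sketch: each slave reports its rank to the master */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* dest rank 0, tag 0 */
    MPI_Finalize();
    return 0;
}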

Monitoring MPI Applications

The full MPI synchronization status of all processes and messages can be displayed at any time. This includes the source and destination ranks, the message tag, count, and datatype, the communicator, and the function invoked.

% mpitask
TASK (G/L)           FUNCTION      PEER|ROOT  TAG    COMM   COUNT   DATATYPE
0/0 master           Recv          ANY        ANY    WORLD  1       INT
1 slave              <running>
2 slave              <running>

Process rank 0 is blocked receiving a message consisting of a single integer from any source rank and any message tag, using the MPI_COMM_WORLD communicator. The other processes are running.
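
This status corresponds to a receive call along these lines (a hypothetical sketch; the variable names are assumptions):

int buf;
MPI_Status status;

/* 1 INT from any source, any tag, on MPI_COMM_WORLD -- matching the
   FUNCTION, PEER, TAG, COMM, COUNT, and DATATYPE columns above */
MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
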
% mpimsg
SRC (G/L)      DEST (G/L)     TAG     COMM     COUNT    DATATYPE    MSG
0/0            1/1            7       WORLD    4        INT         n0,#0

Later, we see that a message sent by process rank 0 to process rank 1 is buffered and waiting to be received. It was sent with tag 7 using the MPI_COMM_WORLD communicator and contains 4 integers.
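
A send call along these lines would produce such a message (again a hypothetical sketch):

int buf[4] = {0, 1, 2, 3};

/* 4 INTs to rank 1 with tag 7 on MPI_COMM_WORLD -- matching the
   DEST, TAG, COMM, COUNT, and DATATYPE columns above */
MPI_Send(buf, 4, MPI_INT, 1, 7, MPI_COMM_WORLD);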

Cleaning LAM

All user processes and messages can be removed with the lamclean tool, without rebooting.

% lamclean -v
killing processes, done
sweeping messages, done
closing files, done
sweeping traces, done

Terminating LAM

The wipe tool removes all traces of the LAM session on the network.

% wipe -v lamhosts
tkill n0 (tbag.osc.edu)...
tkill n1 (alex.osc.edu)...
