LAM Overview

LAM (Local Area Multicomputer) is an MPI programming environment and development system for heterogeneous computers on a network. With LAM, a dedicated cluster or an existing network computing infrastructure can act as one parallel computer solving one problem.

LAM features a full implementation of the Message-Passing Interface. Compliant applications are source-code portable between LAM and any other implementation of MPI. In addition to meeting the standard in a high-quality manner, LAM offers extensive monitoring capabilities to support debugging. Monitoring happens on two levels. LAM has the hooks to allow a snapshot of process and message status to be taken at any time during an application run. The status includes all aspects of synchronization plus datatype map / signature, communicator group membership and message contents. On the second level, the MPI library is instrumented to produce a cumulative record of communication, which can be visualized either at runtime or post-mortem.

The Foundation

LAM runs on each computer as a single UNIX daemon uniquely structured as a nano-kernel and hand-threaded virtual processes. The nano-kernel component provides a simple message-passing rendezvous service to local processes. Some of the in-daemon processes form a network communication subsystem, which transfers messages to and from other LAM daemons on other nodes. Inter-node packet exchange uses a scalable UDP-based protocol. The network subsystem adds features such as packetization and buffering to the base synchronization. Other in-daemon processes are servers for remote capabilities, such as program execution and file access. The layering is quite distinct: the nano-kernel has no connection with the network subsystem, which has no connection with the servers. Users can configure services in or out as necessary.


Application management and debugging are provided by a uniquely structured extensible daemon.

The unique software engineering of LAM is transparent to users and system administrators, who only see a conventional daemon. System developers can de-cluster the daemon into a daemon containing only the nano-kernel and several full client processes. This developer's mode is still transparent to users but exposes LAM's highly modular components to simplified individual debugging. It also reveals LAM's evolution from Trollius, which ran natively on scalable multicomputers and joined them to the UNIX network through a uniform programming interface.

The network layer in LAM is a documented, primitive, abstract layer on which more powerful communication standards such as MPI can be implemented (PVM has also been implemented on it).

MPI Tools

An important feature of LAM is hands-on control of the multicomputer. There is very little that cannot be seen or changed at runtime. In particular, unreceived messages and the synchronization state of processes can be examined with the mpimsg and mpitask tools. The key synchronization variables (source rank, destination rank, tag and communicator) are all displayed. Within a communicator, an opaque MPI object, the user can examine the group membership. Within a datatype, the type map and signature are exposed. Additionally, the contents of a message can be displayed. Again, these capabilities are available at any moment during an application's run.
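
As a concrete illustration, the small MPI program below (tag values and ranks are hypothetical) deliberately leaves one unreceived message in the system and one process blocked in a receive; pointing mpitask and mpimsg at such a job would show, respectively, the blocked process with its source rank, tag and communicator, and the queued message.

    /* probe_demo.c: a minimal sketch of a state that LAM's monitoring
     * tools can expose at runtime: an unreceived message (mpimsg) and a
     * process blocked in a receive (mpitask). */
    #include <mpi.h>

    #define TAG_DATA   1    /* message that will sit unreceived */
    #define TAG_NEVER  2    /* tag that no process ever sends   */

    int main(int argc, char *argv[])
    {
        int rank, value = 42, dummy;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* A small standard-mode send typically completes by buffering,
             * leaving the message visible to mpimsg until it is received. */
            MPI_Send(&value, 1, MPI_INT, 1, TAG_DATA, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocks indefinitely; mpitask shows the source rank, tag and
             * communicator this process is waiting on. */
            MPI_Recv(&dummy, 1, MPI_INT, 0, TAG_NEVER, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }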


Message queues and blocked processes can be examined at runtime.

The mpirun tool finds and loads the program(s) that constitute the application. A simple SPMD application can be specified on the mpirun command line, while a more complex configuration is described in a separate file called an application schema.

MPI Library

A strength of this MPI implementation is how it advances non-blocking communication requests, including those that result from buffered sends. This is the real challenge of implementing MPI; everything else is mostly a straightforward wrapping of an underlying communication mechanism. LAM / MPI allows messages to be buffered on the source end in any state of progression, including partially transmitted packets. This capability leads to great portability and robustness.
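
The sketch below (buffer size and tags are arbitrary) shows the two kinds of requests in question in standard MPI terms: a non-blocking send, which returns before the transfer finishes, and a buffered send, which completes locally by copying into a user-attached buffer. In both cases the library must keep advancing the message after the call returns.

    /* request_demo.c: non-blocking and buffered sends that the
     * implementation must continue to progress behind the scenes. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, out = 1, in = 0;
        int bufsize = 4096 + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Buffer_attach(buf, bufsize);

        if (rank == 0) {
            /* Returns immediately; the transfer proceeds in the background. */
            MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* Completes locally by copying into the attached buffer; the
             * data may still be only partially transmitted at this point. */
            MPI_Bsend(&out, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Wait(&req, &status);
        } else if (rank == 1) {
            MPI_Recv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(&in, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        }

        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
        MPI_Finalize();
        return 0;
    }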

The sophisticated message-advancing engine at the heart of the MPI library uses only a handful of routines to drive the underlying communication system. Runtime flags decide which implementation of these low-level routines is used, so recompilation is not necessary. The default implementation uses LAM's network message-passing subsystem, including its buffer daemon. In this "daemon" mode, LAM's extensive monitoring features are fully available. The main purpose of daemon-based communication is development, but depending on the application's decomposition granularity and communication requirement, it may also be entirely adequate for production execution.


MPI communication can bypass the LAM daemon for peak performance.

The other implementation of the MPI library's low-level communication is intended to use the highest-performance underlying mechanism available, bypassing the LAM daemon and connecting application processes directly. This is the "client to client" (C2C) mode.

The availability of optimal C2C implementations will continue to change as architectures come and go. At a minimum, LAM includes a TCP/IP implementation of C2C that bypasses the LAM daemon.

Guaranteed Envelope Resources

Applications may fail, legitimately, on some implementations but not others due to an escape hatch in the MPI Standard called "resource limitations". Most resources are managed locally and it is easy for an implementation to provide a helpful error code and/or message when a resource is exhausted. Buffer space for message envelopes is often a remote resource (as in LAM) which is difficult to manage. An overflow may not be reported (as in some other implementations) to the process that caused the overflow. Moreover, interpretation of the MPI guarantee on message progress may confuse the debugging of an application that actually died on envelope overflow.


LAM advertises and guarantees a minimum level of internal resources for message delivery.

LAM 6.0 has a property called "Guaranteed Envelope Resources" (GER) which serves two purposes. First, it is a promise from the implementation to the application that a minimum amount of envelope buffering will be available to each process pair. Second, it ensures that a producer of messages that overflows this resource will be throttled or cited with an error as necessary.

A minimum GER is configured when LAM is built.
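
The sketch below (message count and delay are arbitrary) shows the kind of producer/consumer imbalance GER is concerned with: one process generates many small messages for a peer that is slow to post receives. With GER, once the advertised per-pair envelope quota is reached, the sender's standard-mode sends are throttled or an error is raised at the sender, rather than envelopes being silently lost.

    /* ger_demo.c: a fast producer and a slow consumer of small messages. */
    #include <mpi.h>
    #include <unistd.h>

    #define NMSG 10000

    int main(int argc, char *argv[])
    {
        int rank, i, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Flood of standard-mode sends; the peer is not yet receiving. */
            for (i = 0; i < NMSG; i++)
                MPI_Send(&value, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
        } else if (rank == 1) {
            sleep(5);   /* let the sender run far ahead before draining */
            for (i = 0; i < NMSG; i++)
                MPI_Recv(&value, 1, MPI_INT, 0, i, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }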

Process Spawning

A group of MPI processes can collectively create a group of new processes. An intercommunicator is established for communication.
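
LAM 6.0 predates the final MPI-2 standard, so its spawning interface at the time may differ in detail; the sketch below uses the later-standardized MPI_Comm_spawn call and a hypothetical "worker" executable to illustrate the idea: the parent group collectively creates new processes and receives an intercommunicator back.

    /* spawn_demo.c: collective creation of child processes, illustrated
     * with the MPI-2 MPI_Comm_spawn interface ("worker" is hypothetical). */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm children;   /* intercommunicator to the spawned group */

        MPI_Init(&argc, &argv);

        /* All members of MPI_COMM_WORLD participate; the command, argument
         * list and process count are taken from the designated root (0). */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

        /* Parents and children can now exchange messages across the
         * intercommunicator; on the child side, MPI_Comm_get_parent
         * returns the matching handle. */

        MPI_Finalize();
        return 0;
    }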

Dynamic Nodes and Fault Tolerance

An initial set of LAM nodes is started with the lamboot tool. Afterwards, nodes can be added or deleted with the lamgrow and lamshrink tools. Thus, a resource manager can adjust the resources allocated to a LAM session at runtime.


Both LAM nodes and MPI processes can be dynamically added and removed at runtime in a dynamic resource environment.

The unplanned removal of a node due to a machine crash or other fault is detected by LAM and handled in much the same way as a planned adjustment with lamshrink. All surviving application processes are informed, via an asynchronous signal, that a node has been removed. The MPI library reacts by invalidating all communicators that include processes on the dead node. Pending communication requests are marked with an error, and future attempts to use invalid communicators also raise an error. The application can detect these errors, free the invalid communicators, and possibly create new processes.
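
A minimal sketch of how an application can observe such errors rather than aborting (peer ranks and tags are arbitrary): the error handler on the communicator is switched from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN, and return codes from communication calls are checked.

    /* fault_demo.c: detect communication failure instead of aborting.
     * MPI_Errhandler_set is the MPI-1 style call of LAM 6.0's vintage. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0, rc = MPI_SUCCESS;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ask for error codes instead of the default abort-on-error. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0)
            rc = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            rc = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

        if (rc != MPI_SUCCESS) {
            /* The communicator may have been invalidated by a node failure;
             * the application can free it and, if desired, create new
             * processes to continue. */
            fprintf(stderr, "rank %d: communication failed (code %d)\n",
                    rank, rc);
        }

        MPI_Finalize();
        return 0;
    }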

Installation and Session Initiation

LAM installs anywhere and uses the shell's search path at all times to find LAM and application executables. A multicomputer is specified as a simple list of machine names in a file, which LAM uses to verify access (recon), start the environment (lamboot), and remove it (wipe).

LAM / MPI Parallel Computing / Ohio Supercomputer Center / lam@tbag.osc.edu