The Multiprocessor Simulation Testbed

The Testbed

This document describes the simulation testbed developed to allow plug-and-play between different simulation models and workloads. The idea was to provide a clean interface between the two, so that development of models and workloads could proceed independently.

The model and workload are separate programs, compiled into separate libraries in separate directories and only linked together at the end to produce an executable simulation program. This strict separation is what makes mixing and matching possible. For example, a bus model and a random workload can be combined using:

CC bus.a random.a

and the bus model with a memory-bank-saturating workload using:

CC bus.a memsat.a
The important features of the testbed are:

Interface specification (between models and workloads).

Allowing plug-and-play between different simulation models is the subject of a number of initiatives in simulation research, notably the US DoD's HLA (High Level Architecture) initiative. Similar concerns abound in the computer industry, from the Virtual Socket Interface used for mixing and matching cores on a piece of silicon, to 'software component' technology such as CORBA/IIOP and JavaBeans. The crucial aspect is to get the interface right. Here a testbed interface for interconnection network simulations was developed. The model must provide N processors and N memories. The workload runs on each processor, and may also supply its own memory behaviour model. The interface defines how memory accesses, message passing and synchronisation take place.

The workload is called from the model, the interface being in the functions:-

void init_workload(p_fn &mf);
void do_workload( int argc, char **argv, int index, int nprocs );
init_workload must be called first by the model; it reads in the global parameters from the input.params file.

The workload is chosen using the workload parameter set in input.params; the available workloads are described in the Workloads section below.

The processor calls do_workload to perform the work.

A workload-specific function to perform memory operations may be provided in mf; for example, the Cray model includes a memory unit which responds to Cray-specific memory operations.

The workload drives the model using the function:-

void do_cmd(int index, int cmd, int pno, int addr, int size, int *data);

This performs one of the following commands:-

Command Explanation
SEND_C, RECV_C Send and receive messages between processors.
READ_C, READ_R Read memory Command, and Response.
WRITE_C Write Command.
TST_SET_C Read-modify-write Command.
PROC_C Local processing Command.
USER_C User defined Command.
USER_R User defined Response.

Note that messages to memories are Commands (such as read requests); messages in response to commands are Responses.
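
For illustration only, a minimal workload body built on these commands might look like the following sketch; dest, bank, the addresses and the message sizes are placeholders invented for this example, not part of the testbed (the real workloads are described in the Workloads section):

void do_workload( int argc, char **argv, int index, int nprocs )
{
  int data[4];
  int dest = (index + 1) % nprocs;    /* destination processor (example only) */
  int bank = index % nprocs;          /* memory unit to read (example only)   */

  do_cmd(index, SEND_C, dest, 0, 4, data);   /* send a 4-word message             */
  do_cmd(index, READ_C, bank, 0, 1, data);   /* read one word from memory 'bank'  */
  do_cmd(index, PROC_C, 0,    0, 10, NULL);  /* ten cycles of local processing    */
}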

Memory units may be driven using:-

int  read_mem(int index, int a);
void write_mem(int index, int a, int val);
void do_mem_cmd(int index, int cmd, int pno, int addr, int size, int *data);

Workload specific timings may be obtained using:

double get_time(int index);

And results may be placed on the trace file using:-

void put_result(int index, char *string);
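
For example, a workload fragment could time an individual read and record it on the trace file as below; this is only a sketch, assuming get_time(index) returns the current simulated time seen by processor index, and the message text is invented for this example:

  char buf[64];
  double t0 = get_time(index);
  do_cmd(index, READ_C, pno, addr, 1, data);       /* the operation being timed  */
  sprintf(buf, "read latency %.1f", get_time(index) - t0);
  put_result(index, buf);                          /* record it on the trace file */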

Model and workload parameterisation.

Parameters for models and workloads are read from a file input.params at simulation startup. Each line is of the form value (name), as this fragment of a cache parameter file shows:

1024 (cache_size)
10 (cache_access_time)
5 (cache_tag_access_time)
10 (cache_upbus_time)
100 (cache_downbus_time)
1 (cache_upbus_width)
1 (cache_downbus_width)
2 (cache_assoc)
4 (cache_blk_size)
1 (cache_tagmems)

The simulation code accesses these parameters using code like:-

  int cache_size        = get_int_param(0,"cache_size");
  int cache_access_time = get_int_param(0,"cache_access_time");
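
The "value (name)" format is simple enough to parse with a single scanf pattern. The routine below is only an illustration of the format; the testbed's own accessor is get_int_param, whose implementation may differ, and lookup_int_param is a name invented for this sketch:

#include <stdio.h>
#include <string.h>

/* Illustrative only: look up an integer parameter in a "value (name)" file. */
int lookup_int_param(const char *filename, const char *name, int defval)
{
  FILE *fp = fopen(filename, "r");
  if (fp == NULL) return defval;

  double value;
  char   pname[128];
  int    result = defval;

  while (fscanf(fp, " %lf (%127[^)])", &value, pname) == 2) {
    if (strcmp(pname, name) == 0) { result = (int)value; break; }
  }
  fclose(fp);
  return result;
}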

Experimentation and graphing.

Support for experimentation is included in the testbed; more details are given in the experimentor user guide, and the code for the experimentation routines is part of the Hase++ simulation package.

Workloads

This section describes the workloads used to exercise the simulation models.

The workload is chosen using the workload parameter set in input.params.

A Random Workload

Each processor sends N fixed-length messages to randomly selected other processors.
Parameters
Behaviour
for (i=0; i<nmsgs; i++) {
  pno    = i % nprocs;                            /* destination processor       */
  do_cmd(index, SEND_C, pno, addr, size, data);   /* send a fixed-length message */
}
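
The fragment above cycles through destinations; a variant that actually picks random destinations (and avoids a processor sending to itself), closer to the description, might look like this sketch using rand() from <stdlib.h>:

for (i=0; i<nmsgs; i++) {
  pno = rand() % nprocs;                          /* random destination        */
  if (pno == index) pno = (pno + 1) % nprocs;     /* avoid sending to self     */
  do_cmd(index, SEND_C, pno, addr, size, data);
}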

A Cray Workload

This workload simulates a mix of the standard Cray network packets. Initially the workload is made up of remote read operations.

Parameters
Behaviour
Each processor performs N remote memory reads from different destinations.

The standard Cray T3D network operation PE_noncache_rd is used.

The memory is specialised to deal with the different remote memory operations supported by the Cray, i.e.:

enum { 
  PE_noncache_rd,  PE_noncache_rd_res,
  PE_wr_1word,     PE_wr_1word_res,
  PE_cache_rd,     PE_cache_rd_res,
  PE_wr_4word,     PE_wr_4word_res,
  BLT_rd_1word,    BLT_rd_1word_res,
  BLT_wr_1word,    BLT_wr_1word_res,
  BLT_blk_rd,      BLT_blk_rd_res,
  BLT_blk_wr,      BLT_blk_wr_res,
  pfetch_rd,       pfetch_rd_res,
  f_inc_rd,        f_inc_rd_res,
  f_inc_wr,        f_inc_wr_res,
  swap,            swap_res
};
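
How these operation codes travel through the generic interface is up to the Cray workload and its memory model; one plausible arrangement (an assumption for illustration, not taken from the testbed source) is to carry the operation code in the data of a USER_C command:

  int op[2];
  op[0] = PE_noncache_rd;        /* assumed: Cray operation code in data[0]   */
  op[1] = 0;                     /* assumed: space for the returned word      */
  do_cmd(index, USER_C, pno, addr, 2, op);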

Memory bank saturation

This workload includes a variable number of memory banks.
Parameters
Behaviour
for (i=0; i<nmsgs; i++) {
  pno    = index % memmodules;                    /* memory bank for this processor */
  do_cmd(index, READ_C, pno, addr, size, data);   /* read from the chosen bank      */
  do_cmd(index, PROC_C, 0, 0, 1, NULL);           /* one unit of local processing   */
}

Threaded workload

This workload performs a set of standard thread synchronisation operations.

Parameters
Behaviour
for (i=0; i<nmsgs; i++) {
  pno    = index % memmodules;
  do_cmd(index, READ_C, pno, addr, size, data);
  do_cmd(index, PROC_C, 0, 0, 1, NULL);
}

Followed by

  /* No contention mutex */
  /* Non local mutex with contention */
  /* Cross process mutex with contention */
  /* Mutex trylock */
  /* No contention reader lock */
  /* No contention writer lock */
  /* Reader trylock */
  /* Writer trylock */
  /* No contention semaphore */
  /* Semaphore trywait */
  /* Reference global variable */
  /* Get thread specific data */

The mutex tests are performed with the code:-

    // Get lock
    do {
      val = 1;
      do_cmd(index, TST_SET_C, mp, ma, 1, &val);
      tries ++;
    } while (val==0);
 
    // process
    do_cmd(index, PROC_C, 0, 0, 5, (int*)"got lock");
    
    // Release lock
    do_cmd(index, WRITE_C, mp, ma, 1, &zero_val);

Results

Cache test workload

This workload tests cache hit and miss operations.
Parameters
Behaviour
Fills the cache, performs nmsgs hits then nmsgs misses.
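
A sketch of the access pattern this implies, assuming single-word reads from memory 0 and the cache parameters from input.params (the exact addresses used by the real workload are not shown here):

  int word, a, i;

  for (a = 0; a < cache_size; a += cache_blk_size)            /* fill the cache */
    do_cmd(index, READ_C, 0, a, 1, &word);

  for (i = 0; i < nmsgs; i++)                                 /* nmsgs hits     */
    do_cmd(index, READ_C, 0, (i * cache_blk_size) % cache_size, 1, &word);

  for (i = 0; i < nmsgs; i++)                                 /* nmsgs misses   */
    do_cmd(index, READ_C, 0, cache_size + i * cache_blk_size, 1, &word);
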
Results

Models

Several network models were constructed for the testbed. The network can be selected from a bus, a crossbar and a multistage network; cache models and memories are also included in the system.

The overall parameters and default values are:-

Value Parameter name
1 interconnect
1 usecache
10 nprocs
10 cmd_startup
5 cmd_perword
100 bus_startup
10 bus_perword
100 mem_startup
100 mem_perword
1024 cache_size
10 cache_access_time
5 cache_tag_access_time
10 cache_upbus_time
100 cache_downbus_time
1 cache_upbus_width
1 cache_downbus_width
2 cache_assoc
4 cache_blk_size
1 cache_tagmems

The Bus

The bus entity is a passive router of commands; its delay parameters are Bus hold time and Bus time per word.

Parameters Bus hold time. Bus time per word. Applies to all commands.
Operations Always holds for Bus hold time.
  • SEND. Passes on.
  • READ. Passes to memory.
  • WRITE. Passes to memory.
  • TEST_SET. Passes to memory.
  • USER. Passes requests to memory, responses to processors.
Behaviour Acts as a passive router of commands. Holds for a message dependent time, then passes the command on. Only one command may occupy the bus at a time. Requests dealt with on a FCFS basis. No actual arbitration or priority apart from this.
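
The timing rule this amounts to can be sketched as follows; this is not the model source, and bus_deliver_time is a name invented for the sketch, but bus_startup and bus_perword are the parameters listed in the table above:

  double bus_startup   = 100;   /* bus_startup from input.params              */
  double bus_perword   = 10;    /* bus_perword from input.params              */
  double bus_free_time = 0.0;   /* time at which the bus next falls idle      */

  /* FCFS: a command arriving at time 'now' waits until the bus is free,
     then occupies it for bus_startup + bus_perword * size. */
  double bus_deliver_time(double now, int size)
  {
    double start = (now > bus_free_time) ? now : bus_free_time;
    bus_free_time = start + bus_startup + bus_perword * size;
    return bus_free_time;       /* time at which the command is passed on     */
  }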

The Crossbar

The crossbar entity has similar basic delay parameters to the bus but, unlike the bus, it never blocks itself.

Parameters Bus hold time. Bus time per word. Applies to all commands.
Operations
  • SEND. Passes on.
  • READ. Passes to memory.
  • WRITE. Passes to memory.
  • TEST_SET. Passes to memory.
  • USER. Passes requests to memory, responses to processors.
Behaviour Acts as a passive router of commands. Waits for each command to arrive, passes it on a message dependent time later. Never actually holds itself.

The Multistage Network

Parameters Bus hold time. Bus time per word. Applies to all commands.
Operations
  • SEND. Passes on.
  • READ. Passes to memory.
  • WRITE. Passes to memory.
  • TEST_SET. Passes to memory.
  • USER. Passes requests to memory, responses to processors.
Behaviour Acts as an array of crossbars.

Events from processors or memories are passed into the internal network of switches. When they emerge, they are sent on to the appropriate processor or memory.
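
The switch topology is not spelled out here; assuming an omega-style network of 2x2 crossbar switches with destination-tag routing (an assumption for illustration only), the output port taken at each stage is simply one bit of the destination address:

  /* Assumption: log2(nprocs) stages of 2x2 switches, destination-tag routing.
     Returns the output port (0 or 1) taken at the given stage. */
  int output_port(int dest, int stage, int nstages)
  {
    return (dest >> (nstages - 1 - stage)) & 1;
  }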

The Processor Node

Parameters
  • Command hold time. Command time per word. Applies to all commands.
  • Send hold time. Send time per word.
  • Recv hold time. Recv time per word.
  • Read hold time. Read time per word.
  • Write hold time. Write time per word.
  • Process cycle time.
  • User hold time.
Operations Always holds for Command hold time.
  • SEND. Holds for send hold time, then sends to network instantly.
  • RECV. Holds for recv hold time, waits for posted send, returns instantly.
  • READ. Holds for read hold time. Send request. Waits for reply. Returns instantly.
  • WRITE. Holds for write hold time, posts write, returns.
  • TEST_SET. Holds for read hold time. Send request. Waits for reply.
  • PROCESS. Holds for process cycle time times number of words.
  • USER. Holds for user hold time. Send request. Waits for reply. Returns instantly.
Overall operation User code is in:
  void do_workload( int argc, char **argv, int index, int nprocs );

This calls:

  void do_cmd(int index, int cmd, int pno, int addr, int size, int *data);

do_cmd holds for the command hold time, then behaves as described in the operations section above.

All hold delays are modelled as startup plus time per word. There is no extra delay on a read request returning.

The Memory

Parameters
  • Mem hold time. Mem time per word. Applies to all commands.
Operations Always holds for Memory hold time.
  • READ. Passes back contents.
  • WRITE. Sets contents.
  • TEST_SET. Sets and returns contents.
  • USER. Calls user function.
Overall operation Holds for Memory hold time. Performs operation, returns instantly. For user commands, provides read and write methods which operate instantaneously. User delays may be inserted by calling the hold method.

The Cache

Parameters
  • Access time.
  • Tag access time. Number of tag memories.
  • Write policy.
  • size - number of words in cache.
  • blk_size - number of words in block.
  • up_width, down_width - width of up/down buses in words.
  • up_cycle_time, down_cycle_time - speed of up/down buses.
Operations Performs tag look up.
  • RD_HIT. Passes back contents.
  • WR_HIT. Sets contents. If shared, invalidate others.
  • RD_MISS. Read block.
  • WR_MISS. Read block. Write contents.
  • INV_ADDR. Invalidate block.
Overall operation Waits for requests from processor ("up"). Looks up tags. If hit, returns data. If miss, sends request down.

Also receives invalidate requests from down.

Holds for calculated time, then sends.
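
The tag lookup can be pictured in terms of the cache parameters; the helpers below are a sketch (not the model source, and the names are invented for this example) of how a set index and tag follow from cache_size, cache_blk_size and cache_assoc, all measured in words:

  struct CacheGeometry {
    int blk_size;   /* cache_blk_size: words per block                 */
    int assoc;      /* cache_assoc:    blocks per set                  */
    int nsets;      /* cache_size / (blk_size * assoc)                 */
  };

  int set_index(const CacheGeometry &c, int addr)
  {
    return (addr / c.blk_size) % c.nsets;       /* set searched on a lookup      */
  }

  int cache_tag(const CacheGeometry &c, int addr)
  {
    return addr / (c.blk_size * c.nsets);       /* tag compared against the set  */
  }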


Results

Effect of cache on network performance

The following graphs show the effects of including a cache in the system for the different networks.


Bus

Total time for the workloads on a bus, with and without a cache.


Memory utilisations for the workloads on a bus, with and without a cache.


Crossbar

Total time for the workloads on a crossbar, with and without a cache.


Memory utilisations for the workloads on a crossbar, with and without a cache.


Multistage network

Total time for the workloads on a multistage network, with and without a cache.


Memory utilisations for the workloads on a multistage network, with and without a cache.


Summary graph


Total times for all workloads and networks, with and without cache.


Memory utilisations for all workloads and networks, with and without cache.


Scalability measures

The following graphs show the effects of varying the number of processors.

Total time for variable numbers of processors and the same workload (no. 5). Note that the bus time grows linearly with the number of processors, the crossbar time is constant, and the multistage network is slower for 4 processors than for 8, 12 or 16. This apparently anomalous result is caused by increased contention in the 4 processor case.

Average memory utilisation vs number of processors. Note the falloff for the bus, as the contention means that the memory units are not kept busy. Memory utilisation remains constant for the crossbar and multistage network (apart from the special case at 4 processors).

Timing diagrams.

Key for timing diagrams.


Top level timing diagram for 4 processors on a multistage network.

The timing diagram above shows the top level behaviour of a 4 processor multistage network system running workload 5. The one below shows the equivalent diagram for 8 processors.

Top level timing diagram for 8 processors on a multistage network.

The next two diagrams show the detail of the first few transactions. Note that the transaction time for the 8 processor case is shorter than for the 4 processor case as the traffic is split between two switches.

Detailed timing diagram of the first few transactions for 4 processors on a multistage network.

Detailed timing diagram of the first few transactions for 8 processors on a multistage network.


Summary

This section has described the interconnection network testbed in detail. The crucial aspect was the interface specification between workload and model. This interface was designed to be flexible, and a Cray T3D memory model was layered on top of it to demonstrate this flexibility. Results are available from the testbed in the form of graphs and timing diagrams; the scalability and the effects of caches on three networks were measured for the different workloads. Future development of the model could add more sophisticated shared-memory models and workloads; these extensions could be incorporated into the existing structure.



Fred Howell
Last modified: Tue Jul 14 1998