Microbenchmarking

What is microbenchmarking?

The idea of the `benchmarking' approach is to take a standard application and measure its performance on a range of machines. The value of this is that it provides a fixed point for comparing performance. The problem with it (from the point of view of system designers or people wanting to do performance prediction) is that the performance of a system running one application gives no clue as to the underlying reasons for that performance; for this, more detailed measurements are needed.

Microbenchmarking provides these measurements; rather than giving one number for overall performance, it measures each of the primitive operations which make up that performance. This complements standard benchmarking by providing more detailed performance information.

In this context `microbenchmarks' can be seen as the performance characterisation of the interface layer between the hardware and software of multiprocessor networks. Routines are described for measuring shared memory and message passing performance, as well as for calculating `microbenchmarks' analytically. The use of microbenchmarks in simulation is also described, both as the output of low level hardware models and as the input to high level software models.

Measurement

The `primitive operations' of a parallel machine depend on the programming model; the two most important models are message passing and shared memory.

Message passing

A message passing interface consists of a set of communications functions such as `send', `receive' and `broadcast'. The microbenchmark characterisation chosen consists of a table giving the performance of each communication function. The web page provides more details (and source code) of the measurement routines for the standard message passing interface MPI. An example is also available; the table below shows the microbenchmark obtained by running a characterisation routine across a network of workstations. In the fitted formulae, d is the message size and p is the number of processes; graphs of the data and the curve fits used are available via the web page.

MPI Function    Time (us)                       Goodness of fit (Q)
send            5000 + 0*d                      1
ssend           10000 + 0*d                     1
rsend           5000 + 4*d                      1
recv            8000 + 0*d                      1
recvmin         3000 + 80*d                     0.35
irecv1          70 + 0*d                        0.98
irecv2          9000 + 0*d                      0.99
irecvoverlap    5000 + 30*d                     0.88
sendrecv        10000 + 50*d                    1
pingpong        10000 + 50*d                    1
alltoall        0 + 4000*p^2 + 20*d             1
gather          0 + 30000*log(p) + 0*p^2*d      1
allgather       0 + 9000*p + 40*d               1
reduce          0 + 20000*log(p) + 9*d          1
allreduce       0 + 9000*p + 30*d               1
bcast           1000 + 600*p^2 + 1*p^2*d        1
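
As an illustration of how a row of this table is produced, a sketch of a pingpong timing loop is given below. The names and structure here are illustrative rather than the actual measurement code (which is available from the web page); timings for a range of message sizes d are fitted to the model a + b*d:-

  #include <mpi.h>

  /* Illustrative sketch: average time for one pingpong of d bytes.
     Assumes exactly two ranks take part. */
  double time_pingpong(int d, int iters, int rank, char *buf)
  {
    MPI_Status status;
    MPI_Barrier(MPI_COMM_WORLD);        /* start both ranks together */
    double e1 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
      if (rank == 0) {
        MPI_Send(buf, d, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, d, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
      } else {
        MPI_Recv(buf, d, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(buf, d, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    return (MPI_Wtime() - e1) / iters;
  }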

These routines were originally developed for software performance prediction; more details are available in the performance characterisation chapter of the User Guide, in (Fred Howell, Approaches to Parallel Performance Prediction, PhD Thesis, Dept. of Computer Science, University of Edinburgh, 1996), and in Reverse Profiling (presented at the 1st International Workshop on Parallel and Distributed Software Engineering, ICSE, Berlin, March 1996), which describes the MPI timing routines and a way of using the characterisation to generate performance predictions of MPI programs via the MPI profiling interface.

However, precisely because they are useful for software performance prediction, they are also useful as an output from simulation models.

Shared memory

Microbenchmark measuring routines were also developed for shared memory architectures, including the Solaris and POSIX threads libraries (for multiprocessor workstations), the Cray SHMEM library, and the Sequent shared memory model. The characterisation measures the performance of synchronisation operations (such as barriers, semaphores, process creation and joining), and tackles the thornier issue of measuring the performance of the shared memory system itself.

Characterising the performance of the memory system is difficult because of the number of levels in the hierarchy, and because of the possibility of both network and memory bank interference between different memory accesses.

Measuring the low level memory access times from a high level benchmark program is an intricate task, with many potential sources of error. The following sections describe the techniques used for measurements on the Sequent Symmetry programming model, POSIX and Solaris threads on multiprocessor workstations, the Cray T3D shared memory operations and NT threads.

Routines common to all

To measure memory access times for local and remote data, it is essential to know whether the data is stored in the local cache or in main memory. This requires devious programming. The basis of the measurements is a data buffer, int buf[csize]; with a size csize greater than the cache size.

A single pass through this array, reading every element, will flush out the old contents of the cache. If the array is at least twice the size of the cache, subsequent passes through the array will suffer continuous cache misses.

In addition to the cache size, the block size is also a parameter. The worst case involves stepping through buf with a step size equal to the block size.
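
A minimal sketch of such a flushing pass, assuming csize and blk_size are both counts of ints:-

  /* one pass through buf evicts the old cache contents; stepping by
     blk_size touches one word per cache block, the worst case for reads */
  for (int i = 0; i < csize; i += blk_size)
    accum += buf[i];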

It is possible to determine the cache size and the block size experimentally but this is not easy, since the times involved are so short, and there is the danger of process switching interfering with the timings. So these figures are left as parameters.
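
If one did want to estimate them experimentally, the usual approach is to sweep the working set size and look for the jump in average access time. A rough sketch, where time_read is a hypothetical helper returning the time per word in ns:-

  for (int size = 1024; size <= csize; size *= 2) {
    double ns = time_read(buf, size, iters);    /* hypothetical helper */
    printf("%8d ints: %6.1f ns/word\n", size, ns);
  }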

It is most attractive to make the measurements using a compiled language like C++, but the measurements are at the mercy of the code generator, so it is essential to check that the generated assembler contains only the required memory references.

The loop overheads interfere with the timings. This may be measured using a ``control'' empty loop. The impact may also be reduced by ``unrolling'' the loop to contain a sequence of memory operations.
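
A sketch of such a control loop follows; as noted above, the generated assembler must be checked to ensure the empty body is not optimised away:-

  double e1 = MPI_Wtime();
  for (int j = 0; j < nwords; j += 8)
    ;                                   /* empty body: loop overhead only */
  double loop_overhead = MPI_Wtime() - e1;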

The actual function used to measure the time introduces another variable. The timer resolution is typically 1us (or one tick of the processor clock), and the overhead of calling the timer may be severe.
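
The timer overhead itself can be estimated in the same spirit, by timing back-to-back calls:-

  double e1 = MPI_Wtime();
  for (int i = 0; i < 1000; i++)
    (void)MPI_Wtime();                  /* repeated calls to the timer */
  double timer_overhead = (MPI_Wtime() - e1) / 1000;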

The loop to time reading nwords words from main memory (unrolled eight times) is:-

    double e1 = MPI_Wtime();
    for (int j=0; j<nwords; j+=8, index += 8*blk_size) {
      accum += buf[index];              /* one read per cache block */
      accum += buf[index+blk_size];
      accum += buf[index+blk_size*2];
      accum += buf[index+blk_size*3];
      accum += buf[index+blk_size*4];
      accum += buf[index+blk_size*5];
      accum += buf[index+blk_size*6];
      accum += buf[index+blk_size*7];
    }
    time = MPI_Wtime() - e1;
The time returned will be composed of:

  {time to read nwords} + {loop overhead} + {timer overhead}
    + {indexing and adding overhead} + e

where e is a random error caused by context switching and virtual memory paging.

In order to extract the required time to read nwords, it is either necessary to subtract the overheads, or to ensure that they form an insignificant portion of the time. The indexing and adding overhead is small compared to the memory access time, and is required in most programs.
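
With the overheads measured as above, the net figure per word is then simply:-

  /* sketch: subtract the measured overheads from the raw time */
  double per_word = (time - loop_overhead - timer_overhead) / nwords;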

The above loop must be repeated (as shown below) for long enough to minimise the one off timer overhead (and to compensate for the timer's lack of resolution). This must be done with care to avoid spurious cache hits.

  double e1 = MPI_Wtime();
  for (int i=0; i<iters; i++) {
    /* index carries on advancing through the buffer, so successive
       passes do not re-use cached blocks */
    for (int j=0; j<nwords; j+=8, index += 8*blk_size) {
      accum += buf[index];
      accum += buf[index+blk_size];
      accum += buf[index+blk_size*2];
      accum += buf[index+blk_size*3];
      accum += buf[index+blk_size*4];
      accum += buf[index+blk_size*5];
      accum += buf[index+blk_size*6];
      accum += buf[index+blk_size*7];
    }
  }
  time = MPI_Wtime() - e1;
  time /= iters;                        /* average over the repeats */

It would be possible to flush the cache after each loop (by cycling through another cache-sized array). However this would take a while (seconds), and it is more appealing to just use separate areas of the cache.

Instruction caches

Instruction caches are not included in the measurements. Most systems have enough on-chip instruction cache to hold the short timing loops.

Memory modules and interleaving

Contention may occur within a network, or at the memory modules. Determining exactly where the contention is happening is very difficult.

Posix and Solaris threads

By default, global variables are shared between threads. Therefore, to arrange for threads to work independently on their own sets of data, it is necessary to allocate some data for each thread:-
struct local_data { 
  int index;          /* this thread's rank */
  int buf[csize];     /* per-thread data buffer */
};
A thread may access another thread's local data using an array of pointers:-
local_data ** remote_data;
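A read of a neighbour's buffer is then an ordinary memory reference; for example (assuming nprocs threads with ranks 0..nprocs-1):-
accum += remote_data[(rank+1) % nprocs]->buf[index];  /* neighbour's data */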
Each thread then runs its own independent function:-
void* thr_main(void *r)
{
  local_data *l = (local_data*)r;
  int rank = l->index;          /* this thread's rank */
  sem_wait(&set_going);         /* wait for the main thread's go-ahead */
  barrier();                    /* line up all threads before timing */
  time_all(rank);               /* run the full set of measurements */
  barrier();
  return NULL;
}
Synchronisation may be performed using Posix semaphores, together with pthread mutexes and condition variables:-
pthread_mutex_init(&mtx, NULL);                 /* mutexes */
pthread_mutex_lock(&mtx);
pthread_mutex_unlock(&mtx); 
pthread_cond_init(&cond, NULL);                 /* condition variables */
pthread_cond_wait(&cond, &mtx);
pthread_cond_broadcast(&cond);
sem_init(&set_going, 0, 0);                     /* counting semaphores */
sem_post(&set_going);
sem_wait(&set_going);
pthread_create(&threads[i], NULL, thr_main, (void*)l);
pthread_join(threads[i], (void**)&status);
thr_setconcurrency(nprocs);                     /* Solaris hint: number of LWPs */
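
The barrier() called from thr_main is not itself part of the Posix interface; a minimal counting-barrier sketch built from the mutex and condition variable above (using a generation counter to guard against spurious wakeups) might look like:-

static int arrived = 0, generation = 0;
void barrier(void)
{
  pthread_mutex_lock(&mtx);
  int gen = generation;
  if (++arrived == nprocs) {
    arrived = 0;                       /* last thread resets and releases */
    generation++;
    pthread_cond_broadcast(&cond);
  } else {
    while (gen == generation)          /* re-check after spurious wakeups */
      pthread_cond_wait(&cond, &mtx);
  }
  pthread_mutex_unlock(&mtx);
}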

Solaris Threads

Solaris threads are largely identical to Posix threads. The main difference concerns the separation of lightweight processes (LWPs: operating system constructs, and the unit of concurrency) and threads (user level constructs). Many threads may share an LWP, or there may be one thread per LWP. The advantage of this two layer model is that user level context switching is faster than operating system level switching; the disadvantage is the added complexity.
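
The Solaris calls mirror the Posix ones. For example (a sketch; THR_BOUND binds the new thread permanently to its own LWP, which is what the timing runs want):-

thread_t tid;
thr_create(NULL, 0, thr_main, (void*)l, THR_BOUND, &tid);
thr_join(tid, NULL, &status);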

Windows NT threads

Windows NT threads are (relatively) heavyweight operating system level objects. The startup costs are greater than for Posix threads.
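
A sketch of the equivalent NT calls (the thread function must use the WINAPI calling convention and return a DWORD):-

#include <windows.h>
DWORD WINAPI thr_main_nt(LPVOID r);       /* NT flavour of thr_main */
DWORD tid;
HANDLE h = CreateThread(NULL, 0, thr_main_nt, (LPVOID)l, 0, &tid);
WaitForSingleObject(h, INFINITE);         /* the NT analogue of a join */
CloseHandle(h);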

The Cray T3D

Shared memory on the Cray T3D differs from threads. Each processor runs a separate copy of the whole program, so N copies of main(), shown in the example below, are executed. Global variables are not shared; to access another processor's variables, it is necessary to use shmem_get and shmem_put. These can only be used on symmetric data objects, i.e. C and C++ data allocated by shmalloc. Mutual exclusion locks are simply symmetric variables of type long which are initialised to zero. A barrier operation is built in.

int buffer[csize];             /* symmetric: allocated on every PE */
int lbuf[csize];               /* local target buffer for shmem_get */
int main()
{
  nprocs = _num_pes();
  rank   = _my_pe();
  fill_buffer();
  if (rank==0) printf("Shared memory microbenchmarker (%d procs)\n", 
                      nprocs);
  barrier();
  time_all();
  barrier();
  if (rank==0) printf("Completed\n");
}
void time_read_remote(int nwords, int iters, double &time)
{
  double e1 = MPI_Wtime();
  for (int i=0; i<iters; i++) {
    shmem_get(lbuf,buffer+i,nwords,1);   /* read nwords from PE 1 */
  }
  time = MPI_Wtime() - e1;
}
extern "C" void barrier();
nprocs = _num_pes();
rank   = _my_pe();
shmem_get(lbuf,buffer+i,nwords,1);
shmem_put(lbuf,buffer+i,nwords,1);
shmem_clear_lock(long *lock);
shmem_set_lock(long *lock);
shmem_test_lock(long *lock);
shmem_wait( long *ivar, long cmp_value );
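
Since a lock is just a symmetric long initialised to zero, a critical section is, as a sketch:-

static long lock = 0;            /* symmetric: same address on every PE */
shmem_set_lock(&lock);
/* ... critical section ... */
shmem_clear_lock(&lock);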

Sequent Symmetry

The parallel programming model of the Sequent Symmetry is like a cross between those of the Cray and of threads. Only one instance of main is launched; an m_fork(func); is needed to start up the processes. By default, global data is not shared. Shared data must be enclosed within #pragma sequent_shared and #pragma sequent_shared_end, as shown in the example below. Synchronisation is performed using built in barriers and lock variables. Remote reads may be done by accessing the shmalloced data of another process.
#include <parallel/microtask.h>
#include <parallel/parallel.h>
#pragma sequent_shared
int nprocs;
#define CSIZE (1 << 16)
const int blk_size = 8;         /* Cache block size (# of ints) */
const int maxword = 1024;
int iters;
int smallsize,smalliters;
int medsize,mediters;
int largesize,largeiters;
#pragma sequent_shared_end
 
#pragma sequent_shared
local_data_s **remote_data;
#pragma sequent_shared_end
m_set_procs(nprocs);                    /* set process count before forking */
remote_data = (local_data_s**)shmalloc(sizeof(plocal_data) *nprocs);
s_init_barrier(&bp, nprocs);
m_fork(func);                           /* run func in nprocs processes */
m_kill_procs();                         /* tidy up the forked processes */
s_lock();                               /* spinlock operations */
s_unlock();

Measurement examples

Examples of results obtained by running the shared memory microbenchmarker for two threads on a workstation are given below.

   Microbenchmark    :      Time (ns)
                     :   avg    ( max  )
----------------------------------------
          empty loop :        6 (     6)
             timer() :        4 (     4)
             barrier :      263 (   264)
     read_localcache :       69 (    69)
  allread_localcache :       69 (    74)
          read_local :      328 (   328)
       allread_local :      350 (   362)
      read_neighbour :      350 (   350)
   allread_neighbour :      359 (   362)
    write_localcache :       11 (    11)
 allwrite_localcache :       11 (    11)
         write_local :      184 (   184)
      allwrite_local :      235 (   235)
     write_neighbour :      229 (   229)
  allwrite_neighbour :      218 (   218)

The figures given are average times in nanoseconds to perform the given tasks. Barrier, timer overhead and loop overhead times are given first. The figures afterwards are for reads and writes to local main memory, local cache, and the neighbour process's memory. The allread and allwrite measurements are taken when all processes are performing the operation concurrently; with the other measurements, only one process is active.

This set of measurements is just one of many possible sets which could have been carried out. On this machine (a 2-processor workstation with a shared bus) it makes no difference whether you read your own or your neighbour's section of shared memory, and local cache reads are about 7 times slower than local cache writes. These factors may be interesting for the program designer. But extracting the influence that the interconnect (in this case the Sparc MBus) has on these figures is not easy. A clue may come in the difference between the write and the allwrite figures: if it takes longer for all processes to write at the same time, one could conclude that contention was the cause. But this could just as well be memory bank contention as network contention, and there is no way to find out which is which without opening the computer box and placing logic analyser probes on the DRAM chips (or similarly instrumenting simulation code). Of course, the application programmer doesn't particularly care where contention occurs, only about the resulting degraded performance.

Detailed notes on how these measurement problems are addressed, together with download details for the software, are available on the shared memory microbenchmarker home page.

Presentation of results

One useful way to characterise the performance of operations is in the form of a logarithmic table, for example:-
Ten to the power minus (seconds):
        9
        8
        7       Cache read hit, Wr main (if buffered OK)
        6       Rd main memory
        5       daxpy100, fprintf
        4       daxpy1000
        3       thr_create, daxpy10000
        2       fopen, fclose
        1
        0

Routines were written to carry out these measurements for some common operations (using files, reading and writing memory, creating threads). The example below shows results obtained on a Sun SPARCstation-20:-

     Operation       : Order of magnitude (log10(time in secs))
-----------------------------------------
               fopen : -1
    fprintf (1 char) : -4
     printf (string) : -3
              fclose : -2
    rd main mem 8000 : -6
   rd cache mem 8000 : -7
    wr main mem 8000 : -6
   wr cache mem 8000 : -7
          thr_create : -3
            daxpy100 : -4
           daxpy1000 : -3
          daxpy10000 : -2

This format of presentation is more useful for rough performance estimation than the 'exact' timing of operations in nanoseconds, but less useful for comparing detailed contention issues.


Summary

This section has presented methods for measuring and estimating the microbenchmark performance of parallel machines, for both shared memory and message passing programming models. The memory subsystem is the most difficult part to characterise meaningfully; some of the reasons for this were discussed, and some attempts at solution were presented. Detailed notes and download details for the software are available on the shared memory microbenchmarker home page.

Microbenchmarks can be used in conjunction with simulation models in several ways: as output measurements from a network simulation model, as input feeding into a simulation of parallel software, and as a simplified workload. The next section considers the last of these approaches, using a microbenchmark style of interface between network models and workloads.



Fred Howell
Last modified: Tue May 26 1998