Microbenchmarking provides these measurements: rather than giving a single number for overall performance, it measures each of the primitive operations that make up that performance. This complements standard benchmarking by providing more detailed performance information.
In this application `microbenchmarks' can be seen as the performance characterisation of the interface layer between hardware and software of multiprocessor networks. Routines are described for measuring shared memory and message passing performance, as well as for calculating `microbenchmarks' analytically. The use of microbenchmarks in simulation is also described, both as the output of low level hardware models and as the input to high level software models.
Measurement
The `primitive operations' of a parallel machine depend on the programming
model; the two most important models are message passing and shared
memory.
| MPI Function | Time (us) | Goodness of fit (Q) |
|---|---|---|
| send         | 5000 + 0*d                 | 1    |
| ssend        | 10000 + 0*d                | 1    |
| rsend        | 5000 + 4*d                 | 1    |
| recv         | 8000 + 0*d                 | 1    |
| recvmin      | 3000 + 80*d                | 0.35 |
| irecv1       | 70 + 0*d                   | 0.98 |
| irecv2       | 9000 + 0*d                 | 0.99 |
| irecvoverlap | 5000 + 30*d                | 0.88 |
| sendrecv     | 10000 + 50*d               | 1    |
| pingpong     | 10000 + 50*d               | 1    |
| alltoall     | 0 + 4000*p^2 + 20*d        | 1    |
| gather       | 0 + 30000*log(p) + 0*p^2*d | 1    |
| allgather    | 0 + 9000*p + 40*d          | 1    |
| reduce       | 0 + 20000*log(p) + 9*d     | 1    |
| allreduce    | 0 + 9000*p + 30*d          | 1    |
| bcast        | 1000 + 600*p^2 + 1*p^2*d   | 1    |

Here d is the message size and p is the number of processes.
These routines were originally developed for software performance prediction; more details are available in the User Guide, in the performance characterisation chapter of (Fred Howell, Approaches to Parallel Performance Prediction, PhD Thesis, Dept. of Computer Science, University of Edinburgh, 1996), and in Reverse Profiling, a paper presented at ICSE, 1st International Workshop on Parallel and Distributed Software Engineering, Berlin, March 1996. The paper describes the MPI timing routines and a way of using the characterisation to generate performance predictions of MPI programs via the MPI profiling interface.
However, precisely because they are useful for software performance prediction, they are also useful as an output from simulation models.
Characterising the performance of the memory system is difficult because of the number of levels in the hierarchy, and because of the possibility of both network and memory bank interference between different memory accesses.
Measuring the low level memory access times from a high level benchmark program is an intricate task, with many potential sources of error. The following sections describe the techniques used for measurements on the Sequent Symmetry programming model, POSIX and Solaris threads on multiprocessor workstations, the Cray T3D shared memory operations, and NT threads.
A single pass through this array, reading every element, will flush out the old contents of the cache. If the array is at least twice the size of the cache, subsequent passes through the array will suffer continuous cache misses.
In addition to the cache size, the block size is also a parameter. The worst case involves stepping through buf with a step size equal to the block size.
It is possible to determine the cache size and the block size experimentally, but this is not easy: the times involved are very short, and there is the danger of process switching interfering with the timings. These figures are therefore left as parameters.
It is most attractive to make the measurements using a compiled language like C++, but the measurements are then at the mercy of the code generator, so it is essential to check that the generated assembler contains only the required memory references.
The loop overheads interfere with the timings. They may be measured using a ``control'' empty loop, and their impact may also be reduced by ``unrolling'' the loop to contain a sequence of memory operations.
The actual function used to measure the time introduces another variable. The timer resolution is typically 1us, or one tick of the processor clock, and the overhead of calling the timer may be severe.
The loop to time reading nwords words from main memory (unrolled eight times) is:-

double e1 = MPI_Wtime();
for (int j=0; j<nwords; j+=8, index += 8*blk_size) {
    accum += buf[index];
    accum += buf[index+blk_size];
    accum += buf[index+blk_size*2];
    accum += buf[index+blk_size*3];
    accum += buf[index+blk_size*4];
    accum += buf[index+blk_size*5];
    accum += buf[index+blk_size*6];
    accum += buf[index+blk_size*7];
}
time = MPI_Wtime() - e1;

The time returned will be composed of:
{time to read nwords} + {loop overhead} + {timer overhead} + {indexing and adding overhead} + e
where e is a random error caused by context switching and virtual memory paging.
In order to extract the required time to read nwords, it is either necessary to subtract the overheads, or to ensure that they form an insignificant portion of the time. The indexing and adding overhead is small compared to the memory access time, and is required in most programs.
The above loop must be repeated (as shown below) for long enough to minimise the one-off timer overhead (and to compensate for the timer's limited resolution). This must be done with care to avoid spurious cache hits.
double e1 = MPI_Wtime();
for (int i=0; i<iters; i++) {
    for (int j=0; j<nwords; j+=8, index += 8*blk_size) {
        accum += buf[index];
        accum += buf[index+blk_size];
        accum += buf[index+blk_size*2];
        accum += buf[index+blk_size*3];
        accum += buf[index+blk_size*4];
        accum += buf[index+blk_size*5];
        accum += buf[index+blk_size*6];
        accum += buf[index+blk_size*7];
    }
}
time = MPI_Wtime() - e1;
time /= iters;
It would be possible to flush the cache after each loop (by cycling through another cache-sized array). However this would take a while (seconds), and it is more appealing to just use separate areas of the cache.
struct local_data {
    int index;
    int buf[csize];
};

A thread may access another thread's local data using an array of pointers:-

local_data ** remote_data;

Each thread then runs its own independent function:-

void* thr_main(void *r) {
    local_data *l = (local_data*)r;
    int rank = (int)l->index;
    sem_wait(&set_going);
    barrier();
    time_all(rank);
    barrier();
    return NULL;
}

Synchronisation may be performed using POSIX semaphores together with pthread mutexes and condition variables:-

pthread_mutex_init(&mtx, NULL);
pthread_mutex_lock(&mtx);
pthread_mutex_unlock(&mtx);
pthread_cond_init(&cond, NULL);
pthread_cond_wait(&cond, &mtx);
pthread_cond_broadcast(&cond);
sem_init(&set_going, 0, 0);
sem_post(&set_going);
sem_wait(&set_going);
pthread_create(&threads[i], NULL, thr_main, (void*)l);
pthread_join(threads[i], (void**)&status);
thr_setconcurrency(nprocs);
int buffer[csize];
int lbuf[csize];

main() {
    nprocs = _num_pes();
    rank = _my_pe();
    fill_buffer();
    if (rank==0) printf("Shared memory microbenchmarker (%d procs)\n", nprocs);
    barrier();
    time_all();
    barrier();
    if (rank==0) printf("Completed\n");
}

void time_read_remote(int nwords, int iters, double &time) {
    double e1 = MPI_Wtime();
    for (int i=0; i<iters; i++) {
        shmem_get(lbuf, buffer+i, nwords, 1);
    }
    time = (MPI_Wtime() - e1) / iters;
}

The Cray T3D shared memory operations used include:-

extern "C" void barrier();
nprocs = _num_pes();
rank = _my_pe();
shmem_get(lbuf, buffer+i, nwords, 1);
shmem_put(lbuf, buffer+i, nwords, 1);
shmem_clear_lock(long *lock);
shmem_set_lock(long *lock);
shmem_test_lock(long *lock);
shmem_wait(long *ivar, long cmp_value);
#include <parallel/microtask.h>
#include <parallel/parallel.h>

#pragma sequent_shared
int nprocs;
#define CSIZE (1 << 16)
const int blk_size = 8;   /* Cache block size (# of ints) */
const int maxword = 1024;
int iters;
int smallsize, smalliters;
int medsize, mediters;
int largesize, largeiters;
#pragma sequent_shared_end

#pragma sequent_shared
local_data_s **remote_data;
#pragma sequent_shared_end

m_set_procs(nprocs);
remote_data = (local_data_s**)shmalloc(sizeof(plocal_data) * nprocs);
s_init_barrier(&bp, nprocs);
m_fork(func);
m_kill_procs();
s_lock();
s_unlock();
Microbenchmark      : Time (ns) : avg ( max )
---------------------------------------------
empty loop          :   6 (   6)
timer()             :   4 (   4)
barrier             : 263 ( 264)
read_localcache     :  69 (  69)
allread_localcache  :  69 (  74)
read_local          : 328 ( 328)
allread_local       : 350 ( 362)
read_neighbour      : 350 ( 350)
allread_neighbour   : 359 ( 362)
write_localcache    :  11 (  11)
allwrite_localcache :  11 (  11)
write_local         : 184 ( 184)
allwrite_local      : 235 ( 235)
write_neighbour     : 229 ( 229)
allwrite_neighbour  : 218 ( 218)
The figures given are average times in nanoseconds to perform the given tasks. Barrier, timer overhead and loop overhead times are given first. The figures afterwards are for reads and writes to local main memory, local cache, and the neighbour process's memory. The allread and allwrite measurements are taken when all processes are performing the operation concurrently; with the other measurements, only one process is active.
This set of measurements is just one of many possible sets which could have been carried out. On this machine (a 2-processor workstation with a shared bus) it makes no difference whether you read your own or your neighbour's section of shared memory, and local cache reads are about six times slower than local cache writes. These factors may be interesting for the program designer. But extracting the influence that the interconnect (in this case the Sparc MBus) has on these figures is not easy. A clue may come from the difference between the write and the allwrite figures: if it takes longer when all processes write at the same time, one could conclude that contention is the cause. But this could as well be memory bank contention as network contention, and there is no way to find out which is which without opening the computer box and placing logic analyser probes on the DRAM chips (or similarly instrumenting simulation code). Of course, the application programmer doesn't particularly care where contention occurs, only about the resulting degraded performance.
Addressing the measurement problems:
void flush_cache(int rank) {
    int accum = 0;
    for (int i=0; i<csize; i+=blk_size) {
        accum += remote_data[rank]->buf[i];
    }
}

This steps through the entire cache, reading one word per cache line, to ensure that the cache starts in a known state.
Detailed notes and download details for the software are available on the shared memory microbenchmarker home page.
[Figure: operations placed on a log10 timescale from 10^-9 s upwards, ranging from cache read hits and buffered main memory writes at the fast end, through main memory reads, daxpy and fprintf, to thr_create, fopen and fclose at the slow end.]
Routines were written to carry out these measurements for some common operations (using files, reading and writing memory, creating threads). The example below shows results obtained from measurements on a Sun SPARCstation-20:-
Operation         : Order of magnitude (log10(time in secs))
------------------------------------------------------------
fopen             : -1
fprintf (1 char)  : -4
printf (string)   : -3
fclose            : -2
rd main mem 8000  : -6
rd cache mem 8000 : -7
wr main mem 8000  : -6
wr cache mem 8000 : -7
thr_create        : -3
daxpy100          : -4
daxpy1000         : -3
daxpy10000        : -2
This format of results presentation is more useful for rough performance estimation than the 'exact' timing of operations in nanoseconds - but less useful for comparing detailed contention issues.
Microbenchmarks can be used in conjunction with simulation models in several ways: as measurements taken from a network simulation model, as input feeding into a simulation of parallel software, and as a simplified workload. The next section considers the last of these approaches, using a microbenchmark style of interface between network models and workloads.