CS3 Computer Architecture Assignment 1997
Stage C - ``Caching Large Data Structures''

Issue date: Friday 7th February

Due date: Monday 24th February

The purpose of this third part of the coursework is to evaluate the behaviour of caches on applications which use a significant amount of data, and which ``stress'' the cache to the limit. Hopefully you will see that naive use of caches, typically algorithms which access large data structures in naive ways, can lead to serious performance problems.

The Exercise

You are given a program called mxm, which computes the product of two matrices. You are to simulate the behavour of this program for a variety of matrix sizes and see how the cache behaviour varies as a function of one particular array dimension.

The Test Program

If we have two matrices a and b, which are of dimensions $l \times m$ and $m \times n$ respectively, then the product of these matrices, c, will be a $l \times n$ matrix whose values are defined by equation 1.

$\begin{displaymath} c_{i,k} = \sum_{j = 0}^{m-1} a_{i,j}.b_{j,k}\end{displaymath}$ (1)

where $i \in (0,l-1)$ and $k \in (0,n-1)$ .

There are a number of ways of computing the values of c, and the one chosen for this practical computes the elements of c using the nested loop structure shown below:

    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++)
        for (i = 0; i < l; i++)
          c[i][k] = c[i][k] + a[i][j] * b[j][k];

Now, there ought to be plenty of locality in this algorithm as each element of a and b is read m times, and each element of c is read and written m times. However, spatial locality on the a matrix is actually quite poor, and temporal locality on the c matrix is also poor.

Question 1:

Explain why there is little spatial locality within the accesses to a, and why there is little temporal locality within the accesses to c.

You may have noticed by now that the reference to b[j][k] is a loop invariant expression at the inner-most loop. Consequently, we can hoist it to an outer loop and keep the value we need within a scalar variable. The effect will be to assign the current value of b[j][k] to a register, and thus avoid one load instruction within the inner loop. The resulting code looks like this (assuming a suitable declaration for r1).

    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++){
        r1 = b[j][k];
        for (i = 0; i < l; i++)
          c[i][k] = c[i][k] + a[i][j] * r1;
      }

Question 2:

How does the cache miss rate vary as we alter the value of l?

To answer this you will need to perform a set of seven simulations using dinero. Take a copy of all the files located in directory:

   ~cs3/comparch/Coursework/src/stageC

You will find a Makefile which is already set up to compile the matrix multiply program. The program itself is in four parts: there are two header files (data.h and trace.h) and two C source files (mxm.c and main.c). None of these files should need to be modified for this stage of the assignment, although you will be required to modify mxm.c for stage D. The mxm.c file contains a single routine called mxm() which actually performs the multiplication. main(), in main.c, reads the command line parameters, initializes the arrays, calls mxm() and then checks that the results are numerically correct.

The command line parameters for the mxm program are:-

-l: sets the value of l
-m: sets the value of m
-n: sets the value of n
-q: (quiet) if present, the program will not generate any trace output for dinero
-s: (show) if present, the program will print out the results in the c matrix upon completion.

So, to compute the product of a $128 \times 32$ matrix and a $32 \times 16$ matrix, one would type:

     mxm -l128 -m32 -n32

The initilization routine fills the a and b matrices with pseudo-random floating point values prior to the computation, and after computing the product, the result is checked against a ``standard'' algorithm (this will only become significant in the final stage of the assignment).

The output of mxm can be piped directly into dinero. If you inspect mxm.c you will find that there are calls to a Trace() macro which effectively computes the address of each array element accessed and prints this to the standard output stream in dinero format. If the -q flag is present on the command line no such output is produced.

You should assume that the initial cache parameters are those given in table 1.

**Table 1:** Initial cache parameters
Parameter	Value	Units
Block size	16	bytes
Associativity	1
Total capacity	4096	bytes
Write policy	CB
Write allocation policy	WA
Replacement algorithm	LRU

You should simulate for m=n=16 and allow l to take on values from the set $\{ 16,32,64,128,256,512,1024\}$ .Your results should be tabulated (at least) but preferably illustrated in the form of a graph.

Explain the data miss rate behaviour you observe. In particular, explain why the miss rate starts to level out for very large matrices.

Question 3:

How does the cache miss rate vary as a function of the degree of associativity in the cache?

To answer this, I suggest that you take three data points from question 2 (say l=16,128,1024) and for each of them simulate with 2-way, and 4-way set-associativity (you should already have the results for a direct-mapped cache).

Question 4:

Explain why changing the associativity alters the miss rate, as seen in the previous question?

Question 5:

What conclusions would you draw about the effectiveness of caches on large data structures?

Question 6:

Can you suggest any way in which the miss rate can be brought within a reasonable range (say less than 20%) for programs such as this?

Format of your submission

Your submission for this part of the assignment should consist of a short report, between 2 and 4 pages in length, which is likely to include a number of graphs:

Your submission should not only describe, but also explain, the behaviour of the cache.

About this document ...

This document was generated using the LaTeX2HTML translator Version 97.1 (release) (July 13th, 1997)

The command line arguments were:
latex2html -split 1 partIII.

The translation was initiated by Nigel Topham on 6/25/1998

Nigel Topham
6/25/1998