
CS3 Computer Architecture Assignment 1998
Stage D - "Program Transformations"

Issue date: Tuesday 25th February
Due date: Friday 14th March

The purpose of this fourth part of the coursework is to investigate the interface between compiler and architecture and see how they are inextricably linked.

The Exercise

You have been given a program called mxm, which computes the product of two matrices. You should now be familiar with its internal behaviour and the ways in which locality exists (or not!) within its pattern of memory references.

Recall that the nested loop structure for computing the matrix-matrix product is as shown below:

    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++){
        for (i = 0; i < l; i++)
          c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }

We saw how a simple transformation for loop-invariant code motion could reduce the number of memory references (which, as a by-product, increases the miss rate, though not the total number of misses). This produced the optimized loop nest:

    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++){
        r1 = b[j][k];
        for (i = 0; i < l; i++)
          c[i][k] = c[i][k] + a[i][j] * r1;
      }

We saw in stage C how locality varied inversely with the problem size, and that large arrays produced consistently bad behaviour (two Read misses followed by one Write hit, yielding a 66% miss rate for values of l > 128).

In this part of the assignment you are required to modify the algorithm in order to achieve the minimum possible miss rate. The aim of the task is to transform the loop structure in ways that could be automated and built into a compiler. You may include or exclude the loop-invariant code motion depending on the transformation being applied.

The Task

You are to produce a new version of mxm by transforming the mxm() routine. The goal of the transformation is a reduction in the total cache miss rate. You are given three transformations to evaluate: data transposition, loop interchange and tiling. You must apply each of these to the loop nest in the mxm() routine. For each transformation you must simulate the same set of conditions stipulated in Question 2 of stage C (i.e. investigate cache miss rate as a function of problem size).

Your write-up should include at least the following, for each transformation.

For the tiling transformation, where the tile dimensions are a parameter to the program, you should experiment to find the best tile dimensions, and include your results in your write-up.

You should finish your writeup with your own conclusions on the effectiveness of these transformations. In particular you should comment on the ability to achieve miss rates below 20% even on very large arrays.

Catalogue of Transformations

The idea behind the transformations is to perform the same numerical computations but with the elements of the computation performed in a different order, or with the operands located in different memory locations.

Data transposition

Whenever we have a sequence of accesses to a set of data elements that are not located in consecutive words in memory, we run the risk of failing to extract all possible spatial locality. We may also use much more cache space than is necessary. Consider the following loop nest:

    for (i = 0; i < m; i++)
       for (j = 0; j < n; j++)
          t[i][j] = t[i][j] + a[j][k];

In this loop we reference elements of a[j][k] in the order a[0][k], a[1][k], a[2][k], a[3][k], etc. This performs badly because the array is laid out in memory in the order a[0][0], a[0][1], a[0][2], a[0][3], etc. In this example we are likely to access only one word in each cache block. The situation can be improved by copying the kth column of the a[][] matrix into a temporary vector, so that the elements we need are arranged in contiguous words of memory. Here is the transformed code:

    for (j = 0; j < n; j++)
       tmp[j] = a[j][k];
    for (i = 0; i < m; i++)
       for (j = 0; j < n; j++)
          t[i][j] = t[i][j] + tmp[j];

Notice how the act of writing to tmp[] brings the data into cache in the required form (assuming a policy of "allocate on write miss").

Loop interchange

Consider the following loop:

    for (j = 0; j < n; j++)
       for (i = 0; i < m; i++)
          t[i] = t[i] + a[i][j];

The references to a[i][j] are ordered such that spatial locality will be poor. This is because the array is laid out in memory in the order a[0][0], a[0][1], a[0][2], a[0][3], etc. but accessed in the order a[0][0], a[1][0], a[2][0], etc.

As all n × m executions of the loop body are independent of each other, we can perform those computations in any order we wish. Ordering is easily changed by interchanging the loops. Thus, the interchanged code would look like this:

    for (i = 0; i < m; i++)
       for (j = 0; j < n; j++)
          t[i] = t[i] + a[i][j];

Now we can also see that the reference t[i] is loop invariant at the inner-loop level, and this means we can scalarize the access to that variable, thus:

    for (i = 0; i < m; i++){
       s = 0;
       for (j = 0; j < n; j++)
          s = s + a[i][j];
       t[i] = s;
    }

Scalarization and loop interchange are independent transformations, but this example shows how applying one can enable the other: before the interchange, t[i] was not invariant in the inner loop.

Tiling

The tiling transformation involves a sub-division of the iteration space in two or more dimensions. Let's first consider how we can sub-divide the iteration space of a single dimension: consider the loop below.

    for (i = 0; i < m; i++)
       a[i] = b[i];

We can strip-mine this loop into sections of size VL by the following transformation (assuming, for simplicity, that VL divides m exactly; otherwise the inner bound should be min(j+VL, m), as in the tiled example below):

    for (j = 0; j < m; j += VL)
       for (i = j; i < j+VL; i++)
          a[i] = b[i];

This can be extended into two dimensions quite naturally. Consider the following doubly-nested loop.

    for (i = 0; i < m; i++)
       for (j = 0; j < n; j++)
          a[i][j] = b[i][j];

This can be broken down into groups of iterations of size TI by TJ (effectively rectangular tiles of the iteration space). The resulting transformation looks like this:

    for (ti = 0; ti < m; ti += TI)
       for (tj = 0; tj < n; tj += TJ)
          for (i = ti; i < min(ti + TI, m); i++)
             for (j = tj; j < min(tj + TJ, n); j++)
                a[i][j] = b[i][j];

Needless to say, this is best understood using a graphical representation of the iteration space! If it helps, try constructing such a representation.

Implementation Note:


The version of main.c which is available for this stage of the practical has two additional command line parameters defined:

    -tx    tile size in the row dimension
    -ty    tile size in the column dimension

Each should be followed by a positive integer. They set the global integer variables tileX and tileY respectively. You may use those variables in place of the TI and TJ variables shown in the transformation above.

Optional Part

If you are feeling adventurous, you may wish to try re-writing the matrix multiply routine using a completely different algorithm. Clearly this sort of activity is not likely to be integrated within a compiler, but you may find it interesting to pit your skills and knowledge of algorithms against the mechanical transformations given above!

Format of your submission

Your submission for this part of the assignment should consist of a short report, between 2 and 4 pages of text. You may include as many short code listings as you need. A graphical comparison of transformed loops versus original loop would be appropriate.

About this document ...

This document was generated using the LaTeX2HTML translator Version 97.1 (release) (July 13th, 1997)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 1 partIV.

The translation was initiated by Nigel Topham on 6/25/1998

