Issue date: | Tuesday 25th February |
Due date: | Friday 14th March |
The purpose of this fourth part of the coursework is to investigate the interface between compiler and architecture and see how they are inextricably linked.
You have been given a program called mxm, which computes the product of two matrices. You should now be familiar with its internal behaviour and the ways in which locality exists (or not!) within its pattern of memory references.
Recall that the nested loop structure for computing the matrix-matrix product is as shown below:
for (j = 0; j < m; j++)
    for (k = 0; k < n; k++)
        for (i = 0; i < l; i++)
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
We saw how a simple transformation for loop-invariant code motion could reduce the number of memory references (which, as a by-product, increases the miss rate, though not the total number of misses). This produced the optimized loop nest:
for (j = 0; j < m; j++)
    for (k = 0; k < n; k++) {
        r1 = b[j][k];
        for (i = 0; i < l; i++)
            c[i][k] = c[i][k] + a[i][j] * r1;
    }
We saw in stage C how locality varied inversely with the problem size, and that large arrays produced consistently bad behaviour (two Read misses followed by one Write hit, yielding a 66% miss rate for values of l > 128).
In this part of the assignment you are required to modify the algorithm in order to achieve the minimum possible miss rate. The aim of the task is to transform the loop structure in ways that could be automated and built into a compiler. You may include or exclude the loop-invariant code motion depending on the transformation being applied.
Your write-up should include at least the following, for each transformation.
For the tiling transformation, where the tile dimensions are a parameter to the program, you should experiment to find the best tile dimensions, and include your results in your write-up.
You should finish your writeup with your own conclusions on the effectiveness of these transformations. In particular you should comment on the ability to achieve miss rates below 20% even on very large arrays.
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        t[i][j] = t[i][j] + a[j][k];
In this loop we reference elements of a[j][k] in the order a[0][k], a[1][k], a[2][k], a[3][k], etc. This performs badly because the array is laid out in memory in the order a[0][0], a[0][1], a[0][2], a[0][3], etc. In this example we are likely to only access one word in each cache block. The situation can be improved by copying the kth column of the a[][] matrix into a temporary vector so that the elements we need are arranged in contiguous words of memory. Here is the transformed code:
for (j = 0; j < n; j++)
    tmp[j] = a[j][k];
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        t[i][j] = t[i][j] + tmp[j];
Notice how the act of writing to tmp[] brings the data into cache in the required form (assuming a policy of "allocate on write miss").
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        t[i] = t[i] + a[i][j];
The references to a[i][j] are ordered such that spatial locality will be poor. This is because the array is laid out in memory in the order a[0][0], a[0][1], a[0][2], a[0][3], etc. but accessed in the order a[0][0], a[1][0], a[2][0], etc.
As all computations within the loop bodies are independent of each other, we can perform those computations in any order we wish. The ordering is easily changed by interchanging the loops. Thus, the interchanged code would look like this:
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        t[i] = t[i] + a[i][j];
Now we can also see that the index of t[i] is loop invariant at the inner-loop level, and this means we can scalarize the access to that variable, thus:
for (i = 0; i < m; i++) {
    s = 0;
    for (j = 0; j < n; j++)
        s = s + a[i][j];
    t[i] = s;
}
Scalarization and loop interchange are completely independent transformations, but we can see here how applying one makes the other possible.
for (i = 0; i < m; i++)
    a[i] = b[i];
We can strip-mine this loop into sections of size VL by the following transformation (the min() guard handles the final partial section when m is not a multiple of VL):

for (j = 0; j < m; j += VL)
    for (i = j; i < min(j + VL, m); i++)
        a[i] = b[i];
This can be extended into two dimensions quite naturally. Consider the following doubly-nested loop.
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[i][j];
This can be broken down into groups of iterations of size TI and TJ (effectively rectangular sections of the iteration space). The resulting transformation looks like this:
for (ti = 0; ti < m; ti += TI)
    for (tj = 0; tj < n; tj += TJ)
        for (i = ti; i < min(ti + TI, m); i++)
            for (j = tj; j < min(tj + TJ, n); j++)
                a[i][j] = b[i][j];
Needless to say, this is best understood using a graphical representation of the iteration space! If it helps, try constructing such a representation.
The translation was initiated by Nigel Topham on 6/25/1998