

23  Using LoopReport


The Fortran MP and MP C compilers automatically parallelize loops that they determine safe and profitable to parallelize. LoopReport is a performance analysis tool that reads loop timing files created by these compilers.

This chapter is organized as follows:

Basic Concepts
Setting Up Your Environment
Creating a Loop Timing File
Starting LoopReport
Other Compilation Options
Fields in the Loop Report
Understanding Compiler Hints
Compiler Optimizations and How They Affect Loops


Basic Concepts

LoopReport is the command-line version of LoopTool. It produces an ASCII report of loop times.

Using LoopReport is similar to using gprof: the three major steps are compile, run, and analyze.


Note - The following examples use the Fortran MP (f77) compiler. The options shown (such as -xparallel and -Zlp) also work for MP C.


Setting Up Your Environment

1. Before compiling, set the environment variable PARALLEL to the number of processors on your machine.

The following command makes use of psrinfo, a system utility. Note the backquotes:

% setenv PARALLEL `/usr/sbin/psrinfo | wc -l`


Note - If you have installed LoopReport in a non-default directory, substitute that path for the one shown here.

2. Before starting LoopReport, make sure the environment variable XUSERFILESEARCHPATH is set:

% setenv XUSERFILESEARCHPATH \
/opt/SUNWspro/lib/sunpro_defaults/looptool.res

3. Set LD_LIBRARY_PATH.

If you are running Solaris 2.5:

% setenv LD_LIBRARY_PATH /usr/dt/lib:$LD_LIBRARY_PATH

If you are running Solaris 2.4:

% setenv LD_LIBRARY_PATH \
 /opt/SUNWspro/Motif_Solaris24/dt/lib:$LD_LIBRARY_PATH

You may want to put these commands in a shell start-up file (such as .cshrc or .profile).


Creating a Loop Timing File

To compile for automatic parallelization, typical compilation options are -xparallel and -xO4. To compile for LoopReport, add -Zlp, as shown in the following example:

% f77 -xO4 -xparallel -Zlp source_file


Note - All examples apply to Fortran 77, Fortran 90, and C programs.

After compiling with -Zlp, run the instrumented executable. This creates the loop timing file, program.looptimes. LoopReport processes two files: the instrumented executable and the loop timing file.


Starting LoopReport

When it starts up, LoopReport expects to be given the name of your program. Type loopreport followed by the name of the executable you want to examine:

% loopreport program

You can also start LoopReport without specifying a program. If you invoke LoopReport without the name of a program, it looks for a file named a.out in the current working directory:

% loopreport > a.out.loopreport

You can also direct the output into a file, or pipe it into another command:

% loopreport program  >  program.loopreport
% loopreport program | more

Timing File

LoopReport also reads the timing file associated with your program. The timing file is created when you compile with the -Zlp option, and contains information about the program's loops. Typically, this file has a name of the form program.looptimes, and is found in the same directory as your program.

However, there are four ways to specify the location of a timing file; LoopReport chooses among them in a fixed order of precedence.


Other Compilation Options

To compile for automatic parallelization, typical compilation switches are -xparallel and -xO4. To compile for LoopReport, add -Zlp.

% f77 -xO4 -xparallel -Zlp source_file

There are several other useful options for examining and parallelizing loops.

Option          Effect

-o program      Renames the executable to program
-xexplicitpar   Parallelizes loops marked with the DOALL pragma
-xloopinfo      Prints hints to stderr, for redirection to files
Either -xO3 or -xO4 can be used with -xparallel. If you don't specify -xO3 or -xO4 but you do use -xparallel, then -xO3 is added. The table below summarizes how switches are added.

You type:              Bumped up to:

-xparallel             -xparallel -xO3
-xparallel -Zlp        -xparallel -xO3 -Zlp
-xexplicitpar          -xexplicitpar -xO3
-xexplicitpar -Zlp     -xexplicitpar -xO3 -Zlp
-Zlp                   -xdepend -xO3 -Zlp

The -xexplicitpar and -xloopinfo options have specific applications.

-xexplicitpar

The Fortran MP compiler switch -xexplicitpar is used with the pragma DOALL. If you insert DOALL before a loop in your source code, you are explicitly marking that loop for parallelization. The compiler will parallelize this loop when you compile with -xexplicitpar.

The following code fragment shows how to mark a loop explicitly for parallelization.

      subroutine adj(a,b,c,x,n)
      real*8 a(n), b(n), c(-n:0), x
      integer n
c$par DOALL
      do 19 i = 1, n*n
         do 29 k = i, n*n
            a(i) = a(i) + x*b(k)*c(i-k)
 29      continue
 19   continue
      return
      end

When you use -Zlp by itself, -xdepend and -xO3 are added. The -xdepend switch instructs the compiler to perform the data dependency analysis that it needs to identify loops. The -xparallel switch includes -xdepend, but -xdepend does not imply (or trigger) -xparallel.

-xloopinfo

The -xloopinfo option prints hints about loops to stderr (the UNIX standard error file, on file descriptor 2) when you compile your program. The hints include the routine names, line number for the start of the loop, whether the loop was parallelized, and, if appropriate, the reason it was not parallelized.

The following example redirects hints about loops in the source file gamteb.F to the file named gamteb.loopinfo.

% f77 -xO3 -parallel -xloopinfo -Zlp gamteb.F 2> gamteb.loopinfo

The main difference between -Zlp and -xloopinfo is that in addition to providing you with compiler hints about loops, -Zlp also instruments your program so that timing statistics are recorded at runtime. For this reason, also, LoopReport analyzes only programs that have been compiled with -Zlp.

Figure 23-1  Sample Loop Report


Fields in the Loop Report

The loop report contains the following information:

  • Parallelized by the compiler?

    Y means that this loop was marked for parallelization; N means that the loop was not.

  • Number of times this loop was entered from above

    This is distinct from the number of loop iterations, which is the total number of times a loop executes. For example, consider these two nested loops in Fortran:

    do 10 i=1,17
       do 10 j=1,50
          ...some code...
 10    continue

    The first loop is entered once, and it iterates 17 times. The second loop is entered 17 times, and it iterates 17*50 = 850 times.

  • Nest

    Nesting level of the loop. If a loop is a top-level loop, its nesting level is 0. If the loop is the child of another loop, its nesting level is 1.

    For example, in this C code, the i loop has a nesting level of 0, the j loop has a nesting level of 1, and the k loop has a nesting level of 2.

    for (i=0; i<17; i++)
        for (j=0; j<42; j++)
            for (k=0; k<1000; k++)
                do_something();
    

  • Wallclock

    The total amount of elapsed wallclock time spent executing this loop for the whole program. The elapsed time for an outer loop includes the elapsed time for an inner loop. For example:

    for (i=1; i<10; i++)
        for (j=1; j<10; j++)
            do_something();
    

    The time assigned to the outer loop (the i loop) might be 10 seconds, and the time assigned to the inner loop (the j loop) might be 9.9 seconds.

  • The percentage of total program runtime, measured as wallclock time, spent executing this loop. As with wallclock time, outer loops are credited with time spent in loops they contain.

  • The names of the variables that cause a data dependency in this loop. This field appears only when the compiler hint indicates that the loop suffers from a data dependency. The following loop illustrates a data dependency:

    for (i=0; i<10; i++) {
        a[i] = b * c;
        d[i] = a[i] + e;
    }
    

    This loop contains a data dependency: the value a[i] must be computed before d[i] can be computed; that is, d[i] depends on a[i].


Understanding Compiler Hints

LoopReport presents you with somewhat cryptic hints about the optimizations applied to a particular loop, and about why a particular loop may not have been parallelized.


Note - The hints are gathered by the compiler during the optimization pass. They should be understood in that context; they are not absolute facts about the code generated for a given loop. However, the hints are often very useful indications of how you can transform your code so that the compiler can perform more aggressive optimizations, including parallelizing loops.

Some of the hints are redundant; that is, two hints may appear to mean essentially the same thing.

Let Sun know which of the hints help you or what other sorts of hints you need from the compiler. You can send feedback by using the Comment form available from the About box in the LoopTool GUI. See WorkShop: Beyond the Basics for more information about the LoopTool GUI.

Finally, read the sections in the Fortran User's Guide and C User's Guide that address parallelization. There are useful explanations and tips inside these manuals.

The following table lists the hints that may be reported for a loop.

Hint #   Hint Definition

0        No hint available
1        Loop contains procedure call
2        Compiler generated two versions of this loop
3        Loop contains data dependency
4        Loop was significantly transformed during optimization
5        Loop may or may not hold enough work to be profitably parallelized
6        Loop was marked by user-inserted pragma, DOALL
7        Loop contains multiple exits
8        Loop contains I/O, or other function calls, that are not MT safe
9        Loop contains backward flow of control
10       Loop may have been distributed
11       Two or more loops may have been fused
12       Two or more loops may have been interchanged
0. No hint available

None of the other hints applied to this loop. That does not mean that none of the other hints might apply; it simply means that the compiler did not infer any of those hints.

1. Loop contains procedure call

The loop could not be parallelized because it contains a procedure call that is not MT (multithread) safe. If such a loop were parallelized, multiple copies of the loop might call the procedure simultaneously, trample on each other's use of variables local to that procedure, trample on return values, and generally invalidate the procedure's purpose. If you are certain that the procedure calls in this loop are MT safe, you can direct the compiler to parallelize the loop by inserting the DOALL pragma before the body of the loop. For example, if foo is an MT-safe function, you can force the inner loop to be parallelized by inserting c$par DOALL:

c$par DOALL
      do 19 i = 1, n*n
         do 29 k = i, n*n
            a(i) = a(i) + x*b(k)*c(i-k)
            call foo()
 29      continue
 19   continue

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or with -explicitpar; if you compile with -autopar, the compiler ignores the DOALL pragmas. This can be handy for debugging or fine-tuning.

2. Compiler generated two versions of this loop

The compiler could not tell, at compile time, whether the loop contained enough work to be profitable to parallelize. It therefore generated two versions of the loop, a serial version and a parallel version, together with a runtime check that chooses which version to execute. The runtime check determines the amount of work the loop has to do by checking the loop iteration values.

3. Loop contains data dependency

A variable inside the loop is affected by the value of a variable in a previous iteration of the loop. For example:

      do 99 i = 1, n
         do 99 j = 1, m
            a(i,j+1) = a(i,j) + a(i,j-1)
 99   continue

This is a contrived example because, for such a simple loop, the optimizer would simply swap the inner and outer loops so that the inner loop could be parallelized. But the example demonstrates the concept of a data dependency that is carried from one iteration to the next, often referred to as a loop-carried dependency.

The compiler can often tell you the names of the variables that cause the loop-carried dependency. If you rearrange your program to remove (or minimize) such dependencies, the compiler will be able to perform more aggressive optimizations.

4. Loop was significantly transformed during optimization

The compiler performed some optimizations on this loop that might make it almost impossible to associate the generated code with the source code. For this reason, line numbers may be incorrect. Examples of optimizations that can radically alter a loop are loop distribution, loop fusion, and loop interchange (see Hint 10, Hint 11, and Hint 12).

5. Loop may or may not hold enough work to be profitably parallelized

The compiler was not able to determine at compile time whether this loop definitely held enough work to warrant the overhead of parallelizing. Often, loops that are labeled with this hint are also labeled as "parallelized," meaning that the compiler generated two versions of the loop (see Hint 2), and that the choice between the parallel version and the serial version is made at runtime.

All the compiler hints, including the flag that indicates whether or not a loop is parallelized, are generated at compile time. There is no way to be certain from the report alone that a loop labeled as "parallelized" actually executed in parallel; that requires additional runtime tracing, such as the Thread Analyzer provides. You can compile your programs with both -Zlp (for LoopReport) and -Ztha (for Thread Analyzer), and compare the analyses of the two tools to get as much information as possible about the runtime behavior of your program.

6. Loop was marked by user-inserted pragma, DOALL

This loop was parallelized because the compiler was instructed to do so by the DOALL pragma. This hint helps you easily identify those loops that you explicitly wanted to parallelize.

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler will ignore the DOALL pragmas, which can be handy for debugging or fine-tuning.

7. Loop contains multiple exits

The loop contains a GOTO or some other branch out of the loop other than the natural loop end point. For this reason, it is not safe to parallelize the loop, since the compiler has no way of predicting the runtime behavior of the loop.

8. Loop contains I/O, or other function calls, that are not MT safe

This hint is similar to Hint 1; the difference is that this hint often focuses on I/O that is not MT safe, whereas Hint 1 could refer to any sort of MT-unsafe function call.

9. Loop contains backward flow of control

The loop contains a GOTO or other control flow up and out of the body of the loop. That is, as far as the compiler's control flow analysis can determine, some statement inside the loop jumps back to a previously executed portion of code. As with a loop that contains multiple exits, this condition makes the loop unsafe to parallelize.

If you can reduce or minimize backward flows of control, the compiler will be able to perform more aggressive optimizations.

10. Loop may have been distributed

The contents of the loop may have been distributed over several iterations of the loop. That is, the compiler may have been able to rewrite the body of the loop so that the loop could be parallelized. However, since this rewriting takes place in the language of the internal representation of the optimizer, it's very difficult to associate the original source code with the rewritten version. For this reason, hints about a distributed loop may refer to line numbers that don't correspond to line numbers in your source code.

11. Two or more loops may have been fused

Two consecutive loops were combined into one, so that the resulting larger loop contains enough work to be profitably parallelized. Again, in this case, source line numbers for the loop may be misleading.

12. Two or more loops may have been interchanged

The loop indices of an inner and an outer loop have been swapped, to move data dependencies as far away from the inner loop as possible, and to enable this nested loop to be parallelized. In the case of deeply nested loops, the interchange may have occurred with more than two loops.


Compiler Optimizations and How They Affect Loops

As you might infer from the descriptions of the compiler hints, it can be tricky to associate optimized code with source code. Clearly, you would prefer to see information from the compiler presented to you in a way that relates as directly as possible to your source code. After all, most people don't care about the compiler's internal representation of their programs. Unfortunately, the compiler optimizer is "reading" your program in terms of this internal language. Although it tries its best to relate that to your source code, it is not always successful.

Inlining

Inlining is an optimization applied only at optimization level -O4, and only to functions contained within a single file. Suppose one file contains 17 Fortran functions, and 16 of those can be inlined into the first function. If you compile the file with -O4, the source code of those 16 functions may be copied into the body of the first function. Further optimizations are then applied, and it becomes difficult to say which loop on which source line was subjected to which optimization.

If the compiler hints seem particularly opaque, consider compiling with -O3 -parallel -Zlp, so that you can see what the compiler has to say about your loops before it tries to inline any of your functions.

In particular, if you notice "phantom" loops, that is, loops that the compiler claims exist but that you know do not exist in your source code, this may well be a symptom of inlining.

Loop Transformations -- Unrolling, Jamming, Splitting, and Transposing

The compiler performs many optimizations on loops that radically change the body of a loop. These include loop unrolling, loop jamming, loop splitting, and loop transposition.

LoopReport attempts to provide you with hints that make as much sense as possible. Given the nature of the problem of associating optimized code with source code, however, the hints may be misleading. If you are interested in this topic, refer to compiler books such as Compilers: Principles, Techniques and Tools by Aho, Sethi, and Ullman, for more information on what these optimizations do for your code.

Parallel Loops Nested Inside Serial Loops

If a parallel loop is nested inside a serial loop, the runtime information reported by LoopReport may be misleading.

Each loop is charged with the wallclock time of all of its iterations. If an inner loop is parallelized, it is still assigned the wallclock time of every iteration, even though some of those iterations run in parallel.

The outer loop, however, is assigned only the elapsed runtime of its child, the parallel loop, which is the runtime of the longest parallel instantiation of the inner loop.

This can lead to the anomaly of the outer loop consuming "less" time than the inner loop. Keep in mind the peculiar nature of time when measuring events that occur in parallel, and this will help keep all wallclock times in perspective.



