

23  Using LoopReport


The Fortran MP and MP C compilers automatically parallelize loops that they determine safe and profitable to parallelize. LoopReport is a performance analysis tool that reads loop timing files created by these compilers.

This chapter is organized as follows:

Basic Concepts
Setting Up Your Environment
Creating a Loop Timing File
Starting LoopReport
Other Compilation Options
Fields in the Loop Report
Understanding Compiler Hints
Compiler Optimizations and How They Affect Loops


Basic Concepts

LoopReport is the command-line version of LoopTool. It produces an ASCII report of loop times.

Using LoopReport is similar to using gprof: the three major steps are compile, run, and analyze.


Note - The following examples use the Fortran MP (f77) compiler. The options shown (such as -xparallel and -Zlp) also work for MP C.


Setting Up Your Environment

1. Before compiling, set the environment variable PARALLEL to the number of processors on your machine.

The following command makes use of psrinfo, a system utility. Note the backquotes:

% setenv PARALLEL `/usr/sbin/psrinfo | wc -l`


Note - If you have installed LoopReport in a non-default directory, substitute that path for the one shown here.

2. Before starting LoopReport, make sure the environment variable XUSERFILESEARCHPATH is set:

% setenv XUSERFILESEARCHPATH \
/opt/SUNWspro/lib/sunpro_defaults/looptool.res

3. Set LD_LIBRARY_PATH.

If you are running Solaris 2.5:

% setenv LD_LIBRARY_PATH /usr/dt/lib:$LD_LIBRARY_PATH

If you are running Solaris 2.4:

% setenv LD_LIBRARY_PATH \
 /opt/SUNWspro/Motif_Solaris24/dt/lib:$LD_LIBRARY_PATH

You may want to put these commands in a shell start-up file (such as .cshrc or .profile).


Creating a Loop Timing File

To compile for automatic parallelization, typical compilation options are -xparallel and -xO4. To compile for LoopReport, add -Zlp, as shown in the following example:

% f77 -xO4 -xparallel -Zlp source_file


Note - All examples apply to Fortran 77, Fortran 90, and C programs.

After compiling with -Zlp, run the instrumented executable. This creates the loop timing file, program.looptimes. LoopReport processes two files: the instrumented executable and the loop timing file.


Starting LoopReport

When it starts up, LoopReport expects to be given the name of your program. Type loopreport followed by the name of the executable you want to examine:

% loopreport program

You can also start LoopReport without specifying a program. If you invoke LoopReport without the name of a program, it looks for a file named a.out in the current working directory:

% loopreport > a.out.loopreport

You can also direct the output into a file, or pipe it into another command:

% loopreport program  >  program.loopreport
% loopreport program | more

Timing File

LoopReport also reads the timing file associated with your program. The timing file is created when you compile with the -Zlp option, and contains information about the program's loops. Typically, this file has a name of the form program.looptimes, and is found in the same directory as your program.

However, there are four ways to specify the location of a timing file; LoopReport chooses among them in a fixed order of precedence.


Other Compilation Options

To compile for automatic parallelization, typical compilation switches are -xparallel and -xO4. To compile for LoopReport, add -Zlp.

% f77 -xO4 -xparallel -Zlp source_file

There are several other useful options for examining and parallelizing loops.

Option          Effect

-o program      Renames the executable to program
-xexplicitpar   Parallelizes loops marked with the DOALL pragma
-xloopinfo      Prints hints to stderr, for redirection to files
Either -xO3 or -xO4 can be used with -xparallel. If you don't specify -xO3 or -xO4 but you do use -xparallel, then -xO3 is added. The table below summarizes how switches are added.

You type:              Bumped up to:

-xparallel             -xparallel -xO3
-xparallel -Zlp        -xparallel -xO3 -Zlp
-xexplicitpar          -xexplicitpar -xO3
-xexplicitpar -Zlp     -xexplicitpar -xO3 -Zlp
-Zlp                   -xdepend -xO3 -Zlp

The -xexplicitpar and -xloopinfo options have specific applications.

-xexplicitpar

The Fortran MP compiler switch -xexplicitpar is used with the pragma DOALL. If you insert DOALL before a loop in your source code, you are explicitly marking that loop for parallelization. The compiler will parallelize this loop when you compile with -xexplicitpar.

The following code fragment shows how to mark a loop explicitly for parallelization.

      subroutine adj(a,b,c,x,n)
      real*8 a(n), b(n), c(-n:0), x
      integer n
c$par DOALL
      do 19 i = 1, n*n
         do 29 k = i, n*n
            a(i) = a(i) + x*b(k)*c(i-k)
 29      continue
 19   continue
      return
      end

When you use -Zlp by itself, -xdepend and -xO3 are added. The -xdepend switch instructs the compiler to perform the data dependency analysis that it needs to identify loops. The -xparallel switch includes -xdepend, but -xdepend does not imply (or trigger) -xparallel.

-xloopinfo

The -xloopinfo option prints hints about loops to stderr (the UNIX standard error file, on file descriptor 2) when you compile your program. The hints include the routine names, line number for the start of the loop, whether the loop was parallelized, and, if appropriate, the reason it was not parallelized.

The following example redirects hints about loops in the source file gamteb.F to the file named gamteb.loopinfo.

% f77 -xO3 -parallel -xloopinfo -Zlp gamteb.F 2> gamteb.loopinfo

The main difference between -Zlp and -xloopinfo is that in addition to providing you with compiler hints about loops, -Zlp also instruments your program so that timing statistics are recorded at runtime. For this reason, also, LoopReport analyzes only programs that have been compiled with -Zlp.

Figure 23-1  Sample Loop Report


Fields in the Loop Report

The loop report contains the following information:

  • Parallelized by the compiler?

    Y means that this loop was marked for parallelization; N means that the loop was not.

  • Number of times this loop was entered from above

    This is distinct from the number of loop iterations, which is the total number of times a loop executes. For example, consider these two nested loops in Fortran:

    do 10 i=1,17
       do 10 j=1,50
          ...some code...
 10    continue

    The first loop is entered once, and it iterates 17 times. The second loop is entered 17 times, and it iterates 17*50 = 850 times.

  • Nest

    Nesting level of the loop. If a loop is a top-level loop, its nesting level is 0. If the loop is the child of another loop, its nesting level is 1.

    For example, in this C code, the i loop has a nesting level of 0, the j loop has a nesting level of 1, and the k loop has a nesting level of 2.

    for (i=0; i<17; i++)
        for (j=0; j<42; j++)
            for (k=0; k<1000; k++)
                do_something();
    

  • Wallclock

    The total amount of elapsed wallclock time spent executing this loop for the whole program. The elapsed time for an outer loop includes the elapsed time for an inner loop. For example:

    for (i=1; i<10; i++)
        for (j=1; j<10; j++)
            do_something();
    

    The time assigned to the outer loop (the i loop) might be 10 seconds, and the time assigned to the inner loop (the j loop) might be 9.9 seconds.

  • The percentage of total program runtime, measured as wallclock time, spent executing this loop. As with wallclock time, outer loops are credited with time spent in loops they contain.

  • The names of the variables that cause a data dependency in this loop. This field appears only when the compiler hint indicates that the loop suffers from a data dependency. The following loop illustrates a data dependency:

    for (i=0; i<10; i++) {
        a[i] = b * c;
        d[i] = a[i] + e;
    }
    

    This loop contains a data dependency: the value a[i] must be computed before d[i] can be computed; that is, d[i] depends on a[i].


Understanding Compiler Hints

LoopReport presents you with somewhat cryptic hints about the optimizations applied to a particular loop, and about why a particular loop may not have been parallelized.


Note - The hints are gathered by the compiler during the optimization pass. They should be understood in that context; they are not absolute facts about the code generated for a given loop. However, the hints are often very useful indications of how you can transform your code so that the compiler can perform more aggressive optimizations, including parallelizing loops.

Some of the hints are redundant; that is, two hints may appear to mean essentially the same thing.

Let Sun know which of the hints help you or what other sorts of hints you need from the compiler. You can send feedback by using the Comment form available from the About box in the LoopTool GUI. See WorkShop: Beyond the Basics for more information about the LoopTool GUI.

Finally, read the sections in the Fortran User's Guide and C User's Guide that address parallelization. There are useful explanations and tips inside these manuals.

The following table lists the hints that may be reported for a loop.

Hint #   Hint Definition

0        No hint available
1        Loop contains procedure call
2        Compiler generated two versions of this loop
3        Loop contains data dependency
4        Loop was significantly transformed during optimization
5        Loop may or may not hold enough work to be profitably parallelized
6        Loop was marked by user-inserted pragma, DOALL
7        Loop contains multiple exits
8        Loop contains I/O, or other function calls, that are not MT safe
9        Loop contains backward flow of control
10       Loop may have been distributed
11       Two or more loops may have been fused
12       Two or more loops may have been interchanged
0. No hint available

None of the other hints applied to this loop. That does not mean that none of the other hints might apply; it simply means that the compiler did not infer any of those hints.

1. Loop contains procedure call

The loop could not be parallelized because it contains a procedure call that is not MT (multithread) safe. If such a loop were parallelized, multiple copies of the loop might call the procedure simultaneously, trample on each other's use of variables local to that procedure, trample on return values, and generally invalidate the procedure's purpose. If you are certain that the procedure calls in this loop are MT safe, you can direct the compiler to parallelize the loop by inserting the DOALL pragma before the body of the loop. For example, if foo is an MT-safe function, you can force the inner loop to be parallelized by inserting c$par DOALL:

c$par DOALL
      do 19 i = 1, n*n
         do 29 k = i, n*n
            a(i) = a(i) + x*b(k)*c(i-k)
            call foo()
 29      continue
 19   continue

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or with -explicitpar; if you compile with -autopar, the compiler ignores the DOALL pragmas. This can be handy for debugging or fine-tuning.

2. Compiler generated two versions of this loop

The compiler could not tell, at compile time, whether the loop contained enough work to be profitable to parallelize. It therefore generated two versions of the loop, a serial version and a parallel version, together with a runtime check that chooses which version to execute. The runtime check determines the amount of work the loop has to do by checking the loop iteration values.

3. Loop contains data dependency

A variable inside the loop is affected by the value of a variable in a previous iteration of the loop. For example:

      do 99 i = 1, n
         do 99 j = 1, m
            a(i,j+1) = a(i,j) + a(i,j-1)
 99   continue

This is a contrived example because, for such a simple loop, the optimizer would simply swap the inner and outer loops so that the inner loop could be parallelized. But the example demonstrates the concept of a data dependency that is carried from one iteration to the next, often referred to as a loop-carried dependency.

The compiler can often tell you the names of the variables that cause the loop-carried dependency. If you rearrange your program to remove (or minimize) such dependencies, the compiler will be able to perform more aggressive optimizations.

4. Loop was significantly transformed during optimization

The compiler performed some optimizations on this loop that might make it almost impossible to associate the generated code with the source code. For this reason, line numbers may be incorrect. Examples of optimizations that can radically alter a loop are loop distribution, loop fusion, and loop interchange (see Hint 10, Hint 11, and Hint 12).

5. Loop may or may not hold enough work to be profitably parallelized

The compiler was not able to determine at compile time whether this loop definitely held enough work to warrant the overhead of parallelizing. Often, loops that are labeled with this hint are also labeled as "parallelized," meaning that the compiler generated two versions of the loop (see Hint 2), and that the choice between the parallel version and the serial version is made at runtime.

All the compiler hints, including the flag that indicates whether or not a loop is parallelized, are generated at compile time. There is no way to be certain from the report alone that a loop labeled as "parallelized" actually executed in parallel; that requires additional runtime tracing, such as the Thread Analyzer provides. You can compile your programs with both -Zlp (for LoopReport) and -Ztha (for Thread Analyzer), and compare the analyses of the two tools to get as much information as possible about the runtime behavior of your program.

6. Loop was marked by user-inserted pragma, DOALL

This loop was parallelized because the compiler was instructed to do so by the DOALL pragma. This hint helps you easily identify those loops that you explicitly wanted to parallelize.

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler will ignore the DOALL pragmas, which can be handy for debugging or fine-tuning.

7. Loop contains multiple exits

The loop contains a GOTO or some other branch out of the loop other than the natural loop end point. For this reason, it is not safe to parallelize the loop, since the compiler has no way of predicting the runtime behavior of the loop.

8. Loop contains I/O, or other function calls, that are not MT safe

This hint is similar to Hint 1; the difference is that this hint often focuses on I/O that is not MT safe, whereas Hint 1 could refer to any sort of MT-unsafe function call.

9. Loop contains backward flow of control

The loop contains a GOTO or other control flow up and out of the body of the loop. That is, as far as the compiler's control flow analysis can determine, some statement inside the loop jumps back to a previously executed portion of code. As with a loop that contains multiple exits, this condition makes the loop unsafe to parallelize.

If you can reduce or minimize backward flows of control, the compiler will be able to perform more aggressive optimizations.

10. Loop may have been distributed

The contents of the loop may have been distributed over several iterations of the loop. That is, the compiler may have been able to rewrite the body of the loop so that the loop could be parallelized. However, since this rewriting takes place in the language of the internal representation of the optimizer, it's very difficult to associate the original source code with the rewritten version. For this reason, hints about a distributed loop may refer to line numbers that don't correspond to line numbers in your source code.

11. Two or more loops may have been fused

Two consecutive loops were combined into one, so that the resulting larger loop contains enough work to be profitably parallelized. Again, in this case, source line numbers for the loop may be misleading.

12. Two or more loops may have been interchanged

The loop indices of an inner and an outer loop have been swapped, to move data dependencies as far away from the inner loop as possible, and to enable this nested loop to be parallelized. In the case of deeply nested loops, the interchange may have occurred with more than two loops.


Compiler Optimizations and How They Affect Loops

As you might infer from the descriptions of the compiler hints, it can be tricky to associate optimized code with source code. Clearly, you would prefer to see information from the compiler presented to you in a way that relates as directly as possible to your source code. After all, most people don't care about the compiler's internal representation of their programs. Unfortunately, the compiler optimizer is "reading" your program in terms of this internal language. Although it tries its best to relate that to your source code, it is not always successful.

Inlining

Inlining is an optimization applied only at optimization level -O4, and only to functions contained within a single file. Suppose one file contains 17 Fortran functions, and 16 of those can be inlined into the first function. If you compile the file with -O4, the source code of those 16 functions may be copied into the body of the first function. Further optimizations are then applied, and it becomes difficult to say which loop on which source line was subjected to which optimization.

If the compiler hints seem particularly opaque, consider compiling with -O3 -parallel -Zlp, so that you can see what the compiler has to say about your loops before it tries to inline any of your functions.

In particular, if you notice "phantom" loops, that is, loops that the compiler claims exist but that you know do not exist in your source code, this may well be a symptom of inlining.

Loop Transformations -- Unrolling, Jamming, Splitting, and Transposing

The compiler performs many optimizations on loops that radically change the body of a loop. These include loop unrolling, loop jamming, loop splitting, and loop transposition.

LoopReport attempts to provide you with hints that make as much sense as possible. Given the nature of the problem of associating optimized code with source code, however, the hints may be misleading. If you are interested in this topic, refer to compiler books such as Compilers: Principles, Techniques and Tools by Aho, Sethi, and Ullman, for more information on what these optimizations do for your code.

Parallel Loops Nested Inside Serial Loops

If a parallel loop is nested inside a serial loop, the runtime information reported by LoopReport may be misleading.

Each loop is charged with the wallclock time of all of its iterations. If an inner loop is parallelized, it is still assigned the wallclock time of every iteration, even though some of those iterations run in parallel.

The outer loop, however, is assigned only the elapsed runtime of its child, the parallel loop, which is the runtime of the longest parallel instantiation of the inner loop.

This can lead to the anomaly of the outer loop consuming "less" time than the inner loop. Keep in mind the peculiar nature of time when measuring events that occur in parallel, and this will help keep all wallclock times in perspective.



