

Analyzing Loops

3


The Fortran MP and MP C compilers automatically parallelize loops for which they determine that it is safe and profitable to do so. LoopTool is a performance analysis tool that reads loop timing files created by these compilers. LoopTool has a graphical user interface (GUI); LoopReport (which is discussed in Sun WorkShop: Command Line Utilities) is the command-line version of LoopTool.

This chapter is organized as follows:

Basic Concepts                                      page 23
Setting Up Your Environment                         page 24
Creating a Loop Timing File                         page 25
Starting LoopTool                                   page 26
Using LoopTool                                      page 27
Other Compilation Options                           page 32
Compiler Hints                                      page 34
Compiler Optimizations and How They Affect Loops    page 38


Basic Concepts

LoopTool's main features include the ability to display a graph of loop runtimes, to show which loops were parallelized, and to go directly from the graphical display of loops to the source code for any loop you want, so you can edit your source code while in LoopTool.

LoopReport is the command-line version of LoopTool. For more information about LoopReport, see Sun WorkShop: Command Line Utilities.

Using LoopTool is like using gprof. The three major steps are: compile, run, and analyze.


Note - The following examples use the Fortran MP (f77 and f90) compiler. The options shown (such as -xparallel, -Zlp) also work for MP C.


Setting Up Your Environment

1. Before compiling, set the environment variable PARALLEL to the number of processors on your machine.

The following command makes use of psrinfo, a system utility. Note the backquotes:

% setenv PARALLEL `/usr/sbin/psrinfo | wc -l`


2. Before starting LoopTool, make sure the environment variable XUSERFILESEARCHPATH is set:

% setenv XUSERFILESEARCHPATH \
/opt/SUNWspro/lib/sunpro_defaults/looptool.res

Note - If you have installed LoopTool in a nondefault directory, substitute that path for the one shown here.

3. Set LD_LIBRARY_PATH.

If you are running Solaris 2.5:

% setenv LD_LIBRARY_PATH /usr/dt/lib:$LD_LIBRARY_PATH

If you are running Solaris 2.3 or 2.4:

% setenv LD_LIBRARY_PATH \
 /opt/SUNWspro/Motif_Solaris24/dt/lib:$LD_LIBRARY_PATH

You may want to put these commands in a shell startup file (such as .cshrc or .profile).


Creating a Loop Timing File

To compile for automatic parallelization, typical compilation switches are -xparallel and -xO4. To compile for LoopTool, add -Zlp, as shown in the following example:

% f77 -xO4 -xparallel -Zlp source_file


Note - All examples apply to Fortran 77, Fortran 90 and C programs.
For additional information, see "Loading a Timing File" on page 26.

There are a number of other useful options for looking at and parallelizing loops. Some of these options are shown in Table 3-1 below.

Table 3-1 Some Useful Compiler Options

Option           Effect
-o program       Renames the executable to program
-xexplicitpar    Parallelizes loops marked with the DOALL pragma
-xloopinfo       Prints hints to stderr for redirection to files

For more information, see "Other Compilation Options" on page 32.

Running the Program

After compiling with -Zlp, run the instrumented executable. This creates the loop timing file, program.looptimes. LoopTool processes two files: the instrumented executable and the loop timing file.


Starting LoopTool

You can start LoopTool by giving it the name of a program (that is, an executable) to load:

% looptool program &

You can also start the tools with no files specified. In this case, LoopTool's file chooser comes up automatically so you can select a file to examine:

% looptool &

LoopReport is usually started like this:

% loopreport program &

Loading a Timing File

LoopTool reads the timing file associated with your program. The timing file contains information about loops. Typically, this file has a name of the format program.looptimes and is in the same directory as your program.

By default, LoopTool looks in the executable's directory for a timing file. Therefore, if the timing file is there (the usual case), you don't need to specify where to look for it:

% looptool program &

If you name a timing file on the command line, then LoopTool and LoopReport use it.

% looptool program program.looptimes &

If you use the command line option -p, LoopTool and LoopReport check for a timing file in the directory indicated by -p:

% looptool -p timing_file_directory program &

If the environment variable LVPATH is set, the tools check that directory for a timing file.

% setenv LVPATH timing_file_directory
% looptool program &


Using LoopTool

The Main Window

The main window displays the runtimes of your program's loops in a bar chart arranged in the order that the source files were presented to the compiler.

Figure 3-1 shows the components of the main window.





Figure  3-1 LoopTool Main Window

Opening Files

Choose Open from the File menu in the main window to open executable and timing files.

There are two ways to specify the files you want to open:

Once you've typed in the executable's path, you don't need to type in the timing file, unless it's in a different directory or has a non-default name (or both).

For more information about opening files, see the LoopTool section of the WorkShop Online Help.

Creating a Report on All Loops

Choose Create Report from the File menu in the main window to open a window with detailed information on all the loops in your program (see Figure 3-2). The Help button in the report window links to the WorkShop Online Help section containing compiler hints.





Figure  3-2 LoopReport

Printing the LoopTool Graph

1. Choose Print Graph from the File menu in the main window to open the Print pop-up window.

2. Choose whether to print the graph or put it in a file.

3. Enter the name of the printer or file where you want to send the graph.

For more information about printing, see the WorkShop Online Help.

Choosing an Editor

Choose Options from the File menu in the main window to open the Options pop-up window.

The Options pop-up window lets you choose an editor for editing source code. The editors are vi, gnuemacs, and xemacs. See "Getting Hints and Editing Source Code" on page 30 for more on editing source code.


Note - vi and xemacs are installed with LoopTool into your install directory (usually /opt/SUNWspro/bin) if they're not already on your system. You must provide gnuemacs yourself. In all cases, the editor you want must be in a directory that's in your search path in order for LoopTool to find it. For example, your PATH environment variable should include /usr/ucb if that's where vi is located on your system.
For more information about choosing an editor, see the WorkShop Online Help.

Getting Hints and Editing Source Code

Clicking a loop in the main window (Figure 3-1) does two things:

For information on vi, see the vi(1) manual page. xemacs and gnuemacs have online help (click the Help button).

The WorkShop vi editor has a special menu, Version, that allows you to make use of the SCCS (Source Code Control System) utility for sharing files. See the LoopTool online help, as well as the sccs(1) manual page, for more information.

3. It brings up a separate window that displays one or more hints about the loop you've selected. The Help button in this window displays the WorkShop online help compiler hints section. See also "Compiler Hints" on page 34, which explains the hints in detail.




Figure 3-3 shows the editor and hint windows:

Figure  3-3 The Editor and Hints Windows

Warning - If you edit your source code, line numbers shown by LoopTool may become inconsistent with the source. You must save and recompile the edited source and then run LoopTool with the new executable, producing new loop information, for the line numbers to remain consistent.

Getting Help and Sending Comments

Choose from the Help menu (shown in Figure 3-1) to:


Other Compilation Options

Many combinations of compile switches work for LoopTool.

Either -xO3 or -xO4 can be used with -xparallel. If you don't specify -xO3 or -xO4 but you do use -xparallel, then -xO3 is added. Table 3-2 summarizes how switches are added.

Table 3-2 Promotion of Compiler Switches

You type:             Bumped up to:
-xparallel            -xparallel -xO3
-xparallel -Zlp       -xparallel -xO3 -Zlp
-xexplicitpar         -xexplicitpar -xO3
-xexplicitpar -Zlp    -xexplicitpar -xO3 -Zlp
-Zlp                  -xdepend -xO3 -Zlp

Other compilation options include -xexplicitpar and -xloopinfo.

The Fortran MP compiler switch -xexplicitpar is used with the pragma DOALL. If you insert DOALL before a loop in your source code, you are explicitly marking that loop for parallelization. The compiler will parallelize this loop when you compile with -xexplicitpar.

The following code fragment shows how to mark a loop explicitly for parallelization.

	subroutine adj(a,b,c,x,n)
	integer n
	real*8 a(n), b(n), c(-n:0), x
c$par DOALL
	do 19 i = 1, n*n
	  do 29 k = i, n*n
	    a(i) = a(i) + x*b(k)*c(i-k)
29	  continue
19	continue
	return
	end

When you use -Zlp by itself, -xdepend and -xO3 are added. The switch -xdepend instructs the compiler to perform the data dependency analysis that it needs to do to identify loops. The switch -xparallel includes -xdepend, but -xdepend does not imply (or trigger) -xparallel.

The -xloopinfo option prints hints about loops to stderr (the UNIX standard error file, on file descriptor 2) when you compile your program. The hints include the routine names, the line number for the start of the loop, whether the loop was parallelized, and the reason it was not parallelized, if applicable.

The following example redirects hints about loops in the source file gamteb.F to the file gamteb.loopinfo:

% f77 -xO3 -parallel -xloopinfo -Zlp gamteb.F 2> gamteb.loopinfo

The main difference between -Zlp and -xloopinfo is that, in addition to providing compiler hints about loops, -Zlp also instruments your program so that timing statistics are recorded at runtime. For this reason, LoopTool and LoopReport can analyze only programs that have been compiled with -Zlp.


Compiler Hints

LoopTool and LoopReport present somewhat cryptic hints about the optimizations applied to a particular loop, and in particular, about why a particular loop may not have been parallelized. Some of the hints may seem to mean essentially the same thing.


Note - The hints are heuristics gathered by the compiler during the optimization pass. They should be understood in that context; they are not absolute facts about the code generated for a given loop. However, the hints are often very useful indications of how you can transform your code so that the compiler can perform more aggressive optimizations, including parallelizing loops.
For some useful explanations and tips, read the sections in the Sun WorkShop Fortran: User's Guide that address parallelization.

Table 3-3 lists the hints about optimizations applied to loops.

Table 3-3 LoopTool Hints

Hint #    Hint Definition
0         No hint available
1         Loop contains procedure call
2         Compiler generated two versions of this loop
3         Loop contains data dependency
4         Loop was significantly transformed during optimization
5         Loop may or may not hold enough work to be profitably parallelized
6         Loop was marked by user-inserted pragma, DOALL
7         Loop contains multiple exits
8         Loop contains I/O, or other function calls, that are not MT safe
9         Loop contains backward flow of control
10        Loop may have been distributed
11        Two or more loops may have been fused
12        Two or more loops may have been interchanged

0. No hint available

None of the other hints applied to this loop. This hint does not mean that none of the other hints might apply; it means that the compiler did not infer any of those hints.

1. Loop contains procedure call

The loop could not be parallelized because it contains a procedure call that is not MT safe. If such a loop were parallelized, multiple copies of the loop might call the function simultaneously, trample on each other's use of any variables local to that function, or trample on return values, and generally invalidate the function's purpose. If you are certain that the procedure calls in this loop are MT safe, you can direct the compiler to parallelize this loop no matter what by inserting the DOALL pragma before the body of the loop. For example, if foo is an MT-safe function, then you can force the loop that calls it to be parallelized by inserting c$par DOALL:

c$par DOALL
	do 19 i = 1, n*n
	  do 29 k = i, n*n
	    a(i) = a(i) + x*b(k)*c(i-k)
	    call foo()
29	  continue
19	continue

The compiler interprets the DOALL pragmas only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler ignores the DOALL pragmas.
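The hazard described under this hint can be sketched in C, the other language these compilers accept. The function names below are hypothetical and are not part of any Sun library. The first function keeps hidden state in a static variable, so it is not MT safe; the second keeps all state in storage supplied by the caller.

```c
#include <assert.h>

/* NOT MT safe: every caller shares one hidden static counter, so
   parallel iterations that call this function would race on `last`. */
int next_id_unsafe(void)
{
    static int last = 0;
    last = last + 1;
    return last;
}

/* MT safe: all state lives in storage owned by the caller, so each
   thread (or loop iteration) can pass its own counter. */
int next_id_safe(int *last)
{
    assert(last != 0);
    *last = *last + 1;
    return *last;
}
```

Two callers that pass different counters to next_id_safe cannot disturb each other, which is the property you need to be sure of before forcing parallelization of a loop that contains a call.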

2. Compiler generated two versions of this loop

The compiler couldn't tell at compile time if the loop contained enough work to be profitable to parallelize. The compiler generated two versions of the loop, a serial version and a parallel version, and a runtime check that will choose at runtime which version to execute. The runtime check determines the amount of work that the loop has to do by checking the loop iteration values.
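The idea of two versions guarded by a runtime check can be sketched in C as follows. This is an illustration only, not the code the compiler generates: the cutoff MIN_PARALLEL_TRIPS is a made-up value, and the "parallel" version is a stand-in that merely cuts the index range into chunks that could each be given to a separate thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: below this trip count, assume that starting
   threads costs more than it saves. A real compiler picks its own. */
#define MIN_PARALLEL_TRIPS 1000

/* Serial version of the loop. */
static void scale_serial(double *a, size_t n, double x)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= x;
}

/* Stand-in for the parallel version: the index range is cut into
   chunks that could each be handed to a separate thread. Here the
   chunks simply run one after another. */
static void scale_chunked(double *a, size_t n, double x, size_t nchunks)
{
    assert(nchunks > 0);
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = n * c / nchunks;
        size_t hi = n * (c + 1) / nchunks;
        for (size_t i = lo; i < hi; i++)
            a[i] *= x;
    }
}

/* Runtime check on the trip count.
   Returns 1 if the chunked version ran, 0 if the serial version ran. */
int scale(double *a, size_t n, double x)
{
    if (n >= MIN_PARALLEL_TRIPS) {
        scale_chunked(a, n, x, 4);
        return 1;
    }
    scale_serial(a, n, x);
    return 0;
}
```

Both paths produce the same array contents; only the trip count seen at runtime decides which one executes.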

3. Loop contains data dependency

A variable inside the loop is affected by the value of a variable in a previous iteration of the loop. For example:

do 99 i = 1,n
	do 99 j = 1,m
		a(i, j+1) = a(i,j) + a(i,j-1)
99 continue

This is a contrived example, since for such a simple loop the optimizer would simply swap the inner and outer loops, so that the inner loop could be parallelized. But this example demonstrates the concept of data dependency, often referred to as "data-carried dependency."

The compiler will often be able to tell you the names of the variables that cause the data-carried dependency. If you rearrange your program to remove (or minimize) such dependencies, then the compiler will be able to perform more aggressive optimizations.
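A small C sketch can show why such a dependency matters. The function names and the reverse flag are hypothetical; running the iterations in reverse order is used here only to mimic what could happen if iterations were scheduled independently of one another.

```c
#include <assert.h>
#include <stddef.h>

/* Carried dependency: iteration i reads a[i-1], which iteration i-1
   wrote. `reverse` runs the iterations in the opposite order. */
void running_sum(double *a, size_t n, int reverse)
{
    assert(a != 0);
    for (size_t k = 1; k < n; k++) {
        size_t i = reverse ? n - k : k;
        a[i] = a[i] + a[i - 1];
    }
}

/* No carried dependency: iteration i touches only a[i] and b[i]. */
void add_arrays(double *a, const double *b, size_t n, int reverse)
{
    assert(a != 0 && b != 0);
    for (size_t k = 0; k < n; k++) {
        size_t i = reverse ? n - 1 - k : k;
        a[i] = a[i] + b[i];
    }
}
```

For the array {1, 1, 1, 1}, running_sum gives {1, 2, 3, 4} in forward order but {1, 2, 2, 2} in reverse order, while add_arrays gives the same answer either way. Only the second loop is safe to run with its iterations in arbitrary order.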

4. Loop was significantly transformed during optimization

The compiler performed some optimizations on this loop that might make it almost impossible to associate the generated code with the source code. For this reason, line numbers may be incorrect. Examples of optimizations that can radically alter a loop are loop distribution, loop fusion, and loop interchange (see Hint 10, Hint 11, and Hint 12).

5. Loop may or may not hold enough work to be profitably parallelized

The compiler was not able to determine at compile time whether this loop held enough work to warrant parallelizing. Often loops that are labeled with this hint may also be labeled "parallelized," meaning that the compiler generated two versions of the loop (see Hint 2), and that it will be decided at runtime whether the parallel version or the serial version should be used.

Since all the compiler hints, including the flag that indicates whether or not a loop is parallelized, are generated at compile time, there's no way to be certain that a loop labeled "parallelized" actually executes in parallel. To determine whether a loop executes in parallel, you need to perform additional runtime tracing, such as can be accomplished with the Thread Analyzer. You can compile your programs with both -Zlp (for LoopTool) and -Ztha (for Thread Analyzer) and compare the analyses from both tools to get as much information as possible about your program's runtime behavior.

6. Loop was marked by user-inserted pragma, DOALL

This loop was parallelized because the compiler was instructed to do so by the DOALL pragma. This hint is a useful reminder to help you easily identify those loops that you explicitly wanted to parallelize.

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler will ignore the DOALL pragmas.

7. Loop contains multiple exits

The loop contains a GOTO or some other branch out of the loop other than the natural loop end point. For this reason, it is not safe to parallelize the loop, since the compiler has no way of predicting the loop's runtime behavior.
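One way to remove an early exit is to let every iteration run and keep only the result you need. The C sketch below uses hypothetical function names and makes no claim about what the compiler then does with the rewritten loop; it only shows that the two forms return the same answer.

```c
#include <assert.h>
#include <stddef.h>

/* Two exits: the loop can end at its bound or at the early return. */
size_t first_negative_early_exit(const double *a, size_t n)
{
    assert(a != 0 || n == 0);
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0)
            return i;
    return n;
}

/* Single exit: every iteration runs, and the smallest matching index
   is kept. Returns n when no element is negative. */
size_t first_negative_single_exit(const double *a, size_t n)
{
    assert(a != 0 || n == 0);
    size_t first = n;
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0 && i < first)
            first = i;
    return first;
}
```

The single-exit form does more work when a match appears early, so it is a trade-off rather than a free improvement.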

8. Loop contains I/O, or other function calls, that are not MT safe

This hint is similar to Hint 1; the difference is that this hint often focuses on I/O that is not MT safe, whereas Hint 1 can refer to any sort of MT-unsafe function call.

9. Loop contains backward flow of control

The loop contains a GOTO or other control flow up and out of the body of the loop. That is, some statement inside the loop appears to the compiler to jump back to some previously executed portion of code. As with the case of a loop that contains multiple exits, this loop is not safe to parallelize.

If you can reduce or minimize backward flows of control, the compiler will be able to perform more aggressive optimizations.

10. Loop may have been distributed

The body of the loop may have been split into several separate loops over the same iteration range (loop distribution). That is, the compiler may have been able to rewrite the loop so that at least part of it could be parallelized. However, since this rewriting takes place in the language of the internal representation of the optimizer, it's very difficult to associate the original source code with the rewritten version. For this reason, hints about a distributed loop may refer to line numbers that don't correspond to line numbers in your source code.
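As the term is commonly used in the compiler literature, loop distribution splits one loop into several loops over the same index range. The C sketch below, with hypothetical function names, shows the transformation done by hand; it is not a record of what the optimizer emits.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: the first statement carries a dependency (it reads
   a[i-1]); the second statement is independent across iterations. */
void combined(double *a, double *b, const double *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + c[i];
        b[i] = 2.0 * c[i];
    }
}

/* Distributed form: same results, but the second loop now has no
   carried dependency and is a candidate for parallelization. */
void distributed(double *a, double *b, const double *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + c[i];
    for (size_t i = 1; i < n; i++)
        b[i] = 2.0 * c[i];
}
```

Note that one source loop has become two loops, which is why line numbers reported for a distributed loop can be hard to match to your source.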

11. Two or more loops may have been fused

Two consecutive loops were combined into one, so the resulting larger loop contains enough work to be profitably parallelized. Again, in this case, source line numbers for the loop may be misleading.
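Loop fusion can be pictured with the following C sketch, written by hand with hypothetical function names. It shows two consecutive loops over the same range and the single loop that results from combining them.

```c
#include <assert.h>
#include <stddef.h>

/* Two consecutive loops over the same index range. */
void separate(double *a, double *b, const double *c, size_t n)
{
    assert(n == 0 || (a != 0 && b != 0 && c != 0));
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] + 1.0;
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
}

/* Fused form: one loop whose body does the work of both. Each
   iteration still reads only values produced in that same iteration. */
void fused(double *a, double *b, const double *c, size_t n)
{
    assert(n == 0 || (a != 0 && b != 0 && c != 0));
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] + 1.0;
        b[i] = a[i] * 2.0;
    }
}
```

Because the fused loop has no single counterpart in the source, a hint attached to it may point at either of the original loops.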

12. Two or more loops may have been interchanged

The loop indices of an inner and an outer loop have been swapped, to move data dependencies as far away from the inner loop as possible, and to enable this nested loop to be parallelized. In the case of deeply nested loops, the interchange may have occurred with more than two loops.
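The recurrence used as the example under Hint 3 can illustrate interchange. The C sketch below is a hand-written analogue with hypothetical names and a made-up column count; it shows that swapping the two loops leaves the result unchanged while moving the dependency to the outer loop.

```c
#include <assert.h>

#define COLS 6   /* hypothetical column count for the illustration */

/* i outer, j inner: the inner loop carries the dependency on j. */
void recur_i_outer(double a[][COLS], int rows)
{
    assert(rows >= 0);
    for (int i = 0; i < rows; i++)
        for (int j = 1; j < COLS - 1; j++)
            a[i][j + 1] = a[i][j] + a[i][j - 1];
}

/* Interchanged: j outer, i inner. For a fixed j, the iterations over
   i touch different rows, so the inner loop carries no dependency. */
void recur_j_outer(double a[][COLS], int rows)
{
    assert(rows >= 0);
    for (int j = 1; j < COLS - 1; j++)
        for (int i = 0; i < rows; i++)
            a[i][j + 1] = a[i][j] + a[i][j - 1];
}
```

The interchange is valid here because each row is computed independently of every other row.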


Compiler Optimizations and How They Affect Loops

As you might infer from the descriptions of the compiler hints, associating optimized code with source code can be tricky. Clearly, you would prefer to see information from the compiler presented to you in a way that relates as directly as possible to your source code. Unfortunately, the compiler optimizer "reads" your program in terms of its internal language, and although it tries to relate that to your source code, it is not always successful.

Some particular optimizations that can cause confusion are described in the following sections.

Inlining

Inlining is an optimization applied only at optimization level -O4 and only for functions contained within one file. That is, if one file contains 17 Fortran functions, 16 of which can be inlined into the first function, and you compile at -O4, then the source code for those 16 functions may be copied into the body of the first function. Then, when further optimizations are applied, it becomes difficult to determine which loop on which source line number was subjected to which optimization.

If the compiler hints seem particularly opaque, consider compiling with -O3 -parallel -Zlp, so that you can see what the compiler says about your loops before it tries to inline any of your functions.

In particular, "phantom" loops--that is, loops that the compiler claims exist, but you know do not exist in your source code--could well be a symptom of inlining.

Loop Transformations--Unrolling, Jamming, Splitting, and Transposing

The compiler performs many loop optimizations that radically change the body of the loop. These optimizations include unrolling, jamming, splitting, and transposing.

LoopTool attempts to provide hints that make as much sense as possible, but given the nature of the problem of associating optimized code with source code, the hints may be misleading. For more information on what optimizations do for your code, refer to compiler books such as Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman.
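As one illustration of how much a transformation can change a loop, the C sketch below unrolls a simple summation by four. It is written by hand with hypothetical function names and an arbitrary unroll factor; the compiler chooses its own factors and code shape.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one addition per iteration. */
double sum_rolled(const double *a, size_t n)
{
    assert(a != 0 || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four, with a cleanup loop for the leftover iterations.
   The single source loop has become two loops in this form. */
double sum_unrolled(const double *a, size_t n)
{
    assert(a != 0 || n == 0);
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

The additions happen in the same order in both forms, so the results agree, but the unrolled form no longer maps one-to-one onto the source lines of the original loop.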

Parallel Loops Nested Inside Serial Loops

If a parallel loop is nested inside a serial loop, the runtime information reported by LoopTool and LoopReport may be misleading because each loop is stipulated to use the wall-clock time of each of its loop iterations. If an inner loop is parallelized, it is assigned the wall-clock time of each iteration, although some of those iterations are running in parallel.

However, the outer loop is assigned only the runtime of its child, the parallel loop, which will be the runtime of the longest parallel instantiation of the inner loop. This double timing leads to the anomaly of the outer loop apparently consuming less time than the inner loop.
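The anomaly can be made concrete with a simplified model in C. The model and its function names are assumptions made for illustration, not a description of how LoopTool is implemented: the inner loop is charged the sum of the wall-clock times of its per-thread pieces, and the outer loop is charged only the longest piece.

```c
#include <assert.h>
#include <stddef.h>

/* Time charged to the parallel inner loop in this model: the sum of
   the wall-clock times of all its per-thread pieces. */
double inner_reported(const double *t, size_t nthreads)
{
    assert(t != 0 || nthreads == 0);
    double sum = 0.0;
    for (size_t i = 0; i < nthreads; i++)
        sum += t[i];
    return sum;
}

/* Time charged to the serial outer loop in this model: the longest
   piece only, because the pieces overlap in real time. */
double outer_reported(const double *t, size_t nthreads)
{
    assert(t != 0 && nthreads > 0);
    double longest = t[0];
    for (size_t i = 1; i < nthreads; i++)
        if (t[i] > longest)
            longest = t[i];
    return longest;
}
```

With four pieces taking 2.0, 2.0, 2.0, and 2.5 seconds, the inner loop is charged 8.5 seconds while the outer loop is charged 2.5 seconds, so the outer loop appears to take less time than the loop it contains.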



