PowerPC Behavior and Implementation

D

This appendix describes implementation and compatibility issues related to the floating-point units used in PowerPC processors. See "PowerPC Architecture, First Edition" as well as the PowerPC 603 and 604 User's Manuals for further information on PowerPC implementations.

Floating-Point Operations in the PowerPC Architecture

In addition to instructions for conventional unary and binary floating-point operations such as addition, subtraction, multiplication, division, floating- point to integer conversion, negation, absolute value, and comparison, the PowerPC architecture includes several instructions that perform variations of a ternary operation sometimes called a "fused multiply-add".

A fused multiply-add operation consists of a product and sum, a x b + c, computed with only one final rounding; the intermediate product a x b is not rounded. Because of this, the evaluation of an expression involving such a product and sum (for example, in an inner product or saxpy) can produce different results depending on whether the compiler generates a fused multiply-add instruction or a sequence of two instructions, a multiply followed by an add. Typically, the expression evaluated using a fused multiply-add operation will be at least as accurate, and possibly more accurate than the expression evaluated using separate multiply and add instructions, since the latter will incur two rounding errors rather than one. In fact, some codes can be written specifically to exploit the accuracy of the fused multiply-add. As is often true in numerical software, however, there are exceptions: some programs that expect each arithmetic operation to be rounded to working precision may function incorrectly when fused multiply-add instructions are used. The -xarch=ppc_nofma flag can be used to disable the generation of fused multiply-add instructions so that such programs can function correctly.

Two of the fused multiply-add instructions, the fnmadd and fnmsub instructions, negate their result, producing -(a x b + c) and c - a x b respectively. Note, however, that when an expression that does not produce an exactly representable result is evaluated in either round-to-positive-infinity or round-to-negative-infinity mode, the negation of the result does not produce the same value as the evaluation of the negated expression. (That is, if x is the result of evaluating expression a rounding toward positive infinity, -x is the result of evaluating -a rounding toward negative infinity.) Thus, when directed rounding modes are used, as in interval arithmetic, for example, care must be taken to avoid inadvertently reversing the sense of rounding by negation. Since the compiler cannot tell that a directed rounding mode is in effect, it may interpret an expression as involving a negation in order to issue a negated fused multiply-add instruction (for performance reasons); to avoid this problem, the -xarch=ppc_nofma flag should be used to disable generation of fused multiply-add instructions when directed rounding modes are used.

Floating-Point Status and Control Register

The PowerPC floating-point unit includes a floating-point status and control register (FPSCR) that contains flags that reflect the current state of the floating-point unit as well as mode bits that control its operation. (Some of these flags and mode bits implement features required or recommended by the IEEE standard.) The FPSCR is accessible to user-level code either via the IEEE support functions such as ieee_flags (see Chapter 3, "The Math Libraries") or from assembly language subroutines that can use special PowerPC instructions to read and modify it.

The PowerPC Architecture manual shows the bit assignments of the Floating- Point Status and Control Register. Note that the floating-point exception trap enable bits (VE, OE, UE, ZE, XE), the non-IEEE mode bit (NI), and the rounding mode bits (RN) are control bits: modifying these bits can alter the behavior of the floating-point unit for subsequent operations. The remaining bits, namely the exception summary bits, accrued exception bits, fraction rounded and fraction inexact bits, and result flags, are status bits: they record information about results computed in previous operations.

Floating-point Exceptions and Unimplemented Floating-Point Instructions

When a floating-point exception occurs and the corresponding trap enable bits not set, the untrapped default result specified by IEEE 754 is delivered to the destination register, the corresponding exception bit in the FPSCR is set (causing the overall exception summary bit and, in the case of an invalid operation exception, the invalid operation exception summary bit, also to be set), and execution continues.

When a floating-point exception occurs and the corresponding trap enable bit is set, if the exception is underflow, overflow, or inexact, a trapped default result specified by IEEE 754 is delivered to the destination register before a trap is taken. (Regardless of the exception type, the corresponding exception bit in the FPSCR is also set, causing the overall exception summary bit, the enabled exception summary bit, and, in the case of an invalid operation exception, the invalid operation exception summary bit to be set.) Thus a user trap handler sees the floating-point registers in the ucontext structure with the trapped default result already stored in the destination of the trapping instruction; if this destination is the same as one of the operand registers, that operand will not be available to the user trap handler.

The PowerPC 603 and 604 floating-point units implement all the floating-point instructions in the 32-bit PowerPC architecture except the fsqrt and fsqrts instructions. Since the Solaris kernel currently does not emulate unimplemented PowerPC instructions, attempting to execute an instruction that is not implemented in hardware results in an illegal instruction trap. Normally this causes the user's program to abort; alternatively, the user can install a SIGILL signal handler and handle the trap accordingly. Note that the compiler never generates fsqrt or fsqrts instructions itself, however, so unless a program includes user-supplied assembly code with these instructions, there is no need to install a SIGILL handler to emulate them.

Gradual Underflow

The PowerPC 603 and 604 all handle subnormal operands and results entirely in hardware. The NI (non-IEEE) mode bit in the FPSCR is ignored on the 603. On the 604, the NI mode bit may be used to obtain flush-to-zero treatment of underflow, but there is no performance benefit in doing so. For this reason, the -fns compiler flag and the nonstandard_arithmetic function in libsunmath (see Chapter 3, "The Math Libraries") have no effect on PowerPC systems.

Example: Using Fused Multiply-Add

The following code illustrates the explicit use of fused multiply-add operations to achieve better accuracy than a separate multiply and add. This C++ code implements the basic operations for a data type ddouble that simulates arithmetic with roughly twice the precision of double. To use this code, compile the C++ routines below together with the following inline template. (Do not use the -fsimple option.)


.inline fmsub,0 fmsub %f1,%f1,%f2,%f3 .end

#include <math.h> class ddouble { public: double hi, lo; }; // This algorithm is adapted from one devised by Prof. W. Kahan // in 1989. The following copyright notice appeared on the // original. // // (C) W. Kahan 1989 // // NOTICE: // Copyrighted programs may not be translated, used, nor // reproduced without the author's permission. Normally that // perfmission is granted freely for academic and scientific // purposes subject to the following three requirements: // 1. This NOTICE and the copyright notices must remain // attached to the programs and their translations. // 2. Users of such a program should regard themselves as // voluntary participants in the author's researches and // so are obliged to report their experience with the program // back to the author. // 3. Neither the author nor the institution that employs him // will be held responsible for the consequences of using a // program for which neither has received payment. // Would-be commercial users of these programs must correspond // with the author to arrange terms of payment and warranty. ddouble operator +(ddouble x, ddouble y) { ddouble z; double H, h, T, t, S, s, e, f; S = x.hi + y.hi; T = x.lo + y.lo; e = S - x.hi; f = T - x.lo; s = (y.hi - e) + (x.hi - (S - e)); t = (y.lo - f) + (x.lo - (T - f)); H = S + (s + T); h = (s + T) + (S - H); z.hi = H + (t + h); z.lo = (t + h) + (H - z.hi); return z; } ddouble operator -(ddouble x, ddouble y) { ddouble z; z.hi = -y.hi; z.lo = -y.lo; return x + z; } // The following algorithms are adapted from those published // by T. J. Dekker in "A Floating Point Technique for // the Available Precision", Numer. Math. 18 (1971), pp. 224-242. extern "C" { extern double fmsub(double, double, double); }; ddouble operator *(ddouble x, ddouble y) { ddouble z; double C, c; C = x.hi * y.hi; c = fmsub(x.hi, y.hi, C) + (x.hi * y.lo + x.lo * y.hi); z.hi = C + c; z.lo = c + (C - z.hi); return z; } ddouble operator /(ddouble x, ddouble y) { ddouble z; double C, c, U, u; C = x.hi / y.hi; U = C * y.hi; u = fmsub(C, y.hi, U); c = (((x.hi - U) - u) + x.lo - C * y.lo) / y.hi; z.hi = C + c; z.lo = c + (C - z.hi); return z; } ddouble sqrt(ddouble x) { ddouble z; double C, c, U, u; C = sqrt(x.hi); U = C * C; u = fmsub(C, C, U); c = (((x.hi - U) - u) + x.lo) / (C + C); z.hi = C + c; z.lo = c + (C - z.hi); return z; }