This chapter discusses the arithmetic model specified by ANSI/IEEE Std
754-1985, the IEEE Standard for Binary Floating-Point Arithmetic
(IEEE 754). All SPARC, Intel, and PowerPC computers use
IEEE arithmetic. All Sun compiler products support the features of IEEE
arithmetic.
This chapter is organized into the following sections:
- IEEE Arithmetic Model
- IEEE Formats
- Ranges and Precisions in Decimal Representation
- Underflow
IEEE Arithmetic Model
This section describes the IEEE Standard 754 specification.
What Is IEEE Arithmetic?
The IEEE Standard 754 specifies:
- Two basic floating-point formats: single and double. The IEEE single
format has a precision of 24 bits (a 24-bit significand) and occupies
32 bits overall. The IEEE double format has a precision of 53 bits and
occupies 64 bits overall.
- Two classes of extended floating-point formats: single extended and
double extended. The standard specifies only bounds for these formats;
for example, any format in the class of IEEE double extended formats
has a precision of at least 64 bits and occupies at least 79 bits
overall.
- Accuracy requirements on floating-point operations: add, subtract,
multiply, divide, square root, remainder, format conversions, and
compare. The remainder and compare operations must be exact. Each of
the other operations must deliver to its destination the exact result,
unless there is no such result or that result does not fit in the
destination's format. In the latter case, the operation must
minimally modify the exact result according to the rules of prescribed
rounding modes, presented below, and deliver the result so modified to
the operation's destination.
- Accuracy requirements for conversions between decimal strings and
binary floating-point numbers. For operands lying within specified
ranges, these conversions must produce exact results, if possible, or
minimally modify such exact results in accordance with the rules of the
prescribed rounding modes. For operands not lying within the specified
ranges, these conversions must produce results that differ from the
exact result by no more than a specified tolerance that depends on the
rounding mode.
- Five types of floating-point exceptions: invalid operation, division
by zero, overflow, underflow, and inexact.
The Standard supports user handling of exceptions, rounding, and
precision. Consequently, the Standard supports interval arithmetic and
diagnosis of anomalies. IEEE Standard 754 makes it possible to
standardize elementary functions like exp and
cos, to create very high-precision arithmetic, and to couple
numerical and symbolic algebraic computation.
IEEE Standard 754 floating-point arithmetic offers users greater
control over computation than does any other kind of floating-point
arithmetic. The IEEE Standard 754 simplifies the task of writing
numerically sophisticated, portable programs not only by imposing
rigorous requirements on conforming implementations, but also by
allowing such implementations to provide refinements of, and
enhancements to, the Standard itself.
IEEE Formats
This section describes how floating-point data is stored in memory. It
summarizes the precisions and ranges of the different IEEE storage
formats.
Storage Formats
A floating-point format is a data structure specifying the fields that
comprise a floating-point numeral, the layout of those fields, and
their arithmetic interpretation. A floating-point storage
format specifies how a floating-point format is stored in memory. The
IEEE standard defines the formats, but it leaves to implementors the
choice of storage formats.
Assembly language software sometimes relies on using the storage
formats, but higher level languages usually deal only with the
linguistic notions of floating-point data types. These types have
different names in different high-level languages, and correspond to
the IEEE formats as shown in Table 2-1.
IEEE Standard 754 specifies exactly the single and double
floating-point formats, and it defines a class of extended formats for
each of these two basic formats. The format called double extended in
Table 2-1 is one of the class of double extended formats defined by the
IEEE standard.
The following sections describe in detail each of the three storage
formats used for the IEEE floating-point formats.
Single Format
The IEEE single format consists of three fields: a 23-bit fraction,
f; an 8-bit biased exponent, e; and a 1-bit sign,
s. These fields are stored contiguously in one 32-bit word,
as shown in Figure 2-1. Bits
0:22 contain the 23-bit fraction, f, with bit 0 being the
least significant bit of the fraction and bit 22 being the most
significant; bits 23:30 contain the 8-bit biased exponent,
e, with bit 23 being the least significant bit of the biased
exponent and bit 30 being the most significant; and the highest-order
bit 31 contains the sign bit, s.
Figure 2-1 Single-Storage Format
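You can inspect this layout directly. The following sketch, in Fortran
90 (an illustration only; it assumes a compiler whose default REAL type
is the IEEE single format), uses the TRANSFER and IBITS intrinsics to
reinterpret the 32-bit pattern of a REAL and extract the three fields:

    program single_fields
      implicit none
      real :: x
      integer :: w, s, e, f
      x = -6.25                    ! -6.25 = -1.1001 (binary) x 2**2
      w = transfer(x, w)           ! reinterpret the 32-bit pattern as an integer
      s = ibits(w, 31, 1)          ! sign bit s (bit 31)
      e = ibits(w, 23, 8)          ! biased exponent e (bits 23:30)
      f = ibits(w, 0, 23)          ! fraction f (bits 0:22)
      print '(a,i1,a,i3,a,b23.23)', 's = ', s, ', e = ', e, ', f = ', f
    end program single_fields

For x = -6.25 = -1.1001 (binary) x 2^2, this prints s = 1, e = 129 (the
exponent 2 plus the bias 127), and f = 1001 followed by nineteen zeros.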
Table 2-2 shows the
correspondence between the values of the three constituent fields
s, e, and f, on the one hand, and the
value represented by the single-format bit pattern on the other;
u means don't care, that is, the value of the
indicated field is irrelevant to the determination of the value of the
particular bit patterns in single format.
Notice that when e < 255, the value assigned to the
single format bit pattern is formed by inserting the binary radix point
immediately to the left of the fraction's most significant bit, and
inserting an implicit bit immediately to the left of the binary point,
thus representing in binary positional notation a mixed number (whole
number plus fraction, wherein 0 <= fraction < 1).
The mixed number thus formed is called the single-format
significand. The implicit bit is so named because its value is not
explicitly given in the single-format bit pattern, but is implied by
the value of the biased exponent field.
For the single format, the difference between a normal number and a
subnormal number is that the leading bit of the significand (the bit to
the left of the binary point) of a normal number is 1, whereas the
leading bit of the significand of a subnormal number is 0. Single-format
subnormal numbers were called single-format denormalized numbers in
IEEE Standard 754.
The 23-bit fraction combined with the implicit leading significand bit
provides 24 bits of precision in single-format normal numbers.
Examples of important bit patterns in the single-storage format are
shown in Table 2-3. The
maximum positive normal number is the largest finite number
representable in IEEE single format. The minimum positive subnormal
number is the smallest positive number representable in IEEE single
format. The minimum positive normal number is often referred to as the
underflow threshold. (The decimal values for the maximum and minimum
normal and subnormal numbers are approximate; they are correct to the
number of figures shown.)
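The extreme values in Table 2-3 can be reproduced with the standard
Fortran 90 environmental intrinsics; a minimal sketch, again assuming
that REAL is IEEE single and that the system implements gradual
underflow:

    program single_limits
      implicit none
      print *, 'max normal:    ', huge(1.0)                ! ~3.402823e+38
      print *, 'min normal:    ', tiny(1.0)                ! ~1.175494e-38 (underflow threshold)
      print *, 'min subnormal: ', tiny(1.0) * epsilon(1.0) ! 2**(-126) * 2**(-23) = 2**(-149), ~1.4e-45
    end program single_limits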
A NaN (Not a Number) can be represented with any of the many
bit patterns that satisfy the definition of a NaN. The hex
value of the NaN shown in Table 2-3 is just one of the many
bit patterns that can be used to represent a NaN.
Double Format
The IEEE double format consists of three fields: a 52-bit fraction,
f; an 11-bit biased exponent, e; and a 1-bit
sign, s. These fields are stored contiguously in two
successively addressed 32-bit words, as shown in Figure 2-2.
In the SPARC architecture, the higher address 32-bit word contains the
least significant 32 bits of the fraction, while in the Intel and
PowerPC architectures the lower address 32-bit word contains the least
significant 32 bits of the fraction.
If we denote by f[31:0] the least significant 32 bits of the
fraction, then bit 0 is the least significant bit of the entire
fraction and bit 31 is the most significant of the 32 least significant
fraction bits.
In the other 32-bit word, bits 0:19 contain the 20 most significant
bits of the fraction, f[51:32], with bit 0 being the least
significant of these 20 most significant fraction bits, and bit 19
being the most significant bit of the entire fraction; bits 20:30
contain the 11-bit biased exponent, e, with bit 20 being the
least significant bit of the biased exponent and bit 30 being the most
significant; and the highest-order bit 31 contains the sign bit,
s.
Figure 2-2 numbers the bits as
though the two contiguous 32-bit words were one 64-bit word in which
bits 0:51 store the 52-bit fraction, f; bits 52:62 store the
11-bit biased exponent, e; and bit 63 stores the sign bit,
s.
Figure 2-2 Double-Storage Format
The values of the bit patterns in these three fields determine the
value represented by the overall bit pattern.
Table 2-4 shows the
correspondence between the values of the bits in the three constituent
fields, on the one hand, and the value represented by the double-
format bit pattern on the other; u means don't
care, because the value of the indicated field is irrelevant to
the determination of value for the particular bit pattern in double
format.
Notice that when e < 2047, the value assigned to the
double-format bit pattern is formed by inserting the binary radix point
immediately to the left of the fraction's most significant bit, and
inserting an implicit bit immediately to the left of the binary point.
The number thus formed is called the significand. The implicit
bit is so named because its value is not explicitly given in the
double-format bit pattern, but is implied by the value of the biased
exponent field.
For the double format, the difference between a normal number and a
subnormal number is that the leading bit of the significand (the bit to
the left of the binary point) of a normal number is 1, whereas the
leading bit of the significand of a subnormal number is 0.
Double-format subnormal numbers were called double-format denormalized
numbers in IEEE Standard 754.
The 52-bit fraction combined with the implicit leading significand bit
provides 53 bits of precision in double-format normal numbers.
Examples of important bit patterns in the double-storage format are
shown in Table 2-5. The bit
patterns in the second column appear as two 8-digit hexadecimal
numbers. For the SPARC architecture, the left one is the value of the
lower addressed 32-bit word, and the right one is the value of the
higher addressed 32-bit word, while for the Intel and PowerPC
architectures, the left one is the higher addressed word, and the right
one is the lower addressed word. The maximum positive normal number is
the largest finite number representable in the IEEE double format. The
minimum positive subnormal number is the smallest positive number
representable in IEEE double format. The minimum positive normal number
is often referred to as the underflow threshold. (The decimal values
for the maximum and minimum normal and subnormal numbers are
approximate; they are correct to the number of figures shown.)
A NaN (Not a Number) can be represented by any of the many bit patterns
that satisfy the definition of NaN. The hex value of the NaN shown in
Table 2-5 is just one of the
many bit patterns that can be used to represent a NaN.
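As an illustration, the following Fortran 90 sketch (it assumes that
DOUBLE PRECISION is IEEE double and that a 64-bit integer kind is
available) prints a few double-format bit patterns as single 16-digit
hexadecimal numbers, most significant word first:

    program double_patterns
      implicit none
      integer, parameter :: i8 = selected_int_kind(18)   ! a 64-bit integer kind
      integer(i8) :: w
      w = transfer(1.0d0, w)
      print '(a,z16.16)', '+1.0:        ', w              ! 3FF0000000000000
      w = transfer(-2.0d0, w)
      print '(a,z16.16)', '-2.0:        ', w              ! C000000000000000
      w = transfer(tiny(1.0d0), w)
      print '(a,z16.16)', 'min normal:  ', w              ! 0010000000000000
    end program double_patterns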
Double-Extended Format (SPARC and PowerPC)
These floating-point environments' quadruple-precision format
conforms to the IEEE definition of double-extended format. The
quadruple-precision format occupies four 32-bit words and consists of
three fields: a 112-bit fraction, f; a 15-bit biased exponent, e; and a
1-bit sign, s. These are stored contiguously as shown in Figure 2-3.
In the SPARC architecture, the highest addressed 32-bit word contains
the least significant 32-bits of the fraction, denoted f[31:0], while
in the PowerPC architecture, the lowest addressed 32-bit word contains
these bits. The next two 32-bit words (counting downwards on the SPARC
architecture and upwards on the PowerPC architecture) contain f[63:32] and
f[95:64], respectively. Bits 0:15 of the next word contain the 16 most
significant bits of the fraction, f[111:96], with bit 0 being the least
significant of these 16 bits, and bit 15 being the most significant bit
of the entire fraction. Bits 16:30 contain the 15-bit biased exponent,
e, with bit 16 being the least significant bit of the biased exponent
and bit 30 being the most significant; and bit 31 contains the sign
bit, s.
Figure 2-3 numbers the bits as
though the four contiguous 32-bit words were one 128-bit word in which
bits 0:111 store the fraction, f; bits 112:126 store the 15-bit biased
exponent, e; and bit 127 stores the sign bit, s.

Figure 2-3 Double-Extended Format (SPARC and
PowerPC)
The values of the bit patterns in the three fields f, e, and s,
determine the value represented by the overall bit pattern.
Table 2-6 shows the
correspondence between the values of the three constituent fields and
the value represented by the bit pattern in quadruple-precision format.
u means don't care, because the value of the indicated field is
irrelevant to the determination of values for the particular bit
patterns.
Examples of important bit patterns in the quadruple-precision
double-extended storage format are shown in Table 2-7. The bit patterns in
the second column appear as four 8-digit hexadecimal numbers. For the
SPARC architecture, the left-most number is the value of the lowest
addressed 32-bit word, and the right-most number is the value of the
highest addressed 32-bit word, while for the PowerPC architecture, the
left number is the highest addressed word, and the right number is the
lowest addressed word. The maximum positive normal number is the
largest finite number representable in the quadruple precision format.
The minimum positive subnormal number is the smallest positive number
representable in the quadruple precision format. The minimum positive
normal number is often referred to as the underflow threshold. (The
decimal values for the maximum and minimum normal and subnormal numbers
are approximate; they are correct to the number of figures shown.)
The hex values of the NaNs shown in Table 2-7 are just two of the
many bit patterns that can be used to represent NaNs.
Double-Extended Format (Intel)
This floating-point environment's double-extended format conforms
to the IEEE definition of double-extended formats. It consists of four
fields: a 63-bit fraction, f; a 1-bit explicit leading
significand bit, j; a 15-bit biased exponent, e;
and a 1-bit sign, s.
In the family of Intel architectures, these fields are stored
contiguously in ten successively addressed 8-bit bytes. However, the
UNIX System V Application Binary Interface Intel 386 Processor
Supplement (Intel ABI) requires that double-extended parameters and
results occupy three consecutively addressed 32-bit words in the stack,
with the most significant 16 bits of the highest addressed word being
unused, as shown in Figure 2-4.
The lowest addressed 32-bit word contains the least significant 32 bits
of the fraction, f[31:0], with bit 0 being the least
significant bit of the entire fraction and bit 31 being the most
significant of the 32 least significant fraction bits. In the middle
addressed 32-bit word, bits 0:30 contain the 31 most significant bits
of the fraction, f[62:32], with bit 0 being the least
significant of these 31 most significant fraction bits, and bit 30
being the most significant bit of the entire fraction; bit 31 of this
middle addressed 32-bit word contains the explicit leading significand
bit, j.
In the highest addressed 32-bit word, bits 0:14 contain the 15-bit
biased exponent, e, with bit 0 being the least significant
bit of the biased exponent and bit 14 being the most significant; and
bit 15 contains the sign bit, s. Although the highest order
16 bits of this highest addressed 32-bit word are unused by the family
of Intel architectures, their presence is essential for conformity to
the Intel ABI, as indicated above.
Figure 2-4 numbers the bits as
though the three contiguous 32-bit words were one 96-bit word in which
bits 0:62 store the 63-bit fraction, f; bit 63 stores the
explicit leading significand bit, j; bits 64:78 store the
15-bit biased exponent, e; and bit 79 stores the sign
bit, s.

Figure 2-4 Double-Extended Format (Intel)
The values of the bit patterns in the four fields f,
j, e and s, determine the value
represented by the overall bit pattern.
Table 2-8 shows the
correspondence between the counting number values of the four
constituent fields and the value represented by the bit pattern.
u means don't care, because the value of the
indicated field is irrelevant to the determination of value for the
particular bit patterns.
Notice that bit patterns in double-extended format do not have
an implicit leading significand bit. The leading significand bit is
given explicitly as a separate field, j, in the
double-extended format. However, when e is nonzero,
any bit pattern with j = 0 is
unsupported in the sense that using such a bit pattern as an operand in
floating-point operations provokes an invalid operation exception.
The union of the disjoint fields j and f in the
double extended format is called the significand. When
e < 32767 and j = 1, or when e = 0
and j = 0, the significand is formed by inserting the binary
radix point between the leading significand bit, j, and the
fraction's most significant bit.
For the double-extended format, the difference between a normal number
and a subnormal number is that the explicit leading bit of the
significand of a normal number is 1, whereas the explicit leading bit
of the significand of a subnormal number is 0 and the biased exponent
field e must also be 0. Subnormal numbers in double-extended format
were called double-extended format denormalized numbers in IEEE
Standard 754.
Examples of important bit patterns in the double-extended storage
format appear in Table 2-9.
The bit patterns in the second column appear as one
4-digit
hexadecimal counting number, which is the value of the 16 least
significant bits of the highest addressed 32-bit word (recall that the
most significant 16 bits of this highest addressed 32-bit word are
unused, so their value is not shown), followed by two 8-digit
hexadecimal counting numbers, of which the left one is the value of the
middle addressed 32-bit word, and the right one is the value of the
lowest addressed 32-bit word. The maximum positive normal number is the
largest finite number representable in the Intel double-extended
format. The minimum positive subnormal number is the smallest positive
number representable in the double-extended format. The minimum
positive normal number is often referred to as the underflow threshold.
(The decimal values for the maximum and minimum normal and subnormal
numbers are approximate; they are correct to the number of figures
shown.)
A NaN (Not a Number) can be represented by any of the many
bit patterns that satisfy the definition of NaN. The hex
values of the NaNs shown in Table 2-9 illustrate that the
leading (most significant) bit of the fraction field determines whether
a NaN is quiet (leading fraction bit = 1) or signaling
(leading fraction bit = 0).
Ranges and Precisions in Decimal
Representation
This section covers the notions of range and precision for a given
storage format. It includes the ranges and precisions corresponding to
the IEEE single and double formats, and to the implementations of IEEE
double-extended format to SPARC, PowerPC, and Intel architectures. In
explaining the notions of range and precision, reference is made to the
IEEE single format.
The IEEE standard specifies that 32 bits be used to represent a
floating point number in single format. Because there are only finitely
many combinations of 32 zeroes and ones, only finitely many numbers can
be represented by 32 bits.
One natural question is:
- What are the decimal representations of the largest and smallest
positive numbers that can be represented in this particular format?
Rephrase the question and introduce the notion of range:
- What is the range, in decimal notation, of numbers that can be
represented by the IEEE single format?
Taking into account the precise definition of IEEE single format, you
can prove that the range of floating-point numbers that can be
represented in IEEE single format (if restricted to positive normalized
numbers) is as follows:
1.175... x 10^-38 to 3.402... x 10^+38
A second question refers to the precision (or as many people refer to
it, the accuracy, or the number of significant digits) of the numbers
represented in a given format. These notions are explained by looking
at some pictures and examples.
The IEEE standard for binary floating-point arithmetic specifies the
set of numerical values representable in the single format. Remember
that this set of numerical values is described as a set of binary
floating-point numbers. The fraction of the IEEE single format has
23 bits, which, together with the implicit leading bit, yield 24 digits
(bits) of (binary) precision.
You obtain a different set of numerical values by marking the
numbers:
x = (x1.x2x3...xq) x 10^n
(representable by q decimal digits in the significand) on
the number line.
Figure 2-5 exemplifies this
situation:

Figure 2-5 Comparison of a Set of Numbers Defined by Decimal and
Binary Representation
Notice that the two sets are different. Therefore, estimating the
number q of significant decimal digits corresponding to 24
significant binary digits requires reformulating the problem.
Reformulate the problem in terms of converting floating-point numbers
between binary representations (the internal format used by the
computer) and the decimal format (the format users are usually
interested in). In fact, you may want to convert from decimal to binary
and back to decimal, as well as convert from binary to decimal and back
to binary.
It is important to notice that because the sets of numbers are
different, conversions are in general inexact. If done correctly,
converting a number from one set to a number in the other set results
in choosing one of the two neighboring numbers from the second set
(which one specifically is a question related to rounding).
Consider some examples. Assume you are trying to represent the number
with the following decimal representation in IEEE single format:
x = x1.x2x3... x 10^n
In the above example, the information contained in x has to be
coded in a 32-bit word. Generally, this might be impossible; if there
are too many digits in x, some of the
information will not fit in 32 bits. For example, take:
y = 838861.2, z = 1.3
and run a FORTRAN program like the following sketch, which stores
y and z in single-precision variables and prints them to 11 decimal
digits:
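      program conv
      real y, z
      y = 838861.2
      z = 1.3
      write(*,40) y
   40 format('y: ',1pe18.11)
      write(*,50) z
   50 format('z: ',1pe18.11)
      end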
The output from this program should be similar to:
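    y:  8.38861187500E+05
    z:  1.29999995232E+00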
The difference between the value 8.388612 x 10^5 assigned to
y and the value printed out is 0.000000125, which is seven
decimal orders of magnitude smaller than y. The accuracy of
representing y in IEEE single format is about 6 to 7
significant digits; in other words, y has about six
significant digits if it is to be represented in IEEE single
format.
Similarly, the difference between the value 1.3 assigned to z
and the value printed out is 0.00000004768, which is eight decimal
orders of magnitude smaller than z. The accuracy of
representing z in IEEE single format is about 7 to 8
significant digits; in other words, z has about seven significant
digits if it is to be represented in IEEE single format.
Now formulate the question:
- Assume you convert a decimal floating-point number a to
its IEEE single format binary representation b, and then
translate b back to a decimal number c; how many
orders of magnitude lie between a and a -
c?
Rephrase the question:
- What is the number of significant decimal digits of
a in the IEEE single format representation, or how many
decimal digits are to be trusted as accurate when one represents
a in IEEE single format?
The number of significant decimal digits is always between 6 and 9,
that is, at least 6 digits, but not more than 9 digits are accurate
(with the exception of cases when the conversions are exact, when
infinitely many digits could be accurate).
Conversely, if you convert a binary number in IEEE single format to a
decimal number, and then convert it back to binary, generally, you need
to use at least 9 decimal digits to ensure that after these two
conversions you obtain the number you started from.
The complete picture is given in Table 2-10:
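The conversion side of this picture can be checked empirically. The
following Fortran 90 sketch (assuming REAL is IEEE single and correctly
rounded base conversion) writes the single-format number just above 1.0
with 9 significant decimal digits and reads it back:

    program roundtrip
      implicit none
      real :: a, c
      character(len=16) :: s
      a = nearest(1.0, 1.0)        ! 1 + 2**(-23), differs from 1.0 only in the last bit
      write(s, '(1pe16.9)') a      ! convert to decimal with 9 significant digits
      read(s, *) c                 ! convert the decimal string back to binary
      print *, a == c              ! T: 9 decimal digits round-trip IEEE single exactly
    end program roundtrip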
Underflow
Underflow occurs, roughly speaking, when the result of an arithmetic
operation is so small that it cannot be stored in its intended
destination format without suffering a rounding error that is larger
than usual.
Underflow Thresholds
Table 2-11 shows the underflow
thresholds for single, double, and double-extended precision.
The positive subnormal numbers are those numbers between the smallest
normal number and zero. Subtracting two (positive) tiny numbers that
are near the smallest normal number might produce a subnormal number.
Or, dividing the smallest positive normal number by two produces a
subnormal result.
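A minimal sketch of the second case, assuming IEEE single arithmetic
with gradual underflow enabled:

    program subnormal_demo
      implicit none
      real :: x, y
      x = tiny(1.0)        ! smallest positive normal number, 2**(-126)
      y = x / 2.0          ! 2**(-127): subnormal under gradual underflow
      print *, y           ! ~5.877472e-39, not zero
      print *, y > 0.0     ! T: the underflowed quotient retains its value
    end program subnormal_demo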
The presence of subnormal numbers provides greater precision to
floating-point calculations that involve small numbers, although the
subnormal numbers themselves have fewer bits of precision than normal
numbers. Producing subnormal numbers (rather than returning the answer
zero) when the mathematically correct result has magnitude less than
the smallest positive normal number is known as gradual underflow.
There are several other ways to deal with such underflow
results. One way, common in the past, was to flush those results to
zero. This method is known as Store 0 and was the default on
most mainframes before the advent of the IEEE Standard.
The mathematicians and computer designers who drafted IEEE Standard 754
considered several alternatives while balancing the desire for a
mathematically robust solution with the need to create a standard that
could be implemented efficiently.
How Does IEEE Arithmetic Treat Underflow?
IEEE Standard 754 chooses gradual underflow as the preferred method for
dealing with underflow results. This method amounts to defining two
representations for stored values, normal and subnormal.
Recall that the IEEE format for a normal floating-point number is:
(-1)^s x 2^(e-bias) x 1.f
where s is the sign bit, e is the biased exponent,
and f is the fraction. Only s, e, and
f need to be stored to fully specify the number. Because the
implicit leading bit of the significand is defined to be 1 for normal
numbers, it need not be stored.
The smallest positive normal number that can be stored, then, has the
negative exponent of greatest magnitude and a fraction of all zeros.
Even smaller numbers can be accommodated by considering the leading bit
to be zero rather than one. In the double-precision format, this
effectively extends the minimum exponent from 10^-308 to
10^-324, because
the fraction part is 52 bits long (roughly 16 decimal digits). These
are the subnormal numbers; returning a subnormal number
(rather than flushing an underflowed result to zero) is gradual
underflow.
Clearly, the smaller a subnormal number, the fewer nonzero bits in its
fraction; computations producing subnormal results do not enjoy the
same bounds on relative roundoff error as computations on normal
operands. However, the key fact about gradual underflow is that its use
implies that underflowed results need never suffer a loss of accuracy
any greater than that which ordinarily accompanies rounding error.
Recall that the IEEE format for a subnormal floating-point number
is:
(-1)^s x 2^(-bias+1) x 0.f
where s is the sign bit, the biased exponent e is
zero, and f is the fraction. Note that the implicit
power-of-two bias is one greater than the bias in the normal format,
and the implicit leading bit of the fraction is zero.
Gradual underflow allows you to extend the lower range of representable
numbers. It is not smallness that renders a value
questionable, but its associated error. Algorithms that exploit
subnormal numbers have smaller error bounds than the same algorithms
running on systems that lack them. The next section
provides some mathematical justification for gradual underflow.
Why Gradual Underflow?
The purpose of subnormal numbers is not to avoid underflow/overflow
entirely, as some other arithmetic models do. Rather, subnormal numbers
eliminate underflow as a cause for concern for a variety of
computations (typically, multiply followed by add). For a more detailed
discussion, see "Underflow and the Reliability of Numerical
Software" by James Demmel and "Combatting the Effects
of Underflow and Overflow in Determining Real Roots of
Polynomials" by S. Linnainmaa.
The presence of subnormal numbers in the arithmetic means that
untrapped underflow (which implies loss of accuracy) cannot occur on
addition or subtraction. If x and y are within a
factor of two, then x - y is error-free. This is critical
to a number of algorithms that effectively increase the working
precision at critical places in algorithms.
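A small sketch of this property, assuming IEEE single arithmetic with
gradual underflow (in a Store 0 system, x - y below would be flushed to
zero even though x and y differ):

    program exact_subtract
      implicit none
      real :: x, y
      x = 2.0 * tiny(1.0)          ! 2**(-125), a small normal number
      y = 1.5 * tiny(1.0)          ! within a factor of two of x
      print *, x - y               ! 2**(-127): a subnormal, computed exactly
      print *, (x - y) + y == x    ! T: no accuracy was lost in the subtraction
    end program exact_subtract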
In addition, gradual underflow means that errors due to underflow are
no worse than usual roundoff error. This is a much stronger statement
than can be made about any other method of handling underflow, and this
fact is one of the best justifications for gradual underflow.
Error Properties of Gradual Underflow
Most of the time, floating-point results are rounded:
- computed result = (true result) ± roundoff
In IEEE arithmetic, with rounding mode to nearest,
- 0 <= |roundoff| <= 1/2 ulp
of the computed result.
ulp is an acronym for Unit in the Last Place. The least
significant bit of the fraction of a number in its standard
representation is the last place. If the roundoff error is
less than or equal to one half unit in the last place, then the
calculation is correctly rounded.
For example, an ulp of unity for each floating-point data type is
shown in Table 2-12:
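In Fortran 90, the EPSILON intrinsic returns exactly this quantity for
each real type; a minimal sketch:

    program ulp_of_unity
      implicit none
      ! EPSILON returns the spacing between 1.0 and the next larger
      ! representable number, i.e. one ulp of unity.
      print *, 'single ulp of 1.0: ', epsilon(1.0)    ! 2**(-23) ~ 1.192093e-07
      print *, 'double ulp of 1.0: ', epsilon(1.0d0)  ! 2**(-52) ~ 2.220446e-16
    end program ulp_of_unity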
Recall that only a finite set of numbers can be exactly represented in
any computer arithmetic. As the magnitudes of numbers get smaller and
approach zero, the gap between neighboring representable numbers never
widens but narrows. Conversely, as the magnitude of numbers gets
larger, the gap between neighboring representable numbers widens.
For example, imagine you are using a binary arithmetic that has only 3
bits of precision. Then, between any two powers of 2, there are 2^3
= 8 representable numbers, as shown in Figure 2-6.

Figure 2-6 Number Line
The number line shows how the gap between numbers doubles from one
exponent to the next.
In the IEEE single format, the difference in magnitude between the two
smallest positive subnormal numbers is approximately 10^-45,
whereas the difference in magnitude between the two largest finite numbers
is approximately 10^31!
In Table 2-13, nextafter(x, +∞) denotes the next representable number
after x as you move along the number line towards +∞.
Any conventional set of representable floating-point numbers has the
property that the worst effect of one inexact result is to introduce an
error no worse than the distance to one of the representable neighbors
of the computed result. When subnormal numbers are added to the
representable set and gradual underflow is implemented, the worst
effect of one inexact or underflowed result is to introduce an
error no greater than the distance to one of the representable
neighbors of the computed result.
In particular, in the region between zero and the smallest
normal number, the distance between any two neighboring
numbers equals the distance between zero and the smallest
subnormal number. The presence of subnormal numbers eliminates
the possibility of introducing a roundoff error that is greater than
the distance to the nearest representable number.
Because no calculation incurs roundoff error greater than the distance
to any of the representable neighbors of the computed result, many
important properties of a robust arithmetic environment hold, including
these three:
- x ≠ y if and only if x - y ≠ 0
- (x - y) + y ≈ x, to within a rounding error in the larger of x and y
- 1/(1/x) ≈ x, when x is a normalized number, implying 1/x ≠ 0
An alternative underflow scheme is Store 0, which
flushes underflow results to zero. Store 0
violates the first and second properties whenever x-y
underflows. Also, Store 0 violates the third
property whenever 1/x underflows.
Let λ represent the smallest positive
normalized number, which is also known as the underflow threshold. Then
the error properties of gradual underflow and Store
0 can be compared in terms of λ:
gradual underflow: |error| < 1/2 ulp in λ
Store 0: |error| ≈ λ
There is a significant difference between 1/2 unit in the last place of
λ, and λ itself.
Two Examples of Gradual Underflow Versus
Store 0
The following are two well-known mathematical examples. The first
example is an inner product.
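The computation has the following shape: partial products a(i)*y(i)
can underflow even when the true sum is unexceptional (the listing
below is an illustrative sketch, not a benchmark):

      real function inner(a, y, n)
      integer n, i
      real a(n), y(n)
      inner = 0.0
      do 10 i = 1, n
         inner = inner + a(i) * y(i)
   10 continue
      return
      end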
With gradual underflow, the result is as accurate as roundoff
allows. In Store 0, a small but nonzero sum could be
delivered that looks plausible but is wrong in nearly every digit.
However, in fairness, it must be admitted that to avoid just these
sorts of problems, clever programmers scale their calculations if they
are able to anticipate where minuteness might degrade accuracy.
The second example, deriving a complex quotient, isn't amenable to
scaling:
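One standard formulation is Smith's algorithm, sketched below with the
variable names p, q, r, s, a, and b used in the discussion that
follows; it computes a + i · b = (p + i · q)/(r + i · s):

      subroutine cdiv(p, q, r, s, a, b)
*     a + i*b = (p + i*q) / (r + i*s), by Smith's method: divide
*     through by the larger of |r|, |s| so that intermediate
*     products stay in range.
      real p, q, r, s, a, b, t, d
      if (abs(r) .ge. abs(s)) then
         t = s / r
         d = r + t * s
         a = (p + t * q) / d
         b = (q - t * p) / d
      else
         t = r / s
         d = s + t * r
         a = (t * p + q) / d
         b = (t * q - p) / d
      end if
      return
      end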
It can be shown that, despite roundoff, the computed complex result
differs from the exact result by no more than what would have been the
exact result if p + i · q and
r + i · s each had been perturbed by no
more than a few ulps. This error analysis holds in the face of
underflows, except that when both a and b underflow,
the error is bounded by a few ulps of |a + i
· b|. Neither conclusion is true when underflows are
flushed to zero.
This algorithm for computing a complex quotient is robust, and amenable
to error analysis, in the presence of gradual underflow. A similarly
robust, easily analyzed, and efficient algorithm for computing the
complex quotient in the face of Store 0 does
not exist. In Store 0, the burden of
worrying about low-level, complicated details shifts from the
implementor of the floating-point environment to its users.
The class of problems that succeed in the presence of gradual
underflow, but fail with Store 0, is larger than
the fans of Store 0 may realize. Many frequently
used numerical techniques fall in this class:
- Linear equation solving
- Polynomial equation solving
- Numerical integration
- Convergence acceleration
- Complex division
Does Underflow Matter?
Despite these examples, it can be argued that underflow rarely matters,
and so, why bother? However, this argument turns upon itself.
In the absence of gradual underflow, user programs need to be sensitive
to the implicit inaccuracy threshold. For example, in single precision,
if underflow occurs in some parts of a calculation, and
Store 0 is used to replace underflowed results
with 0, then accuracy can be guaranteed only to around
10^-31, not 10^-38, the usual lower range for
single-precision exponents.
This means that programmers need to implement their own method of
detecting when they are approaching this inaccuracy threshold, or else
abandon the quest for a robust, stable implementation of their
algorithm.
Some algorithms can be scaled so that computations don't take place
in the constricted area near zero. However, scaling the algorithm and
detecting the inaccuracy threshold can be difficult and time-consuming
for each numerical program.