Floating point/Lesson Four

Single Precision

Most computers store floating point numbers according to a standard set by the IEEE (IEEE 754). This lets computer scientists concentrate on analyzing the error instead of learning how each particular computer operates.

Single precision numbers are numbers stored according to these rules:

1) Numbers are converted to ±k × 2^m, where k is a binary number of the form 1.f and m is the exponent. The number k satisfies 1 ≤ k < 2, so its first digit is always 1; that digit is assumed rather than stored.

2) The number k is rounded to 24 significant bits (the leading 1 plus 23 fraction bits).

3) The exponent can be from -126 to +127. It is stored in the computer as m + 127.

4) The computer stores the following:

a) The sign bit (1 if negative, 0 if positive)

b) The biased exponent (8 bits, 00000001 to 11111110, signifying 1 to 254, but actually representing -126 to 127).

c) The digits of k. This field is called the mantissa. Because k always begins with a 1, the 1 is not stored, but the next 23 digits are. The 1 is called the hidden bit.

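The radix point (binary point) sits between the hidden 1 and the stored mantissa bits, so in the stored layout it falls at the boundary between the exponent field and the mantissa field. See the IEEE Single Precision picture.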
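The storage rules above can be tried out with a minimal Python sketch that uses only the standard library's struct module. The names float32_fields and show are illustrative and not part of the lesson, and the sketch assumes the platform stores singles in IEEE format.

```python
import struct

def float32_fields(x):
    """Split x, rounded to single precision, into its three stored fields."""
    # Pack as a big-endian IEEE single, then reread the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit: 1 if negative, 0 if positive
    biased_exp = (bits >> 23) & 0xFF   # 8 bits: stores m + 127
    mantissa = bits & 0x7FFFFF         # 23 bits: the hidden 1 is not stored
    return sign, biased_exp, mantissa

def show(x):
    """Format the fields as 'sign exponent mantissa' bit strings."""
    s, e, f = float32_fields(x)
    return f"{s} {e:08b} {f:023b}"

for value in (1.0, -2.0, 0.15625):
    print(f"{value:>8}: {show(value)}")
```

For 0.15625 = 1.01₂ × 2^-3, the exponent field should hold -3 + 127 = 124 and the mantissa should start with 01.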

Single Precision Example

The number 1 is stored as follows (1 is equal to 1.0 × 2^0, so m = 0):

0        01111111           00000000000000000000000
positive exponent equal to  the hidden '1' is not stored;
number   127 (= 0 + 127)    the next 23 bits are all 0
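Reading the example in the other direction, the following Python sketch rebuilds the value from the three fields and compares it with what the machine makes of the same 32 bits. The function name decode_normalized is hypothetical, and the sketch only handles normalized numbers (biased exponent 1 to 254).

```python
import struct

def decode_normalized(sign, biased_exp, mantissa):
    """Value of a normalized single from its fields (1 <= biased_exp <= 254)."""
    k = 1 + mantissa / 2**23   # restore the hidden bit: k = 1.f
    m = biased_exp - 127       # remove the bias
    return (-1) ** sign * k * 2.0 ** m

value = decode_normalized(0, 0b01111111, 0)
print(value)

# Cross-check: assemble the same 32 bits and let the machine interpret them.
bits = (0 << 31) | (0b01111111 << 23) | 0
(machine_value,) = struct.unpack(">f", struct.pack(">I", bits))
print(machine_value == value)
```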

Machine Epsilon, Single Precision

Machine epsilon, as a reminder, is the gap between 1 and the next larger machine number; with the default rounding it is the smallest power of 2 for which 1 + ε ≠ 1 on the machine. The mantissa in the example above has 23 stored bits, so the first number after 1 has the mantissa 00000000000000000000001, which represents 1 + 2^-23. Thus, ε = 2^-23.
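One way to see this is to take the bit pattern of 1, add one to it as an integer, and look at the number that results. The Python sketch below does this with the standard struct module; bits_to_float32 is an illustrative helper, not a library function.

```python
import struct

def bits_to_float32(bits):
    """Interpret a 32-bit integer as an IEEE single and return its value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

# Sign 0, biased exponent 127, mantissa all zeros: the number 1.
one_bits = 0b0_01111111_00000000000000000000000

# Adding 1 to the integer sets the last mantissa bit.
next_after_one = bits_to_float32(one_bits + 1)
eps = next_after_one - 1.0

print(eps)
print(eps == 2.0 ** -23)
```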

Matlab Code Example

Here, we demonstrate that machine epsilon is indeed 2^-23. The command 'single' forces Matlab to store the number in single precision. Adding 2^-25 or 2^-24 to 1 gives a result that rounds back to 1, while adding 2^-23 does not.

>> single ( 1 - single ( 1 + 2^(-25) ) )

ans =

     0

>> single ( 1 - single ( 1 + 2^(-24) ) )

ans =

     0

>> single ( 1 - single ( 1 + 2^(-23) ) )

ans =

 -1.1921e-007
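For readers without Matlab, the same experiment can be sketched in Python. The helper to_single is an assumption of this sketch: it rounds a Python float (a double) to single precision by packing it with the standard struct module, which relies on the platform rounding to nearest with ties to even, the IEEE default.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return y

for p in (25, 24, 23):
    # 1 + 2^-p is exact in double precision; to_single then rounds it.
    diff = 1.0 - to_single(1.0 + 2.0 ** -p)
    print(f"1 - single(1 + 2^-{p}) = {diff!r}")
```

The case 2^-24 lies exactly halfway between 1 and 1 + 2^-23, and the tie goes to the neighbor whose last mantissa bit is 0, which is 1.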

Double Precision

Double precision operates in the same manner as single precision, except more space is allocated to a number. Again, we have 1 sign bit, but we also have 11 bits for the exponent and 52 bits for the mantissa. The exponent is biased again, but this time, it is by 1023. Machine epsilon is 2^-52.
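Python's own float type is a double on the usual IEEE platforms, so the layout can be inspected directly. The sketch below is a minimal illustration under that assumption; float64_fields is a hypothetical name, while sys.float_info.epsilon is the standard library's value for machine epsilon.

```python
import struct
import sys

def float64_fields(x):
    """Split a double into sign (1 bit), biased exponent (11), mantissa (52)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF      # stores m + 1023
    mantissa = bits & ((1 << 52) - 1)      # hidden 1 not stored
    return sign, biased_exp, mantissa

print(float64_fields(1.0))                  # exponent field should be 1023
print(sys.float_info.epsilon == 2.0 ** -52)
```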

Interesting Proof

Here, we prove that the relative error of storing a number in single precision (indeed, any precision) with rounding is at most machine epsilon divided by 2. Chopping (not rounding) results in a relative error of at most ε.

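Write the actual number as x = (0.1b₂b₃b₄...)₂ × 2^k. Here the leading 1 is placed just after the radix point, and k denotes the exponent, which is one larger than the exponent m of the 1.f form used earlier. Denote by x₋ the machine number just below x, and by x₊ the machine number just above. In single precision, x₋ = (0.1b₂b₃...b₂₄)₂ × 2^k. Additionally, x₊ = [(0.1b₂b₃...b₂₄)₂ + 2^-24] × 2^k, so x₊ - x₋ = 2^{-24 + k}.

Assume without loss of generality that x is closer to x₋. Then, |x - x₋| ≤ (1/2) |x₊ - x₋| = 2^{-25 + k}.

Then, since (0.1b₂b₃b₄...)₂ ≥ 1/2, $\left|{\frac {x-x_{-}}{x}}\right|\leq {\frac {2^{-25+k}}{(0.1b_{2}b_{3}b_{4}\ldots )_{2}\times 2^{k}}}\leq {\frac {2^{-25}}{\frac {1}{2}}}=2^{-24}={\frac {1}{2}}\epsilon$ .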
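The two bounds can also be checked numerically. The Python sketch below is an illustration, not a proof: it draws random doubles, stores them in single precision by rounding and by chopping, and records the worst relative error seen. The helpers round_to_single and chop_to_single, the sample range, and the seed are all assumptions of this sketch.

```python
import random
import struct

EPS = 2.0 ** -23  # single precision machine epsilon

def round_to_single(x):
    """Round a double to the nearest single precision value."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return y

def chop_to_single(x):
    """Drop, rather than round, the bits beyond the 24-bit significand."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    # A double stores 52 fraction bits; a single keeps 23, so clear the low 29.
    bits &= ~((1 << 29) - 1)
    (y,) = struct.unpack(">d", struct.pack(">Q", bits))
    return y

random.seed(0)
# Positive samples well inside the normalized single precision range.
samples = [random.uniform(1e-3, 1e3) for _ in range(10_000)]

worst_round = max(abs(x - round_to_single(x)) / x for x in samples)
worst_chop = max(abs(x - chop_to_single(x)) / x for x in samples)

print(worst_round <= EPS / 2, worst_chop <= EPS)
```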

Special Numbers

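The biased exponents 0 and 255 are reserved for special numbers: zero and the denormalized numbers use exponent 0, while the infinities and NaN use exponent 255. The table lists them alongside a few ordinary values for comparison: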

Type	Sign	Exponent	Significand	Value
Zero	0	0000 0000	000 0000 0000 0000 0000 0000	0.0
One	0	0111 1111	000 0000 0000 0000 0000 0000	1.0
Minus One	1	0111 1111	000 0000 0000 0000 0000 0000	−1.0
Smallest denormalized number	*	0000 0000	000 0000 0000 0000 0000 0001	±2⁻²³ × 2⁻¹²⁶ = ±2⁻¹⁴⁹ ≈ ±1.4 × 10⁻⁴⁵
"Middle" denormalized number	*	0000 0000	100 0000 0000 0000 0000 0000	±2⁻¹ × 2⁻¹²⁶ = ±2⁻¹²⁷ ≈ ±5.88 × 10⁻³⁹
Largest denormalized number	*	0000 0000	111 1111 1111 1111 1111 1111	±(1−2⁻²³) × 2⁻¹²⁶ ≈ ±1.18 × 10⁻³⁸
Smallest normalized number	*	0000 0001	000 0000 0000 0000 0000 0000	±2⁻¹²⁶ ≈ ±1.18 × 10⁻³⁸
Largest normalized number	*	1111 1110	111 1111 1111 1111 1111 1111	±(2−2⁻²³) × 2¹²⁷ ≈ ±3.4 × 10³⁸
Positive infinity	0	1111 1111	000 0000 0000 0000 0000 0000	+∞
Negative infinity	1	1111 1111	000 0000 0000 0000 0000 0000	−∞
Not a number	*	1111 1111	non zero	NaN
* Sign bit can be either 0 or 1.

(copied from IEEE 754-1985)
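The rows of the table can be reproduced from the exponent and significand fields alone. The following Python sketch classifies a 32-bit pattern and prints the value the machine assigns to it; classify and bits_to_float32 are illustrative names, and the hexadecimal patterns are the table's bit strings written out with a sign bit of 0.

```python
import struct

def classify(bits):
    """Name the kind of single precision number a 32-bit pattern encodes."""
    biased_exp = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if biased_exp == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if biased_exp == 255:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

def bits_to_float32(bits):
    """Interpret a 32-bit integer as an IEEE single and return its value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

patterns = {
    "smallest denormalized": 0x00000001,
    "largest denormalized": 0x007FFFFF,
    "smallest normalized": 0x00800000,
    "largest normalized": 0x7F7FFFFF,
    "positive infinity": 0x7F800000,
    "a NaN": 0x7FC00000,
}

for name, bits in patterns.items():
    print(f"{name:22} {classify(bits):12} {bits_to_float32(bits)!r}")
```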

Homework

1) What is the difference between machine epsilon and realmin (the smallest positive normalized number, as returned by Matlab's realmin)?

2) In single precision, what will the following computations yield?

1 + ε

1 + realmin

realmin + ε

(2 + ε) - 1

(-1 + ε) + 2

3) Find realmax in single precision, both by hand, and by using Matlab.

Sources

Ward Cheney and David Kincaid. Numerical Mathematics and Computing. Belmont, CA: Thomson, 2004. (Source of the proof.)

IEEE 754-1985
