Floating point/Lesson Four
Single Precision
Most computers store floating-point numbers according to a standard set by the IEEE (IEEE 754). Because the format is the same from machine to machine, computer scientists can concentrate on the error in a computation instead of learning how each particular computer operates.
Single precision numbers are numbers stored according to these rules:
1) Numbers are converted to the form $\pm k \times 2^{m}$, where $k$ is a binary number of the form $1.f$ and $m$ is the exponent. Thus $1 \le k < 2$, and the leading 1 does not need to be stored because it is always assumed.
2) The number $k$ is rounded so that it contains only 24 bits, counting the leading 1.
3) The exponent can be from -126 to +127. It is stored in the computer as m + 127.
4) The computer stores the following:
a) The sign bit (1 if negative, 0 if positive)
b) The biased exponent (8 bits, 00000001 to 11111110, signifying 1 to 254, but actually representing -126 to 127).
c) The actual number. This section is called the mantissa. Because the number is assumed to have a 1 at the beginning, the 1 is not stored, but the next 23 digits are. The 1 is called the hidden bit.
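The radix point (here, the binary point) sits immediately after the hidden bit, which in the stored layout places it at the boundary between the exponent field and the mantissa field. See the IEEE Single Precision picture.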
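The storage rules above can be sketched in Matlab by pulling the three fields back out of a stored number. This is only an illustration: the example value -6.5 and the variable names are assumptions chosen here, not part of the lesson.

```matlab
% Sketch: split a single precision value into its three stored fields.
x    = single(-6.5);                  % example value: -6.5 = -1.625 * 2^2
bits = typecast(x, 'uint32');         % reinterpret the same 32 bits as an integer

signBit  = bitshift(bits, -31);                        % top bit
biasedE  = bitand(bitshift(bits, -23), uint32(255));   % next 8 bits
fraction = bitand(bits, uint32(2^23 - 1));             % low 23 bits (hidden 1 not stored)

m = double(biasedE) - 127;            % remove the bias to get the true exponent
k = 1 + double(fraction) / 2^23;      % put the hidden bit back in front

fprintf('sign = %d, biased exponent = %d (m = %d), k = %g\n', ...
        signBit, biasedE, m, k);

value = (-1)^double(signBit) * k * 2^m;   % reassemble +/- k * 2^m
```

For -6.5 the fields come out as sign 1, biased exponent 129 (so m = 2) and k = 1.625, and reassembling them gives -6.5 again.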
Single Precision Example
The number 1 would be stored as follows (1 is equal to $1 \times 2^{0}$):
Sign | Exponent | Mantissa |
---|---|---|
0 | 01111111 | 00000000000000000000000 |
positive number | exponent equal to 127 (= 0 + 127) | the first '1' is not included; the next 23 digits are stored (here all 0) |
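As a quick check of this example, the bit pattern can be turned back into a number in Matlab, and compared with the bits Matlab itself stores for 1. The variable names below are assumptions made for this sketch.

```matlab
% Sketch: rebuild the value 1 from the bit pattern shown above.
pattern = ['0' '01111111' '00000000000000000000000'];  % sign, exponent, mantissa
bits    = uint32(bin2dec(pattern));                     % 32-bit integer with those bits
x       = typecast(bits, 'single');                     % reinterpret as single precision
disp(x == 1)                                            % true: the pattern encodes 1

% Going the other way: the bits stored for single(1).
stored = dec2bin(typecast(single(1), 'uint32'), 32);
disp(strcmp(stored, pattern))                           % true: same 32 bits
```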
Machine Epsilon, Single Precision
Machine epsilon, as a reminder, is the distance from 1 to the next larger machine number, so 1 + ε is the first stored value that differs from 1. There are 23 bits available in the mantissa, as the example above shows. Thus, as soon as $2^{-23}$ is added, another '1' will be stored; namely, the mantissa will read 00000000000000000000001. Thus, $\varepsilon = 2^{-23}$.
Matlab Code Example
Here, we demonstrate that machine epsilon is indeed $2^{-23}$. The command 'single' forces Matlab to store a number in single precision.
>> single ( 1 - single ( 1 + 2^(-25) ) )
ans = 0
>> single ( 1 - single ( 1 + 2^(-24) ) )
ans = 0
>> single ( 1 - single ( 1 + 2^(-23) ) )
ans = -1.1921e-007
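The same value can also be found without knowing it in advance, by halving a candidate until adding half of it no longer changes 1. This is a minimal sketch; the loop and the variable name `e` are assumptions made here, while `eps('single')` is Matlab's built-in value for comparison.

```matlab
% Sketch: find machine epsilon for single precision by repeated halving.
e = single(1);
while single(1) + e/2 > single(1)   % stop once adding half of e no longer changes 1
    e = e / 2;
end
fprintf('found %g, 2^-23 = %g, eps(''single'') = %g\n', e, 2^-23, eps('single'));
```

The loop stops at $2^{-23}$ because $1 + 2^{-24}$ lies exactly halfway between two machine numbers and is rounded back to 1.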
Double Precision
Double precision operates in the same manner as single precision, except that more space is allocated to each number. Again there is 1 sign bit, but there are now 11 bits for the exponent and 52 bits for the mantissa, for 64 bits in total. The exponent is biased again, this time by 1023. Machine epsilon is $2^{-52}$.
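Matlab works in double precision by default, so the single precision experiment can be repeated without any conversion. The sketch below also reads the biased exponent of 1 from its hexadecimal form; the variable names are assumptions made for this illustration.

```matlab
% Sketch: machine epsilon and the exponent bias in double precision.
disp((1 + 2^-53) - 1 == 0)       % true: the sum rounds back to 1
disp((1 + 2^-52) - 1 == 2^-52)   % true: the sum is the next machine number after 1
disp(eps == 2^-52)               % true: Matlab's eps is double precision epsilon

hexstr = num2hex(1);             % the 64 stored bits of 1, as 16 hex digits
biased = hex2dec(hexstr(1:3));   % first 12 bits: sign bit (0 here) and 11 exponent bits
fprintf('%s -> biased exponent %d\n', hexstr, biased);
```

Since 1 is equal to $1 \times 2^{0}$, the biased exponent read this way is the bias itself, 1023.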
Interesting Proof
Here, we prove that the relative error of storing a number in single precision (indeed, in any precision) with rounding is at most machine epsilon divided by 2. Chopping (instead of rounding) can produce a relative error of up to ε.
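Denote by $x_-$ the machine number just below the actual number $x$, and by $x_+$ the machine number just above it. Here the number is normalized as $0.1\ldots$ instead of $1.f$, so the exponent $k$ is one larger than the exponent $m$ used earlier. In single precision, $x_- = (0.1b_2b_3\ldots b_{24})_2 \times 2^{k}$ and $x_+ = [(0.1b_2b_3\ldots b_{24})_2 + 2^{-24}] \times 2^{k}$.
Assume without loss of generality that $x$ is closer to $x_-$. Then $|x - x_-| \le \frac{1}{2}|x_+ - x_-| = 2^{-25+k}$.
Then, since $|x| \ge (0.1)_2 \times 2^{k} = 2^{k-1}$, we have $\frac{|x - x_-|}{|x|} \le \frac{2^{-25+k}}{2^{k-1}} = 2^{-24} = \frac{\varepsilon}{2}$.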
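The bound can also be checked numerically. In the sketch below, double precision values stand in for the "actual" numbers; the sample size, the range of the test values, and the variable names are all assumptions made for this illustration.

```matlab
% Sketch: measure the relative error of rounding doubles to single precision.
x      = 1 + 1e6 * rand(1, 10000);   % hypothetical sample of positive test values
xs     = double(single(x));          % round to single, then widen again to compare
relerr = abs(xs - x) ./ abs(x);      % relative error of each stored value

fprintf('largest relative error: %g\n', max(relerr));
fprintf('bound eps/2 = 2^-24:    %g\n', 2^-24);
```

Every run uses different random values, but the largest relative error should never exceed $2^{-24}$.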
Special Numbers
The exponent fields 0 and 255 are reserved for special values: zero (including −0), denormalized numbers, infinity, and "not a number". The table lists these alongside a few ordinary values for comparison:
Type | Sign | Exponent | Significand | Value |
---|---|---|---|---|
Zero | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0 |
One | 0 | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0 |
Minus One | 1 | 0111 1111 | 000 0000 0000 0000 0000 0000 | −1.0 |
Smallest denormalized number | * | 0000 0000 | 000 0000 0000 0000 0000 0001 | $\pm 2^{-23} \times 2^{-126} = \pm 2^{-149} \approx \pm 1.4 \times 10^{-45}$ |
"Middle" denormalized number | * | 0000 0000 | 100 0000 0000 0000 0000 0000 | $\pm 2^{-1} \times 2^{-126} = \pm 2^{-127} \approx \pm 5.88 \times 10^{-39}$ |
Largest denormalized number | * | 0000 0000 | 111 1111 1111 1111 1111 1111 | $\pm (1 - 2^{-23}) \times 2^{-126} \approx \pm 1.18 \times 10^{-38}$ |
Smallest normalized number | * | 0000 0001 | 000 0000 0000 0000 0000 0000 | $\pm 2^{-126} \approx \pm 1.18 \times 10^{-38}$ |
Largest normalized number | * | 1111 1110 | 111 1111 1111 1111 1111 1111 | $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$ |
Positive infinity | 0 | 1111 1111 | 000 0000 0000 0000 0000 0000 | +∞ |
Negative infinity | 1 | 1111 1111 | 000 0000 0000 0000 0000 0000 | −∞ |
Not a number | * | 1111 1111 | non zero | NaN |

\* Sign bit can be either 0 or 1.
(copied from IEEE 754-1985)
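Several rows of the table can be reproduced in Matlab by writing the bit pattern in hexadecimal and reinterpreting it as a single. This is a sketch only: the helper `fromHex` is a hypothetical name introduced here, and the hexadecimal strings are the table's sign, exponent and significand bits with sign 0 (or 1 for negative infinity).

```matlab
% Sketch: build a few rows of the table from their bit patterns (written in hex).
fromHex = @(h) typecast(uint32(hex2dec(h)), 'single');   % hypothetical helper

disp(fromHex('00000001') == single(2^-149))                % smallest denormalized number
disp(fromHex('00800000') == single(2^-126))                % smallest normalized number
disp(fromHex('7F7FFFFF') == single((2 - 2^-23) * 2^127))   % largest normalized number
disp(isinf(fromHex('7F800000')))                           % positive infinity
disp(isinf(fromHex('FF800000')))                           % negative infinity
disp(isnan(fromHex('7FC00000')))                           % exponent 255, nonzero significand
```

Each line should display 1 (true), matching the corresponding row of the table.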
Homework
1) What is the difference between machine epsilon and realmin?
2) In single precision, what will the following computations yield?
1 + ε
1 + realmin
realmin + ε
(2 + ε) - 1
(-1 + ε) + 2
3) Find realmax in single precision, both by hand, and by using Matlab.
Sources
Ward Cheney and David Kincaid. Numerical Mathematics and Computing. Belmont, CA: Thomson, 2004. (Source for the lesson, including the proof.)