# Finite Arithmetic

This course is still very much in the rough!

## Summary

This course belongs to the track Numerical Algorithms in the Department of Scientific Computing in the School of Computer Science.

In this course, students will learn how numbers are represented on modern-day computers, what the limits and problems of these representations are, and how they may be overcome.

## Representation

On binary computers, whole numbers (Integers) are stored in bit arrays of fixed size. These arrays contain the binary representation of the decimal number.

The decimal number 37, for instance, can be converted to binary and stored in an 8-bit variable, padded with leading zeros:

${\displaystyle 37_{10}\rightarrow 100101_{2}\rightarrow 00100101}$

The range of non-negative numbers that can be stored this way in ${\displaystyle n}$ bits is ${\displaystyle 0\dots 2^{n}-1}$.
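The conversion and the range limit can be tried out with a minimal Python sketch. The helper name `to_bits` is made up for this illustration and is not part of the course material.

```python
def to_bits(value: int, n_bits: int = 8) -> str:
    """Return the n_bits-wide binary representation of an unsigned integer."""
    if not 0 <= value <= 2**n_bits - 1:
        raise ValueError(f"{value} does not fit in {n_bits} unsigned bits")
    # Format as binary, padded with leading zeros to n_bits digits.
    return format(value, f"0{n_bits}b")


bits_37 = to_bits(37)    # '00100101'
max_8bit = 2**8 - 1      # largest value that fits in 8 bits: 255
```

Calling `to_bits(256)` raises a `ValueError`, since 256 lies outside the range ${\displaystyle 0\dots 2^{8}-1}$.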

This trivial scheme works rather well, since binary arrays are integer by nature. However, mathematics is usually not restricted to integer arithmetic, so a different representation has to be found for real-valued numbers, such as ${\displaystyle 0.1}$, ${\displaystyle {\sqrt {2}}}$, ${\displaystyle \pi }$ and the like.

The most trivial approach for real-valued numbers would be to pack them into two integers, one representing the integer part, the other the fraction.

${\displaystyle 3.7_{10}\rightarrow a.b\rightarrow 11_{2}.10110011_{2}}$

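Notice that the fractional part is not ${\displaystyle 111_{2}=7_{10}}$, but ${\displaystyle 0.10110011_{2}\approx 0.7_{10}}$ (exactly ${\displaystyle 0.69921875_{10}}$, since ${\displaystyle 0.7_{10}}$ has no finite binary expansion). The ${\displaystyle k}$th digit to the right of the "." in binary has the weight ${\displaystyle 2^{-k}}$: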

| Binary | Decimal |
| --- | --- |
| ${\displaystyle 0.1_{2}}$ | ${\displaystyle 0.5_{10}}$ |
| ${\displaystyle 0.01_{2}}$ | ${\displaystyle 0.25_{10}}$ |
| ${\displaystyle 0.001_{2}}$ | ${\displaystyle 0.125_{10}}$ |
| ${\displaystyle 0.0001_{2}}$ | ${\displaystyle 0.0625_{10}}$ |
| ${\displaystyle 0.\underbrace {00\dots 0} _{\times k}1_{2}}$ | ${\displaystyle 2^{-(k+1)}}$ |
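The binary digits of a fraction can be generated by repeated doubling: each doubling moves the next binary digit in front of the point. The following is only a sketch. The function names `fraction_bits` and `bits_value` are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the illustration itself is free of rounding.

```python
from fractions import Fraction


def fraction_bits(x: Fraction, n_bits: int) -> str:
    """First n_bits binary digits of 0 <= x < 1 (truncated, not rounded)."""
    digits = []
    for _ in range(n_bits):
        x *= 2
        bit = int(x >= 1)      # the digit that moved in front of the point
        digits.append(str(bit))
        x -= bit
    return "".join(digits)


def bits_value(bits: str) -> Fraction:
    """Value of 0.bits in binary: the k-th digit has weight 2**-k."""
    return sum((Fraction(int(b), 2**k) for k, b in enumerate(bits, start=1)),
               Fraction(0))


bits_07 = fraction_bits(Fraction(7, 10), 8)   # '10110011'
value_07 = bits_value(bits_07)                # Fraction(179, 256) = 0.69921875
```

The eight digits reproduce the fractional part used in the 3.7 example, and converting them back shows that they only approximate 0.7.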

Using such a representation, generally called a fixed-point representation, results in very simple operations (addition, multiplication, division, etc.), yet its range would be limited to

${\displaystyle 2^{-B_{b}}\dots 2^{B_{a}}-2^{-B_{b}}}$

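where ${\displaystyle B_{a}}$ and ${\displaystyle B_{b}}$ are the number of bits in the integer and fractional parts respectively. This is not a very wide range: with 16 bits each for the integer and fractional parts, for example, it only spans about ${\displaystyle 1.5\times 10^{-5}}$ to ${\displaystyle 65536}$.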
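A fixed-point number can be stored as a single integer holding the value scaled by ${\displaystyle 2^{B_{b}}}$. The sketch below assumes an unsigned format with ${\displaystyle B_{a}=B_{b}=8}$. The names `to_fixed` and `from_fixed` are illustrative only, and rounding to the nearest representable value is one possible choice among several.

```python
B_A, B_B = 8, 8   # bits in the integer and fractional parts (assumed values)


def to_fixed(x: float) -> int:
    """Encode x as an unsigned fixed-point integer with B_B fractional bits."""
    raw = round(x * 2**B_B)
    if not 0 <= raw < 2**(B_A + B_B):
        raise OverflowError(f"{x} is outside the fixed-point range")
    return raw


def from_fixed(raw: int) -> float:
    """Decode the stored integer back into a real value."""
    return raw / 2**B_B


smallest = from_fixed(1)                        # 2**-B_B
largest = from_fixed(2**(B_A + B_B) - 1)        # 2**B_A - 2**-B_B
stored = from_fixed(to_fixed(3.7))              # 3.69921875
```

Addition of two encoded values is plain integer addition, which is why the operations are so simple. The price is the narrow range between `smallest` and `largest`.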

Furthermore, for large numbers, we are carrying around a lot of lower-order bits which we probably don't need since, as we will see later, we are usually interested in relative accuracy (i.e. the first few digits of a number) as opposed to the absolute accuracy of the last digits of the fractional part. Likewise, for small numbers of the order of ${\displaystyle 2^{-B_{b}}}$, only a few significant digits are available, although we are carrying around many leading zeros that we don't really need.

Instead of using two numbers additively, as in the example above, we could represent our real numbers as the product of a fixed-precision number and a power of 2:

${\displaystyle a\times 2^{b}}$

This is equivalent to shifting ${\displaystyle a}$ by ${\displaystyle b}$ bits. The "." in ${\displaystyle a}$ can therefore be set arbitrarily. For all practical purposes, we set it after the first digit.

${\displaystyle 3.7_{10}\rightarrow 1.110110011_{2}\times 2^{1}=11.10110011_{2}}$

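This is the concept behind modern floating-point numbers as described in IEEE 754. Here ${\displaystyle B_{a}}$ and ${\displaystyle B_{b}}$ no longer count integer and fractional bits: ${\displaystyle B_{a}}$ is the number of bits in ${\displaystyle a}$, and ${\displaystyle B_{b}}$ determines the exponent range, with ${\displaystyle b}$ running from ${\displaystyle -2^{B_{b}}}$ to ${\displaystyle 2^{B_{b}}}$. The range of numbers that can be represented is then from ${\displaystyle 1\times 2^{-2^{B_{b}}}}$ to ${\displaystyle (2-2^{-(B_{a}-1)})\times 2^{2^{B_{b}}}}$.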
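The decomposition of a number into ${\displaystyle a\times 2^{b}}$ with the "." after the first digit can be sketched with Python's `math.frexp` and `math.ldexp`. The helpers `normalize` and `truncate_mantissa` are hypothetical names, and the 10-bit mantissa is chosen only to match the 3.7 example.

```python
import math


def normalize(x: float):
    """Return (a, b) with x == a * 2**b and 1 <= a < 2, for finite x > 0."""
    m, e = math.frexp(x)       # frexp gives 0.5 <= m < 1 with x == m * 2**e
    return 2.0 * m, e - 1      # move the point one place: 1 <= a < 2


def truncate_mantissa(a: float, mantissa_bits: int) -> float:
    """Keep only mantissa_bits binary digits of a (one before the point)."""
    scale = 2 ** (mantissa_bits - 1)
    return math.floor(a * scale) / scale


a, b = normalize(3.7)                 # b is 1, a lies between 1 and 2
a10 = truncate_mantissa(a, 10)        # 1.110110011 in binary
approx = math.ldexp(a10, b)           # a10 * 2**b = 3.69921875
```

Multiplying by a power of 2 only changes the exponent, so `math.ldexp(a, b)` recovers the original double exactly. The loss of accuracy comes only from cutting the mantissa to a fixed number of bits.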

TODO: Show how sum, multiplication and division are computed.

### Exercises

• The smallest and biggest numbers representable by ${\displaystyle a\times 2^{b}}$ are given as ${\displaystyle 1\times 2^{-2^{B_{b}}}}$ and ${\displaystyle (2-2^{-(B_{a}-1)})\times 2^{2^{B_{b}}}}$ respectively. Show what these numbers would look like for ${\displaystyle B_{a}=B_{b}=8}$.

• Addition in finite arithmetic modulo 5: The addition table in this arithmetic can be written as a bordered square too, though addition can also be performed by adding in the ordinary way and then subtracting an appropriate multiple of 5. For example, to find 4+3+4+2 we add as in ordinary arithmetic, which results in 13, and then subtract 10, giving ${\displaystyle 4+3+4+2\equiv 3{\pmod {5}}}$.
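As a small sketch of the modulo 5 arithmetic described above, the bordered square can be built as a nested list, and Python's `%` operator performs the "subtract a multiple of 5" step. The variable names are illustrative.

```python
MODULUS = 5

# addition_table[a][b] holds (a + b) mod 5: the interior of the bordered square.
addition_table = [[(a + b) % MODULUS for b in range(MODULUS)]
                  for a in range(MODULUS)]

# Ordinary sum first (13), then reduce by a multiple of 5 (13 - 10 = 3).
total = sum([4, 3, 4, 2]) % MODULUS
```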


## IEEE 754 Floating Point Numbers

Show the representation for single and double precision, including the sizes of mantissa and exponent.

First bit of mantissa is always 1, normalized numbers, exponent is biased for easier arithmetic.

• Special values, such as 0, NaN and Inf, have special representations.
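The bit layout of a single-precision number (1 sign bit, 8 exponent bits, 23 fraction bits, exponent bias 127) can be inspected with Python's `struct` module. This is a minimal sketch, and the function name `decode_float32` is made up for the illustration.

```python
import struct


def decode_float32(x: float):
    """Split the IEEE 754 single-precision encoding of x into
    (sign, biased exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF    # stored with a bias of 127
    fraction = raw & 0x7FFFFF        # the leading 1 of the mantissa is not stored
    return sign, exponent, fraction


one = decode_float32(1.0)              # (0, 127, 0): 1.0 * 2**(127 - 127)
zero = decode_float32(0.0)             # (0, 0, 0)
inf = decode_float32(float("inf"))     # (0, 255, 0)
nan = decode_float32(float("nan"))     # exponent 255, non-zero fraction
```

The all-zeros and all-ones exponent patterns are the ones reserved for the special values, which is why 0, Inf and NaN stand out in the decoded fields.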

Explain different rounding modes.

## Useful/Important Values

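Concept of ${\displaystyle \varepsilon }$. Definition as the smallest number that still changes 1 when added to it in floating-point arithmetic, ${\displaystyle \varepsilon =\min\{x>0:1+x>1\}}$, and as the spacing between adjacent numbers in the interval from 1.0 to 2.0.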

Realmin and Realmax

Show how these values can be computed in Matlab, C and Pascal
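As a stand-in until the Matlab, C and Pascal versions are written, here is a minimal sketch in Python, which is not one of the three languages named above. It assumes the usual IEEE 754 double-precision `float`. The function name `machine_epsilon` is hypothetical, and it uses the "spacing between 1.0 and the next number" convention for ${\displaystyle \varepsilon }$.

```python
import sys


def machine_epsilon() -> float:
    """Halve eps until adding half of it to 1.0 no longer changes 1.0."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps


eps = machine_epsilon()          # spacing between 1.0 and the next double
realmin = sys.float_info.min     # smallest positive normalized double
realmax = sys.float_info.max     # largest finite double
```

For double precision the loop stops at ${\displaystyle 2^{-52}}$, which agrees with `sys.float_info.epsilon`.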

Note on the problem with extra bits (80-bit extended precision) on Intel Pentiums.

## Truncation

Basic explanation of the problem with an example, preferably with a figure.

Explain how it can be avoided by sorting or with Kahan summation.
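The effect and both remedies can be sketched in Python as follows. The test data (one large value followed by many tiny ones) is made up for the illustration. The naive loop is written out explicitly because a language's built-in summation may already compensate internally.

```python
def naive_sum(values):
    """Plain left-to-right summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan (compensated) summation: carry the lost low-order bits along."""
    total = 0.0
    compensation = 0.0              # running estimate of the rounding error
    for v in values:
        y = v - compensation        # correct the next term by the known error
        t = total + y               # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was lost
        total = t
    return total


values = [1.0] + [1e-16] * 100_000

lost = naive_sum(values)                 # every 1e-16 is rounded away
by_sorting = naive_sum(sorted(values))   # small terms accumulate first
by_kahan = kahan_sum(values)
```

In this example each term of size ${\displaystyle 10^{-16}}$ is below half the spacing of doubles near 1.0, so the naive sum never moves away from 1.0. Sorting in ascending order and Kahan summation both recover a result close to ${\displaystyle 1+10^{-11}}$.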

## Cancellation

Give an example illustrating the problem.

There is no general trick: problems need to be reformulated to avoid subtracting numbers of similar magnitude.

Show some examples of how this can be done.
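One common textbook illustration, chosen here as an assumption rather than taken from the course, is ${\displaystyle {\sqrt {x+1}}-{\sqrt {x}}}$ for large ${\displaystyle x}$. Multiplying and dividing by ${\displaystyle {\sqrt {x+1}}+{\sqrt {x}}}$ gives the algebraically equal form ${\displaystyle 1/({\sqrt {x+1}}+{\sqrt {x}})}$, which contains no subtraction. The function names in this Python sketch are hypothetical.

```python
import math


def naive_difference(x: float) -> float:
    """sqrt(x + 1) - sqrt(x): subtracts two nearly equal numbers."""
    return math.sqrt(x + 1.0) - math.sqrt(x)


def stable_difference(x: float) -> float:
    """Same quantity rewritten as 1 / (sqrt(x + 1) + sqrt(x))."""
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))


x = 1e12
naive = naive_difference(x)
stable = stable_difference(x)
# Both should be close to 1 / (2 * sqrt(x)) = 5e-7.
relative_gap = abs(naive - stable) / stable
```

In double precision the two results typically agree in only the first few digits, because the leading digits of the two square roots cancel in the naive form and leave mostly rounding error behind.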

## Exercises

• Compute eps for some special representation, realmax, realmin.
• How could a*2^n be implemented efficiently?
• Rewrite some equation to avoid truncation and/or cancellation.