Least squares

We use the method of least squares when we have a series of measurements (x_i, y_i) with i = 1, 2, ..., n (that is, we measured a set of values we called y, each of which depends on the value of a variable we called x), and we know that the measured points (x_i, y_i), when plotted on a plane, should in theory lie on a straight line.

Because of measurement errors, the measured points will most probably not lie exactly on a straight line, but they will be approximately aligned. We are interested in finding the straight line that is most similar to the measured points. So we want to find a function $f(x)=ax+b$ (that is, a straight line) such that $f(x_{i})\approx y_{i}$ for every i. Once we find the appropriate a and b, we have found our function.

How do we express "most similar" mathematically?

The criterion we will follow to find a and b is to minimize the error $f(x_{i})-y_{i}$. Defined this way, the error is the signed vertical distance between each measured point and the straight line we are seeking. To minimize the total error, a first attempt is to minimize the sum of all the errors:

$$E=\sum _{i=1}^{n}(f(x_{i})-y_{i}).$$

But this formula for the total error has a problem. Suppose we have two points, the value $y_{i}$ of the first lying far below the line (so with a large positive error) and the value $y_{j}$ of the second lying equally far above the line (so with a large negative error). When we sum both errors they cancel out, giving a total error of $E=0$ even though neither point is close to the line. Evidently, that is not what we are seeking. The solution to this problem is to change our formula for the total error and try this one:

$$E=\sum _{i=1}^{n}(f(x_{i})-y_{i})^{2}.$$

We do not sum the individual errors, but their squares. This way, every term is non-negative, so errors can no longer cancel each other, and the sum can only be small if every individual error is small.
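The cancellation problem can be checked numerically. The following is a minimal sketch using made-up data: the two points and the candidate line $f(x)=x$ are illustrative assumptions, chosen so that one point lies below the line and the other lies equally far above it.

```python
# Hypothetical candidate line f(x) = a*x + b with a = 1, b = 0.
def f(x, a=1.0, b=0.0):
    return a * x + b

# Made-up measurements (x_i, y_i): the first is below the line,
# the second is above it by the same vertical distance.
points = [(1.0, -2.0), (2.0, 5.0)]

# Signed errors f(x_i) - y_i.
errors = [f(x) - y for x, y in points]

sum_of_errors = sum(errors)                  # signed errors cancel
sum_of_squares = sum(e ** 2 for e in errors) # squared errors do not

print(errors)          # [3.0, -3.0]
print(sum_of_errors)   # 0.0
print(sum_of_squares)  # 18.0
```

The plain sum reports no error at all for a line that misses both points by 3 units, while the sum of squares does not.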

Why don't we use the absolute value of the errors, $|f(x_{i})-y_{i}|$? Because, as we will see, if we use the squares of the errors there is an exact formula for the a and b that minimize the total error, whereas with absolute values there is no such formula.

This is why the method is called "least squares": it finds the line for which the sum of the squares of the individual errors is the least.

Note that we are assuming that the x values are exact and that all the errors are in the y values. In most cases this may not be true, and there can be errors in both the x and the y values. There are other methods that take this into account, but the least squares method is the simplest and the most widely used.

What are a and b, then?

The values for a and b that minimize the total error E are:

$$a={\frac {n\sum _{i=1}^{n}(x_{i}y_{i})-\sum _{i=1}^{n}x_{i}\sum _{i=1}^{n}y_{i}}{n\sum _{i=1}^{n}x_{i}^{2}-(\sum _{i=1}^{n}x_{i})^{2}}}$$

and

$$b={\frac {\sum _{i=1}^{n}x_{i}^{2}\sum _{i=1}^{n}y_{i}-\sum _{i=1}^{n}x_{i}\sum _{i=1}^{n}(x_{i}y_{i})}{n\sum _{i=1}^{n}x_{i}^{2}-(\sum _{i=1}^{n}x_{i})^{2}}}$$

The most convenient way to calculate these values is to tabulate your data in columns, namely $x_{i}$, $y_{i}$, $x_{i}^{2}$ and $x_{i}y_{i}$. After calculating the squares and products for each pair $(x_{i},y_{i})$, we sum each column, obtaining $\sum _{i=1}^{n}x_{i}$, $\sum _{i=1}^{n}y_{i}$, $\sum _{i=1}^{n}x_{i}^{2}$ and $\sum _{i=1}^{n}(x_{i}y_{i})$. With these values calculated, we only have to substitute them into the formulas for a and b and we are done.
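The column-sum procedure translates directly into a short program. This is a minimal sketch, not a reference implementation: the function name `fit_line` and the sample data are assumptions made for illustration, and the data were chosen to lie exactly on $y=2x+1$ so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line f(x) = a*x + b."""
    n = len(xs)
    # Column sums, as in the tabulation described above.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Common denominator of both formulas.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        # Happens when all x values coincide (or there are fewer than two points).
        raise ValueError("the x values do not determine a unique line")

    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return a, b


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

The check on the denominator is an addition of this sketch: the formulas are undefined when every $x_{i}$ is the same, because then no single non-vertical line is singled out by the data.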

Of course, all the sums and products can be computed automatically using a spreadsheet such as Excel. The steps are explained, with an example, in Least squares/Calculation using Excel.

How to find a and b

To find a and b we have to minimize the two-variable function E discussed above. To minimize a function of N variables, as you will see when you study partial derivatives, you look for the point(s) where all of its partial derivatives are 0.
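Written out for our E with $f(x_{i})=ax_{i}+b$, this step looks as follows (a sketch of the standard calculation, using only the symbols already defined):

$$\frac{\partial E}{\partial a}=2\sum _{i=1}^{n}(ax_{i}+b-y_{i})\,x_{i}=0,\qquad \frac{\partial E}{\partial b}=2\sum _{i=1}^{n}(ax_{i}+b-y_{i})=0.$$

Dividing by 2 and collecting the terms in a and b:

$$a\sum _{i=1}^{n}x_{i}^{2}+b\sum _{i=1}^{n}x_{i}=\sum _{i=1}^{n}(x_{i}y_{i}),\qquad a\sum _{i=1}^{n}x_{i}+b\,n=\sum _{i=1}^{n}y_{i}.$$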

Given the form of E, this yields a system of two linear equations in the two unknowns a and b. The system is easily solved using Cramer's rule (or, equivalently, since the coefficient matrix is 2x2, by inverting it, which is trivial). The solution is given by the expressions for a and b above.
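As an illustration, the sketch below applies Cramer's rule to that 2x2 system. It assumes the system has the form $a\sum x_{i}^{2}+b\sum x_{i}=\sum x_{i}y_{i}$ and $a\sum x_{i}+bn=\sum y_{i}$, which is what setting the two partial derivatives of E to 0 produces. The helper name `solve_2x2_cramer` and the data are hypothetical.

```python
def solve_2x2_cramer(m11, m12, m21, m22, r1, r2):
    """Solve  m11*u + m12*v = r1,  m21*u + m22*v = r2  by Cramer's rule."""
    det = m11 * m22 - m12 * m21
    if det == 0:
        raise ValueError("the system has no unique solution")
    u = (r1 * m22 - m12 * r2) / det  # first column replaced by the right-hand side
    v = (m11 * r2 - r1 * m21) / det  # second column replaced by the right-hand side
    return u, v


# Made-up measurements.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 2.0, 4.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Assumed system:  a*sum_x2 + b*sum_x = sum_xy
#                  a*sum_x  + b*n     = sum_y
a, b = solve_2x2_cramer(sum_x2, sum_x, sum_x, n, sum_xy, sum_y)
print(a, b)
```

The determinant of the coefficient matrix is $n\sum x_{i}^{2}-(\sum x_{i})^{2}$, the same denominator that appears in both closed-form expressions, which is why the two routes agree.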