Skip to content

A line in the sand

The other day, someone (who shall remain nameless to protect their identity) asked me the following:

I have a set of data points, (X,Y), and I know they are related through a simple function:

Linear Equation

How do I determine m, and b?

Now, there are many ways to determine values for m and b; but, right now I'd like to discuss Least Sqaures fitting.

The original idea for least squares fitting came from Carl Gauss, and it stems from the idea that we'd like to minimize a measurement of the error in our result. In other words, we'd like to find m and b such that error is minimized, or as close to 0 as we can get:

Error Equation

Now, it turns out, minimizing the above function doesn't work really well. In particular, we could pick a and b such that the error is Error Equation, and that wouldn't leave us with a good value for m or b. We, therefore, choose to minimize the square of the error function:

Error Equation

Now, as we all remember from high school calculus (you took calculus in high school didn't you), we can minimize the error function, by taking the first derivative and set it equal to zero. Since, we're trying to minimize with respect to m and b, we'll have to differentiate with respect to both variables:

Partial b

Partial m

Now, if we rejigger the above formulas, we can find:

Simplified Partial b

Simplified Partial m

At this point, we could solve for m and b and be done with the whole thing, but there is a simplification we can make. If we first define the following:

x variance

y variance

yx covariance

Then through the power of algebra, we can simplify the above equations and solve for m and b. I'll leave the exercise to the reader, but the solution then ends up:

At this point, you might be saying, "That's all well and good Jeremy, but show me the code." So, without further ado, here's the code that can perform least squares linear regression in C:

 
////////////////////////////////////////
// Given vectors x and y, finds m and b via least squares
//  y = m*x + b
void
LinearRegression(unsigned N, float *x, float *y, float *outM, float *outB)
{
    unsigned i;
    float    xSum, ySum, ss_xx, ss_xy;
    xSum = ySum = ss_xx = ss_xy = 0;
    for(i = 0; i < N; ++i)
    {
        xSum += x[i];
        ySum += y[i];
        ss_xx += x[i] * x[i];
        ss_xy += x[i] * y[i];
    }
    ss_xx -= xSum * xSum / N;
    ss_xy -= xSum * ySum / N;
    *outM = ss_xy / ss_xx;
    *outB = (ySum - *outM * xSum) / N;
}
 

You'll notice that I juggled around the calculations of averages of x and y. Depending on data set size and precision, this might introduce some bugs. You might want to use the less optimal, but more correct:

 
////////////////////////////////////////
// Given vectors x and y, finds m and b via least squares
//  y = m*x + b
void
LinearRegression(unsigned N, float *x, float *y, float *outM, float *outB)
{
    unsigned i;
    float    xMean, yMean, ss_xx, ss_xy;
    xMean = yMean = ss_xx = ss_xy = 0;
    for(i = 0; i < N; ++i)
    {
        xMean += x[i] / N;
        yMean += y[i] / N;
        ss_xx += x[i] * x[i];
        ss_xy += x[i] * y[i];
    }
    ss_xx -= N * xMean * xMean;
    ss_xy -= N * xMean * yMean;
    *outM = ss_xy / ss_xx;
    *outB = yMean - *outM * xMean;
}
 



I'm just noticing that it's been forever since I posted. I apologize for that. I'll try to get neat stuff to say more often.

Post a Comment

Your email is never published nor shared. Required fields are marked *