Wednesday, July 25, 2012

When regression goes bad


There is always a temptation, when one is using regression to fit a data set, to add a few more parameters to make the fit better. Don’t give in to the temptation!

Temptation
In the case of polynomial regression, this might be the delectable temptation of adding one more degree to the polynomial. If y(x) is approximated well by a weighted sum of 1, x², and x³, then (one could reasonably assume) adding x⁴ and x⁵ will only make the fit better, right? Technically, it can’t make the fit worse, and in most real cases the fit to the data will improve with every predictive term that is added.
If one is doing multiple linear regression, the same thing applies. Let’s say, for example, one is trying to determine a formula that will predict tomorrow’s temperature. It would be reasonable to use the temperature today and yesterday as predictors. It might reduce the residuals to add in today’s humidity and the wind speed. The temperature from surrounding areas might also improve the fit to historical data. The more parameters are added, the better the fit.
There is, of course, a law of diminishing returns. At some point, each additional parameter may only reduce the residual error by a minuscule amount. So, from the standpoint of simplicity alone, it makes sense to eliminate those variables that don’t cause a significant improvement.
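To make that concrete, here is a minimal sketch in Python with NumPy, using a made-up noisy data set of my own (nothing to do with the example that follows). Because each polynomial model contains the previous one, the residual error can never increase as terms are added, but past a certain degree each new term buys almost nothing.

```python
import numpy as np

# Hypothetical noisy data, purely to illustrate diminishing returns.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 25)
y = np.exp(-x**2) + 0.05 * rng.standard_normal(x.size)

# Fit polynomials of increasing degree and report the residual sum of squares.
for degree in range(1, 9):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: residual sum of squares = {rss:.5f}")
```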
But there is another reason to be parsimonious. This reason is more than a matter of simple convenience. Adding additional parameters has the potential of making regression go bad!
A simple regression example
Let’s say that we wanted to use regression to fit a polynomial to a Lorentzian¹ function:

f(x) = 1 / (1 + x²)
This function is fairly well behaved, and is illustrated in the graph below.

Plot of the Lorentzian
Oh… there’s a question in the back. “Why would someone want to do polynomial regression to approximate this? Isn’t it easy enough just to compute the formula for the Lorentzian directly?” Yes, it is. I am using this just as an example. It gives me data that is nice and smooth, where we don’t have to worry about noise, and where we can be sure what the data between the data points looks like.
The plot above shows knot points at integer locations, -5, -4, -3 … 4, 5. We will use these eleven points as the data set that feeds the regression. In the graph below, we see what happens if we use a fourth-order polynomial to approximate the Lorentzian.

Fourth-order polynomial fit to the Lorentzian
While the overall shape isn’t horrible, the fourth-order polynomial, shown in red, is a bit disappointing. At zero, the polynomial predicts a value of 0.655, whereas it should be 1.000; elsewhere the error is not quite as bad.
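For anyone who wants to reproduce the fourth-order fit, here is a short sketch in Python with NumPy; the knot points and the Lorentzian come straight from the example above.

```python
import numpy as np

# Eleven knot points at the integers -5 .. 5, sampled from the Lorentzian 1/(1 + x^2).
x_knots = np.arange(-5, 6)
y_knots = 1.0 / (1.0 + x_knots**2)

# Least-squares fit of a fourth-order polynomial to the eleven points.
coeffs4 = np.polyfit(x_knots, y_knots, 4)

# The fitted polynomial at zero comes out to about 0.655, versus the true value of 1.000.
print(np.polyval(coeffs4, 0.0))
```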
What to do? Well, the natural response is to add more terms to the regression. Each term will bring the polynomial closer to the knot points. In fact, since there are eleven data points to go through, we know that we can find a tenth-order polynomial that will fit this exactly. Clearly this is a good goal, right? Trying to drive the residual error down to zero, right? Then we will have a polynomial that best matches our Lorentzian curve, right?
I should hope that I have done enough foreshadowing to suggest to the reader that this might not be the best course of action. And I hope that the repetitive use of the italicized word “right” at the end of each question was a further clue that something is about to go wrong.

A tenth-order polynomial fit to the Lorentzian
And wrong it did go. We see that the tenth-order polynomial did an absolutely fabulous job of approximating the Lorentzian at the knot points. It goes right through each of those points. We also see that in the region from -1 to +1 the fit² is not too bad. However, as we go further away from zero, the fit gets worse and worse as it oscillates out of control. Between 4 and 5, the polynomial traces out a curve that is just plain lousy. Linear interpolation would have been so much better. Something went bad, and it wasn’t last week’s lox in the back of the fridge.
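The tenth-order fit is just as easy to reproduce. Here is a sketch along the same lines: with eleven points and eleven coefficients, the least-squares fit passes through the knots (up to round-off), and sampling a fine grid between the knots shows how far it strays from the Lorentzian.

```python
import numpy as np

x_knots = np.arange(-5, 6)
y_knots = 1.0 / (1.0 + x_knots**2)

# With eleven points and eleven coefficients, the tenth-order least-squares fit
# interpolates the data: residuals at the knot points are essentially zero.
coeffs10 = np.polyfit(x_knots, y_knots, 10)
print(np.max(np.abs(y_knots - np.polyval(coeffs10, x_knots))))

# Error against the Lorentzian on a fine grid between the knots: the polynomial
# swings far away from the curve, worst out near the ends of the range.
x_fine = np.linspace(-5, 5, 1001)
print(np.max(np.abs(np.polyval(coeffs10, x_fine) - 1.0 / (1.0 + x_fine**2))))
```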
Let’s make this example a bit more practical. Often when we are doing regression, we are not interested in predicting what happened between sample points, but rather we ultimately want to predict what will happen outside the range. Interpolation is sometimes useful, but we often want extrapolation. In the weather prediction example, I want to know the weather tomorrow so I can plan a picnic. I don’t usually care about what the temperature was an hour ago.
If we look just outside the range in our example, things get really bad. No… not really bad. It gets really, really bad. I mean awful. I mean “Oh my God, we are going to crash into the Sun” bad. The graph below shows what happens if we extend the approximation out a little bit – just out to -8 and +8. Just a few points beyond where we have data. How bad could that be?

A tenth-order polynomial fit to the Lorentzian, extended to -8 to +8
The observant reader will note that the scale of the plot was changed just a tiny bit. The polynomial approximation at x=8 is -8669.7. I asked “how bad could that be?” Well, if this model is going to predict tomorrow’s temperature to be anything below eight thousand degrees below zero, I am going to give up my plans for a picnic and party like crazy tonight, because all the beer in my fridge will be frozen tomorrow.
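Pushing that same tenth-order fit past the data range reproduces the number quoted above; here is a sketch, again using nothing but the knot points from the example.

```python
import numpy as np

x_knots = np.arange(-5, 6)
y_knots = 1.0 / (1.0 + x_knots**2)
coeffs10 = np.polyfit(x_knots, y_knots, 10)

# Extrapolating only three units past the last knot point: the polynomial
# predicts roughly -8670, while the Lorentzian itself is about 0.015.
print(np.polyval(coeffs10, 8.0))   # about -8669.7
print(1.0 / (1.0 + 8.0**2))        # about 0.0154
```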
What went wrong?
In this example, there were two things that caused this regression to go horribly wrong. The first is that I tried to push regression farther than it should be pushed. Under no conditions should anyone use regression with that many parameters when they have only eleven data points³. People who do that should be sent away to have a picnic somewhere where it is eight thousand degrees below zero. I am going to make the bold statement that, even if you have 11,000 data points, you should be very careful about using eleven parameters in your regression.
The second thing that went wrong is that the mathematical model used to fit the data was not so good. The Lorentzian function does something that polynomials are just not good at – it has an asymptote. Polynomials just aren’t good at asymptotes. The only polynomial that comes close to having a horizontal asymptote at zero is the polynomial that is identically zero everywhere. And that polynomial is just plain boring. Even if we were a bit more reasonable about the order of the polynomial that we used, we would never get a good approximation of the Lorentzian out in the tails by using polynomials.
This underscores the importance of taking the time to understand your data, to get to know its needs, and especially its aspirations for the future, so that you can choose a function that has a chance of successfully approximating the true underlying function.
1. This function goes by quite a long list of names, since it was invented independently for different reasons. Someday I will write a blog post explaining this function’s fascinating history.
2. For the benefit of the astute and persnickety reader, I realize that I am using the word “fit” incorrectly. The term should be applied strictly to just how the approximating function behaves at the points used in the regression. It would have been more appropriate for me to have said “fit to the original function”.
3. Ok… My statement here is too strong. A discrete Fourier transform is a regression, although we don’t normally think about it that way. Many folks, myself included, have successfully avoided Siberian consequences using a DFT that has as many parameters as data points. (But they still have to deal with the Gibbs phenomenon.) I apologize to anyone who just stuck their mukluks into their picnic basket.

