There is always a temptation, when one is using regression to fit a data set, to add a few more parameters to make the fit better. Don’t give in to the temptation!
[Image: Temptation]
In the case of polynomial regression, this might be the delectable temptation of adding one more degree to the polynomial. If y(x) is approximated well by a weighted sum of 1, x², and x³, then (one could reasonably assume) adding x⁴ and x⁵ will only make the fit better, right? Technically, it can’t make the fit worse, and in most real cases the fit to the data will be improved with every predictive term that is added.
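To make that concrete, here is a quick sketch in Python, with made-up data rather than anything from this post, showing that each added polynomial term can only drive the residual on the fitting data down, never up:

    import numpy as np

    # Made-up sample data, just to show that adding polynomial terms can only
    # lower (never raise) the sum of squared residuals on the fitting data.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 30)
    y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

    for degree in range(1, 8):
        # full=True makes polyfit also return the residual sum of squares
        _, residual, *_ = np.polyfit(x, y, degree, full=True)
        print(f"degree {degree}: residual sum of squares = {residual[0]:.4f}")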
If one is doing multiple linear regression, the same thing applies. Let’s say, for example, one is trying to determine a formula that will predict tomorrow’s temperature. It would be reasonable to use the temperature today and yesterday as predictors. It might reduce the residuals to add in today’s humidity and the wind speed. The temperature from surrounding areas might also improve the fit to historical data. The more parameters are added, the better the fit.
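The same pattern can be sketched for multiple linear regression. The weather numbers below are invented purely for illustration; the point is only that adding predictor columns can never increase the residual sum of squares on the data used for the fit:

    import numpy as np

    # Hypothetical daily weather records (made-up numbers): predict tomorrow's
    # temperature from today's and yesterday's temperatures, then check whether
    # adding humidity and wind speed shrinks the residual.
    temp_today     = np.array([21.0, 23.5, 19.8, 25.1, 22.4, 18.9, 24.3, 20.7])
    temp_yesterday = np.array([20.2, 21.8, 22.1, 23.9, 24.0, 20.5, 22.8, 19.9])
    humidity       = np.array([0.61, 0.55, 0.72, 0.40, 0.48, 0.80, 0.44, 0.66])
    wind_speed     = np.array([12.0,  8.5, 15.2,  5.1,  9.9, 18.3,  7.4, 11.1])
    temp_tomorrow  = np.array([22.8, 21.0, 24.6, 23.1, 19.5, 23.9, 21.2, 20.3])

    def residual_ss(columns):
        """Least-squares fit of temp_tomorrow on the given predictors (plus an
        intercept); returns the residual sum of squares."""
        X = np.column_stack([np.ones_like(temp_tomorrow)] + columns)
        coef, *_ = np.linalg.lstsq(X, temp_tomorrow, rcond=None)
        return np.sum((temp_tomorrow - X @ coef) ** 2)

    print(residual_ss([temp_today, temp_yesterday]))                        # two predictors
    print(residual_ss([temp_today, temp_yesterday, humidity, wind_speed]))  # four predictors: never larger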
There is, of course, a law of diminishing returns. At some point, each additional parameter may only reduce the residual error by a minuscule amount. So, simply from the standpoint of simplicity, it makes sense to eliminate those variables that don’t cause a significant improvement.
But there is another reason to be parsimonious. This reason is more than a matter of simple convenience. Adding parameters has the potential to make regression go bad!
A simple regression example
Let’s say that we wanted to use regression to fit a polynomial to a Lorentzian¹ function:
[Image: Equation for a Lorentzian]
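Assuming the Lorentzian in question is the unit-height version, y = 1/(1 + x²), which peaks at 1.000 at x = 0 and reproduces the numbers quoted later in this post, a minimal sketch of it in Python looks like this:

    import numpy as np

    # Assumed form of the Lorentzian in this post: unit height, peak of 1 at x = 0.
    def lorentzian(x):
        return 1.0 / (1.0 + x ** 2)

    print(lorentzian(0.0))   # 1.0 at the peak
    print(lorentzian(5.0))   # about 0.038 out at the edge of the plotted range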
This function is fairly well behaved and is illustrated in the graph below.
[Image: Plot of the Lorentzian]
Oh… there’s a question in the back. “Why would someone want to do polynomial regression to approximate this? Isn’t it easy enough just to compute the formula for the Lorentzian directly?” Yes, it is. I am using this just as an example. It gives me data that is nice and smooth, data where we don’t have to worry about noise, and where we can be sure what happens between the data points.
The plot above shows knot points at integer locations: -5, -4, -3 … 4, 5. We will use these eleven points as the data set that feeds the regression. In the graph below, we see what happens if we use a fourth-order polynomial to approximate the Lorentzian.
[Image: Fourth-order polynomial fit to the Lorentzian]
While the overall shape isn’t horrible, the fourth-order polynomial, shown in red, is a bit disappointing. At zero, the polynomial predicts a value of 0.655, whereas it should be 1.000; elsewhere it is not quite as bad.
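For the curious, the fourth-order fit can be reproduced in a few lines of code. This again assumes the Lorentzian is y = 1/(1 + x²); with the eleven integer knot points, the fitted polynomial comes out to roughly 0.655 at zero, just as described above:

    import numpy as np

    # Fourth-order least-squares fit to the eleven knot points, assuming the
    # Lorentzian is y = 1/(1 + x^2).
    def lorentzian(x):
        return 1.0 / (1.0 + x ** 2)

    knots_x = np.arange(-5, 6)          # the eleven knot points -5, -4, ..., 5
    knots_y = lorentzian(knots_x)

    p4 = np.poly1d(np.polyfit(knots_x, knots_y, 4))

    print(p4(0.0))   # roughly 0.655, versus the true value of 1.000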
What to do? Well, the natural response is to add more terms to the regression. Each term will bring the polynomial closer to the knot points. In fact, since there are eleven data points to go through, we know that we can find a tenth-order polynomial that will fit this exactly. Clearly this is a good goal, right? Trying to drive the residual error down to zero, right? Then we will have a polynomial that best matches our Lorentzian curve, right?
I should hope that I have done enough foreshadowing to suggest to the reader that this might not be the best course of action. And I hope that the repetitive use of the italicized word “right” at the end of each question was a further clue that something is about to go wrong.
[Image: A tenth-order polynomial fit to the Lorentzian]
And wrong it did go. We see that the graph of the tenth-order polynomial did an absolutely fabulous job of approximating the Lorentzian at the knot points. It goes right through each of those points. We also see that the fit in the region from -1 to +1 is not too bad.² However, as we go further away from zero, the fit gets worse and worse as it oscillates out of control. Between 4 and 5, the polynomial traces out a curve that is just plain lousy. Linear interpolation would have been so much better. Something went bad, and it wasn’t last week’s lox in the back of the fridge.
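Here is the same sketch taken up to a tenth-order polynomial, still under the assumption that the Lorentzian is y = 1/(1 + x²). The polynomial passes through every knot point essentially exactly, yet between the knots it wanders far from the curve it is supposed to approximate:

    import numpy as np

    # Tenth-order fit: with eleven knot points it interpolates them exactly,
    # but it misbehaves between the outer knots (assuming y = 1/(1 + x^2)).
    def lorentzian(x):
        return 1.0 / (1.0 + x ** 2)

    knots_x = np.arange(-5, 6)
    knots_y = lorentzian(knots_x)

    p10 = np.poly1d(np.polyfit(knots_x, knots_y, 10))

    print(np.max(np.abs(p10(knots_x) - knots_y)))   # essentially zero: hits every knot
    print(p10(4.5), lorentzian(4.5))                # far apart between the outer knots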
Let’s make this example a bit more practical. Often when we are doing regression, we are not interested in predicting what happened between sample points; rather, we ultimately want to predict what will happen outside the range. Interpolation is sometimes useful, but we often want extrapolation. In the weather prediction example, I want to know the weather tomorrow so I can plan a picnic. I don’t usually care about what the temperature was an hour ago.
If we look just outside the range in our example, things get really bad. No… not really bad. It gets really, really bad. I mean awful. I mean “Oh my God, we are going to crash into the Sun” bad. The graph below shows what happens if we extend the approximation out a little bit, just out to -8 and +8. Just a few points beyond where we have data. How bad could that be?
[Image: A tenth-order polynomial fit to the Lorentzian, extended from -8 to +8]
The observant reader will note that the scale of the plot was changed just a tiny bit. The polynomial approximation at x=8 is -8669.7. I asked “how bad could it be?” Well, if this model is going to predict tomorrow’s temperature to be anything below eight thousand degrees below zero, I am going to give up my plans for a picnic and party like crazy tonight, because all the beer in my fridge will be frozen tomorrow.
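Evaluating that same tenth-order polynomial outside the data range reproduces the disaster. Under the same assumption about the Lorentzian, the value at x = 8 comes out near -8670, in agreement with the number quoted above:

    import numpy as np

    # Extrapolating the tenth-order fit beyond the knot points, assuming the
    # Lorentzian is y = 1/(1 + x^2).
    def lorentzian(x):
        return 1.0 / (1.0 + x ** 2)

    knots_x = np.arange(-5, 6)
    p10 = np.poly1d(np.polyfit(knots_x, lorentzian(knots_x), 10))

    print(p10(8.0))          # roughly -8670
    print(lorentzian(8.0))   # about 0.015, the value it is supposed to approximate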
What went wrong?
In this example, there were two things that caused this regression to go horribly wrong. The first is that I tried to push regression farther than it should be pushed. Under no conditions should anyone use regression with that many parameters when they only have eleven data points.³ People who do that should be sent away to have a picnic somewhere where it is eight thousand degrees below zero. I am going to make the bold statement that, even if you have 11,000 data points, you should be very careful about using eleven parameters in your regression.
The second thing that went wrong is that the mathematical model used to fit the data was not so good. The Lorentzian function does something that polynomials are just not good at: it has an asymptote. Polynomials just aren’t good at asymptotes. The only polynomial that has a horizontal asymptote at zero is the polynomial that is identically zero everywhere. And that polynomial is just plain boring. Even if we were a bit more reasonable about the order of the polynomial that we used, we would never get a good approximation of the Lorentzian out in the tails by using polynomials.
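Even the well-behaved fourth-order fit illustrates the point: its leading term eventually dominates, so it runs off to infinity in the tails while the Lorentzian quietly settles toward zero. A quick check, under the same assumption about the Lorentzian:

    import numpy as np

    # The fourth-order fit cannot follow the Lorentzian's tail toward zero;
    # any non-constant polynomial eventually heads off to infinity.
    # (Assuming y = 1/(1 + x^2).)
    def lorentzian(x):
        return 1.0 / (1.0 + x ** 2)

    knots_x = np.arange(-5, 6)
    p4 = np.poly1d(np.polyfit(knots_x, lorentzian(knots_x), 4))

    for x in (5.0, 10.0, 20.0):
        print(x, p4(x), lorentzian(x))   # the polynomial grows; the Lorentzian shrinks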
This underscores the importance of taking the time to understand your data, to get to know its needs and especially its aspirations for the future, so that you can choose a function that has a chance of successfully approximating the true underlying function.
1. This function goes by quite a long list of names, since it was invented independently for different reasons. Someday I will write a blog post explaining this function’s fascinating history.
2. For the benefit of the astute and persnickety reader, I realize that I am using the word “fit” incorrectly. The term should be applied strictly to how the approximating function behaves at the points used in the regression. It would have been more appropriate for me to have said “fit to the original function”.
3. Ok… My statement here is too strong. A discrete Fourier transform is a regression, although we don’t normally think about it that way. Many folks, myself included, have successfully avoided Siberian consequences using a DFT that has as many parameters as data points. (But they still have to deal with the Gibbs phenomenon.) I apologize to anyone who just stuck their mukluks into their picnic basket.