## Wednesday, September 12, 2012

### Finding the right model

Yogi Berra once said that “predictions are hard, especially when they are about the future”. Or maybe it was Niels Bohr who said it? Or Casey Stengel, or Mark Twain, or Sam Goldwyn, or Dan Quayle? Nobody is sure who first said it, because the past is also hard to predict.
I offer here one explanation of what makes prediction hard. It has to do with finding the right underlying model.
Which is the right model?
#### The population growth problem from 6th grade
The year was 1970. I was in sixth grade. The teacher gave us an exercise in looking for patterns which would (as a side effect) increase our awareness of the global population problem. We were given the data in the first two columns of the table below. Note that the years listed are for successive doublings of the world population. Our assignment was to compute the doubling time, shown in column three, and then predict the world population in the year 2020.
| Year | World population | Doubling time (years) |
|---|---|---|
| 4331 BC | 50 million | |
| 1131 BC | 100 million | 3200 |
| 470 AD | 200 million | 1600[1] |
| 1270 AD | 400 million | 800 |
| 1670 AD | 800 million | 400 |
| 1870 AD | 1.6 billion | 200 |
| 1970 AD | 3.2 billion | 100 |
We were supposed to notice that each successive doubling time is half the previous doubling time. The world population doubled between 1870 and 1970 (in 100 years), so the next doubling to 6.4 billion would require 50 years. According to the rule we determined in class, the world population would reach 6.4 billion in 2020.
Well, the estimate was a bit off. We reached 6.4 billion in 2005. Clearly the prediction was not drastic enough. But wait…
Taking this a step further, we would expect that the world population would hit 12.8 billion in 2045, and 25.6 billion in July of 2057. In sixth grade, there was something unsettling about this. Of course, the idea of unrestrained population growth was alarming, but somehow I couldn’t help but think that there was something wrong there.
The ludicrousness of this prediction only occurred to me years later. According to the model, the doubling period would eventually reach the very short time span of nine months. Now, in order for the population to double in this amount of time, every woman on the planet, aged 1 to 101, would need to be pregnant, and must give birth to twins[2]! The next doubling would occur in only four and a half months, so all the twins would need to be pregnant when they are born... What a curious world our descendants will live in!
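The classroom rule is easy to run forward. Here is a minimal sketch, assuming only the rule described above (each doubling takes half as long as the previous one) and starting from the 1970 figure of 3.2 billion. The function name and its parameters are made up for illustration.

```python
def project_doublings(year=1970.0, population=3.2e9, last_doubling_time=100.0, steps=8):
    """Apply the classroom rule: each doubling takes half as long as the one before."""
    rows = []
    doubling_time = last_doubling_time
    for _ in range(steps):
        doubling_time /= 2.0          # halve the previous doubling time
        year += doubling_time         # advance to the next doubling
        population *= 2.0
        rows.append((year, population, doubling_time))
    return rows


for year, population, doubling_time in project_doublings():
    months = doubling_time * 12
    print(f"{year:8.2f}  {population / 1e9:7.1f} billion  (doubling took {months:6.1f} months)")
```

The seventh doubling takes a little over nine months and the eighth under five. The halved doubling times (50, 25, 12.5, ...) add up to less than 100 years, so under this rule the years never get past 2070 while the population keeps doubling without limit.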
#### In the year 2081

In his book 2081: A Hopeful View of the Human Future, Gerard K. O'Neill made an interesting historical observation. He looked at the top speed of the typical mode of travel in each century. Here is his chart.
| Year | Mode of travel | Top speed |
|---|---|---|
| 1781 | Stagecoach | 6 MPH |
| 1881 | Train | 60 MPH |
| 1981 | Jet | 600 MPH |
| 2081 | ?? | 6,000 MPH |
He noted that every century, there has been a ten-fold increase in speed, so it would follow logically that in another century we will hop into our mass transit vehicles (whatever they may be) and speed off at eight times the speed of sound. Maybe that’s not unreasonable. If we are taking pleasure trips to the moon, then that might actually be rather slow.
Now if we take his formula further forward, we can see that in the year 2581, people will be regularly making trips at about 90% of the speed of light, and by 2681 at about nine times the speed of light. This stretches my credulity a little bit. I prefer to obey speed limits, especially when it comes to the speed of light. Clearly science fiction writers disagree with me on that.
If we go the other direction, we can see that O’Neill’s formula falls into ridiculousville almost immediately. His formula would predict that in 1681 a typical speed of travel would be 0.6 MPH, about one-fifth of a person’s normal walking speed.
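The trend in the chart can be written as a one-line formula and evaluated for any year. This is a sketch, assuming the trend is exactly a tenfold increase per century anchored at 6 MPH in 1781. The function name is hypothetical, and the speed of light is converted to miles per hour and rounded.

```python
SPEED_OF_LIGHT_MPH = 670_616_629  # speed of light in miles per hour (rounded)


def oneill_speed_mph(year):
    """Tenfold speed increase per century, anchored at 6 MPH in 1781."""
    return 6.0 * 10 ** ((year - 1781) / 100)


for year in (1681, 1781, 1881, 1981, 2081, 2581, 2681):
    speed = oneill_speed_mph(year)
    fraction_of_c = speed / SPEED_OF_LIGHT_MPH
    print(f"{year}: {speed:>16,.1f} MPH  = {fraction_of_c:.2e} x light speed")
```

Going back one century from the stagecoach gives 0.6 MPH, and going forward one century from the jet gives 6,000 MPH. The same formula keeps multiplying by ten forever, with nothing in it that knows about walking speed or the speed of light.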
What we have here is another example of an inappropriate mathematical model being used to make predictions. We might as well play off the fact that the word “train” has half as many letters as “stagecoach”, and that “jet” has half as many letters as “train”. (Well… kinda.) The next big leap forward in travel will have 1.25 letters.
#### The exponential growth of energy usage
I had a sense of déjà vu when I took Environmental Geology in college. There was a homework question that was aimed at impressing on us the disastrous implications of unbridled exponential growth.
The problem stated that the annual world-wide energy usage had been increasing throughout the century at an annual rate of 5%. The problem went on to state that there is a theoretical upper bound on the total energy available if all the matter in the entire Earth is converted into energy. This upper bound is stated in Einstein’s formula $E = mc^2$. Figures were given for the mass of the Earth and current energy usage, and the question was asked: In what year would our annual energy requirements equal the total available energy?
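The homework calculation can be sketched in a few lines. The figures below are assumed round numbers, not the ones from the original assignment (which are not reproduced here). In particular, the annual usage of 3 × 10^20 joules is a hypothetical starting value, and the function and constant names are made up for illustration.

```python
import math

# Assumed round figures; the original homework's numbers are not given in the text.
EARTH_MASS_KG = 5.97e24        # mass of the Earth
SPEED_OF_LIGHT_M_S = 2.998e8   # speed of light
ANNUAL_USAGE_J = 3.0e20        # hypothetical world energy use per year, in joules
GROWTH_RATE = 0.05             # 5% annual growth, as stated in the problem


def years_until_usage_equals(total_energy_j, usage_j=ANNUAL_USAGE_J, rate=GROWTH_RATE):
    """Solve usage * (1 + rate)**n = total_energy for n."""
    return math.log(total_energy_j / usage_j) / math.log(1.0 + rate)


earth_energy_j = EARTH_MASS_KG * SPEED_OF_LIGHT_M_S ** 2   # E = m c^2
years = years_until_usage_equals(earth_energy_j)
print(f"Mass-energy of the Earth: {earth_energy_j:.3e} J")
print(f"Years of 5% growth needed: {years:.0f}")
```

With these assumed inputs the answer comes out on the order of a thousand years. Because of the logarithm, the result is not very sensitive to the starting figure: getting the current usage wrong by a factor of ten shifts the answer by only about 47 years.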
The answer we arrived at was only a few millennia away, perhaps the year 3500, or maybe 10,000, I honestly don’t remember. I do remember the next question, and the answer that I gave. The question was “What do you conclude?” I knew that the correct answer was “It is high time for us to get off our lazy butts and do something about the energy crisis, because in fewer than 100 generations there will be a million billion trillion people standing on the last crumb that is left of the Earth.”
I knew that, but I was stubborn. My answer was “Nothing”. For some reason, I did not receive full credit for that answer. The university environment does not favor the creative mind. Or the lazy one, either.
#### My comeback
I had lost one point on a homework assignment that was worth 1% of my grade in the class. I should have just stopped right there, but I had a point to make. It was all about the principle of the thing. I spent hours and hours defending my short answer.
First, I amended my answer a little, from “nothing” to “nothing, because the answer depends a great deal on the underlying mathematical model that is assumed for the growth of energy usage”. Then, I got out my slide rule and did some curve fitting. I don’t have the original work. I am sure that the paper I wrote it out on has long since crumbled to dust. I will reproduce the salient aspects of it.
In this first graph, I show some data that I cooked up. The data is an exponential curve with some noise added to it. In the graph, I also show the least squares fit of an exponential curve. This shows that the data can be approximated with an exponential curve that is increasing at 5.33% per year for 50 years.
Hypothetical energy usage, with exponential curve
I will take this to be a reasonable approximation of what energy usage data might look like. And, based on this data, a scientist or policymaker might come to the conclusion that energy usage is going up at a rate of about 5% per year. (Actually, the data I have dug up looks more chaotic than this. Wars and other catastrophes do a good job of being unpredictable.)
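For readers who want to cook up similar data, here is a minimal sketch. It is not the original data or the original fitting procedure: it assumes 5% annual growth with 5% multiplicative Gaussian noise and a fixed random seed, and it fits the exponential by ordinary least squares on the logarithm of the data, which is one simple way to do an exponential fit. All names are hypothetical.

```python
import math
import random


def make_noisy_exponential(n_years=51, growth=0.05, noise=0.05, seed=1):
    """Hypothetical usage data: (1 + growth)**t with multiplicative Gaussian noise."""
    rng = random.Random(seed)
    ts = list(range(n_years))
    ys = [(1.0 + growth) ** t * (1.0 + noise * rng.gauss(0.0, 1.0)) for t in ts]
    return ts, ys


def fit_exponential(ts, ys):
    """Fit y = a * exp(b * t) by ordinary least squares on log(y)."""
    logs = [math.log(y) for y in ys]
    n = len(ts)
    t_mean = sum(ts) / n
    l_mean = sum(logs) / n
    slope_num = sum((t - t_mean) * (l - l_mean) for t, l in zip(ts, logs))
    slope_den = sum((t - t_mean) ** 2 for t in ts)
    b = slope_num / slope_den
    a = math.exp(l_mean - b * t_mean)
    return a, b


ts, ys = make_noisy_exponential()
a, b = fit_exponential(ts, ys)
annual_growth = math.exp(b) - 1.0
print(f"fitted curve: y = {a:.3f} * exp({b:.4f} * t), about {annual_growth:.2%} per year")
```

The fitted rate lands close to, but not exactly on, the 5% used to generate the data, which is the same kind of small discrepancy as the 5.33% in the graph. Fitting in log space weights relative errors equally, so a direct least squares fit on the raw values would give a slightly different rate.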
In this next graph, I show that same data, but this time it is approximated by a parabola. Looking at this curve fit, I don’t think I could testify in court that the data must be an exponential. A parabola doesn’t do too badly at fitting the data.
Hypothetical energy usage, with parabolic fit curve
Although, maybe the fit is not so good at the far left side? The parabola dips downward just a tiny bit in the first few years, and the data seems to be going upward. I can fix this fairly easily by using a cubic parabola, a third order polynomial. Just looking at the graphs, I can see no reason why someone would reject this fit over the fit of the exponential.
Hypothetical energy usage, with third order polynomial fit
Why stop at third order? Just for grins, I had a look at fitting the data with a fifth order polynomial. Once again, the fit looks pretty good.
Hypothetical energy usage, with fifth order polynomial fit
Why not try seventh order? Well, I did try it, and I think I will reject this one, maybe just for aesthetic reasons. The curve is a little bumpy. I am not sure the data has enough evidence to support those bumps.
Hypothetical energy usage, with seventh order polynomial fit
Taking a brief excursion back to ridiculousville, I tried using a 20th order polynomial to fit the data. Clearly the wiggles in this polynomial are not a true feature of the real data, but a feature of the noise. (Those who read my post on When regression goes bad will understand why this failed.)
Hypothetical energy usage, with twentieth order polynomial fit
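The polynomial fits above can be imitated with a small least squares routine. This is a sketch under stated assumptions, not the original curve fitting: the data is regenerated as a hypothetical noisy 5% exponential, the fit solves the normal equations directly (with the years rescaled to the range -1 to 1 to keep the arithmetic well behaved), and only orders 2, 3, 5 and 7 are tried. All names are made up for illustration.

```python
import random


def make_data(n_years=51, growth=0.05, noise=0.05, seed=1):
    """Hypothetical usage data: (1 + growth)**t with multiplicative Gaussian noise."""
    rng = random.Random(seed)
    ts = [float(t) for t in range(n_years)]
    ys = [(1.0 + growth) ** t * (1.0 + noise * rng.gauss(0.0, 1.0)) for t in ts]
    return ts, ys


def fit_polynomial(ts, ys, degree):
    """Least squares polynomial fit; returns a function that predicts y at time t."""
    lo, hi = min(ts), max(ts)

    def scale(t):
        # Map the fitted time range onto [-1, 1].
        return 2.0 * (t - lo) / (hi - lo) - 1.0

    xs = [scale(t) for t in ts]
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x**(i+j)) and b[i] = sum(y * x**i).
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]

    def predict(t):
        x = scale(t)
        return sum(c * x ** k for k, c in enumerate(coeffs))

    return predict


def rms_residual(model, ts, ys):
    """Root mean square difference between the model and the data."""
    return (sum((model(t) - y) ** 2 for t, y in zip(ts, ys)) / len(ts)) ** 0.5


ts, ys = make_data()
results = {}
for degree in (2, 3, 5, 7):
    model = fit_polynomial(ts, ys, degree)
    results[degree] = (rms_residual(model, ts, ys), model(100.0), model(500.0))
    rms, at_100, at_500 = results[degree]
    print(f"order {degree}: RMS residual {rms:.3f}, year 100 -> {at_100:.4g}, year 500 -> {at_500:.4g}")
```

Inside the data range, raising the order can only shrink the residual, so every fit looks at least as good as the one before it. The values printed for year 100 and year 500 show what each of those equally plausible curves does once it leaves the data behind.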
I did one last curve fit. This one starts with some reasonable assumptions about growth. An exponential curve is a reasonable approximation for growth of most physical things at the onset, but eventually in any real system, there has to be some saturation. Bunnies multiply, but eventually they run out of food.
For anything real, there has to be constrained growth. One commonly used model for this is the logistic curve. This curve shows initial exponential growth, but the growth gradually slows down as it approaches an asymptote. Once again, the fit looks fairly reasonable.
Hypothetical energy usage, with logistic curve fit
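For reference, a common way to write the logistic curve is y = K / (1 + exp(-r (t - t0))), where K is the asymptote, r sets the early growth rate, and t0 is the year in which the curve reaches half of K. The sketch below evaluates that formula. The parameter values are hypothetical, picked only to give a curve of roughly the right shape, and are not the fitted values behind the graph.

```python
import math


def logistic(t, ceiling, rate, midpoint):
    """Logistic curve: near-exponential growth early on, levelling off at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters chosen for illustration, not the fitted values from the graph.
CEILING, RATE, MIDPOINT = 28.4, 0.06, 55.0

for year in (0, 25, 50, 100, 500):
    print(f"year {year:3d}: {logistic(year, CEILING, RATE, MIDPOINT):6.2f}")
```

Well before the midpoint, each year multiplies the value by almost exp(r), just like an exponential. Well after it, the value sits just under the ceiling and stays there however far out the curve is extended.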
#### The punch line
So far, all I have done is demonstrate that a number of curves can be bent around to look like a noisy exponential growth curve. At arm’s length, they all do a modest job at approximating the data that I provided. While some curves are somewhat better than others, there is no slam dunk best curve. Hang on to that thought, because the punch line is coming.
Each of the curves that I have fit to the data can be used to predict what the energy usage will be at some later date. I have gone through that exercise with each of these curves to yield a prediction about the energy usage in year 100 (50 years beyond the end of the data), and at year 500.
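| Model | Year 100 | Year 500 |
|---|---|---|
| Exponential | 180.6 | $1.919 \times 10^{11}$ |
| Parabola | 51.03 | 1337 |
| Cubic | 87.83 | 11,160 |
| Fifth order | -10.30 | $-1.608 \times 10^{6}$ |
| Sixth order | -4,978 | $-3.662 \times 10^{8}$ |
| Seventh order | -39,700 | $-1.668 \times 10^{10}$ |
| Eighth order | -159,000 | $-3.798 \times 10^{11}$ |
| Fifteenth order | $4.355 \times 10^{10}$ | $5.756 \times 10^{22}$ |
| Twentieth order | $-5.857 \times 10^{13}$ | $-4.471 \times 10^{29}$ |
| Logistic | 27.86 | 28.42 |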
I hope you are saying wow. These equations all looked kind of similar from t = 0 to t = 50. Even at t = 100, we have mega-ginormous disagreements on what the energy usage will be: anywhere from the silly value of $-6 \times 10^{13}$ up to the tremendous value of $4 \times 10^{10}$. I think I can safely say that I have made my point. The underlying choice of model might not matter much for interpolation, but for extrapolation, the choice of model can change an estimate by many orders of magnitude.
Actually, the only model that did not give huge answers for the 500 year estimate is the logistic model, the mathematical equation that was designed to model constrained growth. Hmmm…
#### Conclusion
Considerable effort usually goes into finding an equation that fits the existing data. Often, a variety of equations are tried and the one that best fits the existing data is chosen.
This typical process omits a crucial step. That step is getting to know the data, understanding the natural constraints, and looking at the forces that drive the values up or down. This knowledge should drive the choice of mathematical model.

[1] Little bit of trivia here... the year before 1 AD was 1 BC, rather than 0. The correct difference between these two years is thus 1600, rather than 1601.
[2] I assume that there are an equal number of men and women. If there is only a single, very busy man, the women can be spared having to have twins.