
Wednesday, August 7, 2013

Lousy weather every weekend

It was a Monday after a rainy weekend. Naturally, since I had to go to work, the weather was excellent. I was particularly downcast since the weather had put a damper on the previous weekend as well. I mentioned it at work, and got a surprising response.
Actual photo of me enjoying the lovely weekend weather [1]

I should tell you that this was quite a long time ago, back before Al Gore invented the internet. I was working at the University of Wisconsin Space Science and Engineering Center. You would think that with a name like that, we would all be working on something dull like a space telescope, a Mars rover, or a space shuttle porta-potty. But it was far more exciting. We were crunching data from weather satellites.

Getting back to my story about a forlorn math guy complaining about the weather, when I made my comment about two weekends lost to lousy weather, I just happened to be surrounded by meteorologists. I was told that everybody knew that weather was cyclical, and it tends to follow a weekly pattern. If you know the weather on one particular Saturday, then you have a darn good guess at the weather the following Saturday.

I have used this factoid frequently. It always sounds impressive when I tell people what next weekend's weather is going to be like. I should say, it sounds impressive when they believe me. Whether my prediction comes true is actually irrelevant. No one remembers my predictions.

Is the "weekly weather pattern" factoid for real? I put it to the test.

The data  

It isn't hard to find information on the current weather conditions. But yesterday's weather? That's a bit harder to find. And historical data? I found a place to dig up old weather. [2]

I downloaded four years' worth of weather data for Milwaukee, from July 22, 2009 to July 22, 2013. Why four years? Well... there is a story there. Once upon a time, there was a computer magazine called Byte. It was the magazine for computer geeks. They published an article with a BASIC program for doing a fast Fourier transform. I dug into the code and found it wasn't remotely the FFT (fast Fourier transform) algorithm that Cooley and Tukey made famous. Later, they had an article that looked at sales data spanning many years. They said that a "four year analysis" was used to look for trends. I dropped my subscription. [3]

Mean temperature, Fourier analysis

Let's look first at the temperature (see plot below). It is apparent from the sine wave kinda thingie in the plot that I have collected four years of data, and that the start of the data is kinda mid-summer.
Average daily temperature in Milwaukee over the past four years

That graph in and of itself is pretty cool, but playing with the data? I don't know about other math guys, but the first thing I like to do when I get a fresh new pile of data is to start taking Fourier transforms. It took a bit of massaging of the raw FFT output, but here is a graph showing a section of the frequency data. This graph shows the strength of the periodicity going from a four day period (on the left) to a 10 day period (on the right). There is no discernible peak or bump near the once-per-week mark.

FFT of the mean daily temperature
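For anyone who wants to play along at home, here is roughly what that computation might look like in Python. This is a sketch rather than my actual code, and the file name is just a placeholder.

```python
# A sketch of the FFT analysis: daily mean temperatures, one value per line,
# in a hypothetical file. The only "massaging" here is removing the mean and
# converting frequency to period.
import numpy as np

temps = np.loadtxt("milwaukee_daily_mean_temp.txt")   # ~1462 daily values
temps = temps - temps.mean()                          # remove the DC offset

spectrum = np.abs(np.fft.rfft(temps))                 # strength at each frequency
freqs = np.fft.rfftfreq(len(temps), d=1.0)            # cycles per day

# Look at the 4- to 10-day window, where a weekly cycle would show up
# as a bump near a 7 day period.
periods = 1.0 / freqs[1:]
mask = (periods >= 4) & (periods <= 10)
for p, s in zip(periods[mask], spectrum[1:][mask]):
    print(f"period {p:5.2f} days   strength {s:8.1f}")
```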

Correlation

Here is another look at that same data. In this case, instead of doing the whole Fourier bit, I did some correlations. I checked to see if the daily temperature on a given day correlated with the daily temperature on the previous day. So... an array of four years of data, 1462 data points. I computed the correlation coefficient between two sub-arrays, one going from day 1 to day 1461, and the other going from day 2 to day 1462. Then I computed the same thing, only with a lag of two days, and three days, and so on.
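Here is a sketch of that lagged-correlation computation, once again in Python with a made-up file name:

```python
# Correlation between the temperature series and a lagged copy of itself.
import numpy as np

temps = np.loadtxt("milwaukee_daily_mean_temp.txt")   # ~1462 daily values
n = len(temps)

for lag in range(0, 31):
    # Pair day 1..n-lag with day 1+lag..n and compute the correlation coefficient.
    r = np.corrcoef(temps[:n - lag], temps[lag:])[0, 1]
    print(f"lag {lag:2d} days   r = {r:.3f}")
```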

The graph below shows the results. At the very, very left, the correlation between each day's temperature and that same day's temperature is 1.0. Well, duh. The temperature on a given day looks a heckuva lot like the temperature on that same day. The plot below shows that today's temperature also looks a lot like yesterday's, with a correlation coefficient of close to 0.95. In the middle of the chart we see the correlation of today with one week ago. According to the hoity-toity meteorologists, I should see a big spike there, but no. Using the temperature a week ago to predict today's temperature is good (r = 0.85), but it is no better or worse than using the temperature six days or eight days ago.

Correlation coefficient of temperature as a function of spacing (in days)

So, I would say this myth is pretty well busted. That's the way I like my myths: pretty and well-busted. Temperature does not follow a hebdomadal pattern. That's a pity. It would have explained why a week isn't six days, or (shudder the thought!) eight days.

But wait! What about rain?!?!?

I missed something here, didn't I? This blog post started out talking about rain, not temperature. Maybe I should have a look see at the rain data? Luckily, this database contains precipitation data, as well as wind speed and direction, barometric pressure, humidity, ... but all I want is the rain data.

I present below another correlation chart, this one showing the correlation of precipitation. There it is, as plain as the nose on Owen Wilson's face, a spike at lucky seven. There is a correlation (r = 0.1), which is significant at roughly the 99.95% level. So. Maybe my meteorologist friends weren't all that dumb after all?

Correlation coefficient of precipitation as a function of spacing (in days)
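How seriously should we take an r of 0.1? Here is a quick significance check, sketched in Python. The exact number of usable day-pairs depends on the lag and on missing data, so 1455 is just a ballpark figure.

```python
# Significance of a correlation of r = 0.1 over roughly 1455 pairs of days.
import math
from scipy import stats

r, n = 0.1, 1455
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
p = 2 * stats.t.sf(t, df=n - 2)       # two-tailed p-value
print(f"t = {t:.2f}, p = {p:.4f}")    # p comes out around 0.0001, i.e. well
                                      # past the 99.95% significance level
```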

I should make a comment here about how big the number 0.1 is when it comes to correlation coefficients. Let's just say that Norm decided to make some money on these results. Every night, for four years, he would sit at a bar in Milwaukee, and take bets on whether it would rain the following week. I know, ideal job, right? If Norm follows the advice in this blog, betting that the rain today can predict that of one week from today, I guarantee that he would come out ahead at the end of the four years.

Norm, pondering this money-making scheme

On the other hand, the number 0.1 is very tiny. The predictive strength when r = 0.1 is on the order of 0.005. That's an indication of how much the odds are in Norm's favor. He is pretty sure of making money, but he needs to make a lot of bets to get there.

This may sound like a bad business model, but this is how casinos work. They want to run games where the table is tilted ever so slightly in their favor. Too much of a tilt (too low a chance of the patron winning), and people will walk away. Too little of a tilt, and the casino won't make a buck.

Another comment on the plot above. There are also peaks out at 29 and 31 days. Hmmmm... Maybe this is the effect of the moon?

---------------------------------
[1] If truth be told, although I have sung before, and have been in the rain, this is actually a picture of Gene Kelly, not me. If more truth be told, Jean Kelly is my sister. Here is my favorite painting of hers.

[2] If truth be told, I didn't find this website on my own. Nate Silver told me about it. If even more truth be told, he didn't actually tell me in person. I read it in one of his books: The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. He and I are good buddies. At least we would be if we ever met. That's my prediction.

[3] I had a house that I was getting ready to sell. The realtor told me that the entry way needed work - first impressions and all that. Updating this would give the whole house a new image. So, I gave it a fresh coat of paint, updated the light fixture, and polished up the handle of the big front door. I guess you could say that I performed image enhancement through a fast foyer transform. Some of you will find this joke incredibly funny.



Wednesday, September 12, 2012

Finding the right model

Yogi Berra once said that “predictions are hard, especially when they are about the future”. Or maybe it was Niels Bohr who said it? Or Casey Stengel, or Mark Twain, or Sam Goldwyn, or Dan Quayle? Nobody is sure who first said it, because the past is also hard to predict.
I offer here one explanation of what makes prediction hard. It has to do with finding the right underlying model.
Which is the right model?
The population growth problem from 6th grade
The year was 1970. I was in sixth grade. The teacher gave us an exercise in looking for patterns which would (as a side effect) increase our awareness of the global population problem. We were given the data in the first two columns of the table below. Note that the years listed are for successive doublings of the world population. Our assignment was to compute the doubling time, shown in column three, and then predict the world population in the year 2020.
Year       World population   Doubling time (years)
4331 BC    50 million
1131 BC    100 million        3200
470 AD     200 million        1600 [1]
1270 AD    400 million        800
1670 AD    800 million        400
1870 AD    1.6 billion        200
1970 AD    3.2 billion        100

Sixth grade assignment
We were supposed to notice that each successive doubling time is half the previous doubling time. The world population doubled between 1870 and 1970 (in 100 years), so the next doubling to 6.4 billion would require 50 years. According to the rule we determined in class, the world population would reach 6.4 billion in 2020.
Well, the estimate was a bit off. We reached 6.4 billion in 2005. Clearly the prediction was not drastic enough. But wait…
Taking this a step further, we would expect that the world population would hit 12.8 billion in 2045, and 25.6 billion in July of 2057. In sixth grade, there was something unsettling about this. Of course, the idea of unrestrained population growth was alarming, but somehow I couldn’t help but think that there was something wrong there.
The ludicrousness of this prediction only occurred to me years later. According to the model, the doubling period would eventually reach the very short time span of nine months. Now, in order for the population to double in this amount of time, every woman on the planet, aged 1 to 101, would need to be pregnant, and must give birth to twins[2]! The next doubling would occur in only four and a half months, so all the twins would need to be pregnant when they are born... What a curious world our descendants will live in!
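If you want to watch the absurdity unfold for yourself, here is the sixth-grade rule cranked out in a few lines of Python (a sketch of the rule from class, certainly not anything we ran in 1970):

```python
# The sixth-grade rule: each doubling time is half the previous one.
# Start from 3.2 billion people in 1970, with the last doubling having
# taken 100 years.
year, population, doubling_time = 1970.0, 3.2e9, 100.0

for _ in range(10):
    doubling_time /= 2.0        # the rule from class
    year += doubling_time
    population *= 2.0
    print(f"{year:8.2f}   {population:10.3g} people   "
          f"(doubling time {doubling_time:g} years)")
```

Notice, by the way, that the remaining doubling times (50 + 25 + 12.5 + ...) add up to just 100 years, so this model has the population heading off to infinity sometime before 2070.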
In the year 2081

In his book 2081: A Hopeful View of the Human Future, Gerard K. O'Neill made an interesting historical observation. He looked at the typical top speed at which people travelled. Here is his chart.

Year   Mode of travel   Top speed
1781   Stagecoach       6 MPH
1881   Train            60 MPH
1981   Jet              600 MPH
2081   ??               6,000 MPH
He noted that every century, there has been a ten-fold increase in speed, so it would follow logically that in another century we will hop into our mass transit vehicles (whatever they may be) and speed off at eight times the speed of sound. Maybe that’s not unreasonable. If we are taking pleasure trips to the moon, then that might actually be rather slow.
Now if we take his formula further forward, we can see that by the year 2581 people will be regularly making trips at nearly the speed of light, and a century after that at roughly ten times the speed of light. This stretches my credulity a little bit. I prefer to obey speed limits, especially when it comes to the speed of light. Clearly science fiction writers disagree with me on that.
If we go the other direction, we can see that O'Neill's formula falls into ridiculousville almost immediately. His formula would predict that the typical speed of travel in 1681 was 0.6 MPH, about one-fifth of a person's normal walking speed.
What we have here is another example of an inappropriate mathematical model being used to make predictions. We might as well play off the fact that the word "train" has half as many letters as "stagecoach", and that "jet" has half as many letters as "train". (Well… kinda.) The next big leap forward in travel will have 1.25 letters.
The exponential growth of energy usage
I had a sense of déjà vu when I took Environmental Geology in college. There was a homework question that was aimed at impressing on us the disastrous implications of unbridled exponential growth.
The problem stated that the annual world-wide energy usage had been increasing throughout the century at an annual rate of 5%. The problem went on to state that there is a theoretical upper bound on the total energy available if all the matter in the entire Earth is converted into energy. This upper bound comes from Einstein's formula E = mc². Figures were given for the mass of the Earth and current energy usage, and the question was asked: In what year would our annual energy requirements equal the total available energy?
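For the record, here is a back-of-the-envelope version of that calculation. The starting figures are my own guesses, since I no longer have the originals, but the flavor of the answer is the same.

```python
# When does energy use, growing 5% per year, exhaust the Earth's mass-energy?
# The annual energy figure below is an assumption, not the original number.
import math

mass_earth = 5.97e24                   # kg
c = 3.0e8                              # m/s
total_energy = mass_earth * c**2       # E = mc^2, about 5.4e41 J

annual_use = 5e20                      # J per year (assumed)
growth = 1.05                          # 5% per year

years = math.log(total_energy / annual_use) / math.log(growth)
print(f"about {years:.0f} years from now")   # on the order of a thousand years
```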
The answer we arrived at was only a few millennia away, perhaps the year 3500, or maybe 10,000, I honestly don't remember. I do remember the next question, and the answer that I gave. The question was “What do you conclude?” I knew that the correct answer was “It is high time for us to get off our lazy butts and do something about the energy crisis, because in fewer than 100 generations there will be a million billion trillion people standing on the last crumb that is left of the Earth.”
I knew that, but I was stubborn. My answer was “Nothing”. For some reason, I did not receive full credit for that answer. The university environment does not favor the creative mind. Or the lazy one, either.
My comeback
I had lost one point on a homework assignment that was worth 1% of my grade in the class. I should have just stopped right there, but I had a point to make. It was all about the principle of the thing. I spent hours and hours defending my short answer.
First, I amended my answer a little, from “nothing” to “nothing, because the answer depends a great deal on the underlying mathematical model that is assumed for the growth of energy usage”. Then, I got out my slide rule and did some curve fitting. I don’t have the original work. I am sure that the paper I wrote it out on has long since crumbled to dust. I will reproduce the salient aspects of it.
In this first graph, I show some data that I cooked up. The data is an exponential curve with some noise added to it. In the graph, I also show the least squares fit of an exponential curve. This shows that the data can be approximated with an exponential curve that is increasing at 5.33% per year for 50 years.
Hypothetical energy usage, with exponential curve
I will take this to be a reasonable approximation of what energy usage data might look like. And, based on this data, a scientist or policymaker might come to the conclusion that energy usage is going up at a rate of about 5% per year. (Actually, the data I have dug up looks more chaotic than this. Wars and other catastrophes do a good job of being unpredictable.)
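Here is a sketch of how one might cook up similar data and fit the exponential. The starting value, the noise level, and the 5% underlying growth rate are my own assumptions.

```python
# Cook up 50 years of noisy exponential "energy usage" data and fit an
# exponential to it by least squares on the logarithm.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(51)                                      # years 0 through 50
data = 10.0 * 1.05**t + rng.normal(0.0, 2.0, t.size)   # exponential plus noise

slope, intercept = np.polyfit(t, np.log(data), 1)      # line fit to log(data)
rate = np.exp(slope) - 1.0
print(f"fitted growth rate: {rate * 100:.2f}% per year")
```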
In this next graph, I show that same data, but this time it is approximated by a parabola. Looking at this curve fit, I don’t think I could testify in court that the data must be an exponential. A parabola doesn’t do too bad at fitting the data.
Hypothetical energy usage, with parabolic fit curve
Although, maybe the fit is not so good at the far left side? The parabola dips downward just a tiny bit in the first few years, and the data seems to be going upward. I can fix this fairly easily by using a cubic parabola, a third order polynomial. Just looking at the graphs, I can see no reason why someone would reject this fit over the fit of the exponential.
Hypothetical energy usage, with third order polynomial fit
Why stop at third order? Just for grins, I had a look at fitting the data with a fifth order polynomial. Once again, the fit looks pretty good.
Hypothetical energy usage, with fifth order polynomial fit
Why not try seventh order? Well, I did try it, and I think I will reject this one, maybe just for aesthetic reasons. The curve is a little bumpy. I am not sure the data has enough evidence to support those bumps.
Hypothetical energy usage, with seventh order polynomial fit
Taking a brief excursion back to ridiculousville, I tried using a 20th order polynomial to fit the data. Clearly the wiggles in this polynomial are not a true feature of the real data, but a feature of the noise. (Those who read my post on When regression goes bad will understand why this failed.)
Hypothetical energy usage, with twentieth order polynomial fit
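All of those polynomial fits can be had with a few lines of numpy. Here is a sketch, continuing from the t and data arrays above; numpy will grumble that the twentieth order fit is poorly conditioned, which is rather the point.

```python
# Fit polynomials of increasing order to the same cooked-up data and see
# how well each one hugs the points.
import numpy as np

for order in (2, 3, 5, 7, 20):
    coeffs = np.polyfit(t, data, order)
    residuals = np.polyval(coeffs, t) - data
    rms = np.sqrt(np.mean(residuals**2))
    print(f"order {order:2d}: RMS residual {rms:.2f}")
```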
I did one last curve fit. This one starts with some reasonable assumptions about growth. An exponential curve is a reasonable approximation for growth of most physical things at the onset, but eventually in any real system, there has to be some saturation. Bunnies multiply, but eventually they run out of food.
For anything real, there has to be constrained growth. One commonly used model for this is the logistic curve. This curve shows initial exponential growth, but the growth gradually slows down as it approaches an asymptote. Once again, the fit looks fairly reasonable.
Hypothetical energy usage, with logistic curve  fit
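And here is a sketch of the logistic fit, using scipy. The starting guesses are arbitrary and might need a nudge for a different batch of cooked-up data.

```python
# Fit a constrained-growth (logistic) curve to the same data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """Logistic curve: grows roughly exponentially at first, then
    flattens out as it approaches the asymptote K."""
    return K / (1.0 + np.exp(-r * (t - t0)))

params, _ = curve_fit(logistic, t, data, p0=(200.0, 0.05, 60.0))
K, r, t0 = params
print(f"asymptote K = {K:.1f}, rate r = {r:.3f}, midpoint t0 = {t0:.1f}")
```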
The punch line
So far, all I have done is demonstrate that a number of curves can be bent around to look like a noisy exponential growth curve. At arm’s length, they all do a modest job at approximating the data that I provided. While some curves are somewhat better than others, there is no slam dunk best curve. Hang on to that thought, because the punch line is coming.
Each of the curves that I have fit to the data can be used to predict what the energy usage will be at some later date. I have gone through that exercise with each of these curves to yield a prediction about the energy usage in year 100 (50 years beyond the end of the data), and at year 500.
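Here is the extrapolation step in sketch form, again continuing from the t and data arrays above. (The exponential and logistic fits extrapolate the same way, just with their own equations.) The full set of predictions is tabulated below.

```python
# Evaluate each fitted polynomial far outside the range of the data.
import numpy as np

for order in (2, 3, 5, 7, 20):
    coeffs = np.polyfit(t, data, order)
    y100 = np.polyval(coeffs, 100)
    y500 = np.polyval(coeffs, 500)
    print(f"order {order:2d}: year 100 -> {y100:.3g}, year 500 -> {y500:.3g}")
```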
Model              Year 100         Year 500
Exponential        180.6            1.919 × 10^11
Parabola           51.03            1337
Cubic              87.83            11,160
Fifth order        -10.30           -1.608 × 10^6
Sixth order        -4,978           -3.662 × 10^8
Seventh order      -39,700          -1.668 × 10^10
Eighth order       -159,000         -3.798 × 10^11
Fifteenth order    4.355 × 10^10    5.756 × 10^22
Twentieth order    -5.857 × 10^13   -4.471 × 10^29
Logistic           27.86            28.42
I hope you are saying wow. These equations all looked kind of similar from t = 0 to t = 50. Even at t = 100, we have mega-ginormous disagreements on what the energy usage will be: anywhere from the silly value of -6 × 10^13 up to the tremendous value of 4 × 10^10. I think I can safely say that I have made my point. The underlying choice of model might not matter much for interpolation, but for extrapolation, the choice of model can change an estimate by many orders of magnitude.
Actually, the only model that did not give huge answers for the 500 year estimate is the logistic model, the mathematical equation that was designed to model constrained growth. Hmmm…
Conclusion
Considerable effort usually goes into finding an equation that fits the existing data. Often, a variety of equations are tried and the one that best fits the existing data is chosen.
This typical process omits a crucial step. That step is getting to know the data, understanding the natural constraints, and looking at the forces that drive the values up or down. This knowledge should drive the choice of mathematical model.



[1] Little bit of trivia here... the year before 1 AD was 1 BC, rather than 0. The correct difference between these two years is thus 1600, rather than 1601.
[2] I assume that there are an equal number of men and women. If there is only a single, very busy man, the women can be spared having to have twins.