Wednesday, August 29, 2012

People do not make good statisticians


The lottery and gambling

The United States government has come to the realization that Japan is leading us in mathematical literacy. The government's approach to this, as with cigarettes and alcohol, is to attempt to change our behavior by putting a tax on what they don't like, in this case mathematical illiteracy. They call this tax the lottery.
Paraphrase of comedian Emo Philips
Every American should learn enough statistics to realize that "One-in-25,000,000" is so close to "ZERO-in-25,000,000" that not buying a lottery ticket gives you almost virtually the same chance of winning as when you do buy one!
Mike Snider in MAD magazine, Super Special December 1995, p.48

I was in college when McDonald's started their sweepstakes. Finding the correct gamepiece was going to make someone a millionaire. I had a friend named Peter[1] with a hunch. He was going to win.
I was, on the other hand, a math major. I considered my odds of being that one person in the United States who would be made incomprehensibly rich. There were a hundred million people trying to find that one lucky gamepiece. My chances were one in one hundred million of winning a million dollars. In my book, my long-run expectation was of my winning about a penny. Despite the fact that I was a poor student, scrounging to find tuition and rent, the prospect of winning (on average) one cent did not excite me. I was not about to go out of my way to earn this penny.
I was familiar with the Reader's Digest Sweepstakes. I had sat down and calculated the expected winnings in the sweepstakes. I expected to win something less than the price of the postage stamp I would need to invest in order to submit my entry, so I chose not to enter.
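The arithmetic behind both decisions is a one-line expected-value calculation. Here is a minimal sketch in Python. The function name and the `entry_cost` parameter are just illustrative labels, and the only figures used are the ones quoted above for the gamepiece.

```python
def expected_net_winnings(prize, chance_of_winning, entry_cost=0.0):
    """Long-run average gain per entry: prize times probability, minus cost."""
    return prize * chance_of_winning - entry_cost


# Gamepiece figures from the text: a $1,000,000 prize, 1 chance in 100,000,000.
gamepiece = expected_net_winnings(1_000_000, 1 / 100_000_000)
print(f"Expected winnings per gamepiece: ${gamepiece:.2f}")
```

Any entry cost larger than a penny (a stamp, say, or a soda bought only to get the gamepiece) pushes the result below zero.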
Peter was not a math major. Peter knew that if he was to win, he needed to put forth effort to appease the goddess Tyche[2]. Whenever we went out, whenever we passed the golden arches, he took us through the drive-through to pick up a gamepiece. Since future millionaires should not look like tightwads, he would order a little something. He would buy a soda and maybe an order of fries.
I took Peter to task for his silly behavior. I explained to him calmly the fundamentals of probability and expectation. I explained to him excitedly that he was being manipulated, being duped into spending much more money at McDonald's than he would have normally. He told me that he would laugh when he received his one million dollars.
Did he win? No. In college, I took this as vindication that I was right. This event validated for me a pet theory: people are not good statisticians[3].
Our state (Wisconsin) has instituted Emo Philips' tax on mathematical illiteracy. By not participating, I am a winner in the lottery. Profits from the lottery go to offset my property taxes. It is with mixed emotion that I receive this rebate each year. Like anyone else, I appreciate saving money. I even take a small amount of smug satisfaction that I win several hundred dollars a year from the lottery, and I have never purchased a lottery ticket. And I have made this money from people like Peter, who do not understand statistics.

One newsclip caught me in my smugness. The report characterized the typical buyer of a lotto ticket as surviving somewhere near the poverty level. I believe that we all have a right to decide where to spend our money. I don't think that the government should only sell lotto tickets to people who can prove that their income is above a certain level. I am, however, troubled by the image of my taxes being subsidized by an old woman who is just barely scratching out a living on a pension.
This image was enough for me to reconsider my mandate that people should base all their decisions on rational enumeration of the possible outcomes, assignation of probabilities, and computation of the expectation. What if I were the pensioner who never had enough money to buy a balanced diet after rent was paid? In the words of the song, "If you ain't got nothin', you got nothin' to lose." Is the pensioner buying a lottery ticket because he or she is not capable of rationally considering the options? Or are all options "bad", so that the remote chance of making things significantly different is worth the risk? Not too long ago, I would have blamed the popularity of the lottery on mathematical illiteracy. Today I am not so sure.

Psychological perspective

Where observation is concerned, chance favors only the prepared mind.
Louis Pasteur
Aristotle maintained that women have fewer teeth than men; although he was twice married, it never occurred to him to verify this statement by examining his wives' mouths.
Bertrand Russell, The Impact of Science on Society
As engineers and scientists, we like to consider ourselves to be unbiased in our observations of the world. Worchel and Cooper (authors of the psychology text I learned from) lend support for our self-evaluation:
[Studies] demonstrate that if people are given the relevant information, they are capable of combining it in a logical way...
If we read on, we are given a different perspective of the ability of the human brain to tabulate statistics:
But will they? ... We know from studies of memory processes and related cognitive phenomena that information is not always processed in a way that gives each bit of information equal access and usefulness.
Worchel and Cooper go on to describe experimental evidence of people not weighing all data equally. Furthermore, we tend to be biased in our judgment of an event when we are involved in that event; our placement of blame in an accident depends on the extent of damages; and we generally weight a person's behavior more heavily than the particular situation the person is in.

The primacy effect

Several other rules can be invoked to explain our faulty data collection. The first rule to explain what information is retained is the primacy effect. This states that the initial items are more likely to be remembered. This fits well with folklore like, "You never get a second chance to make a first impression," and "It is important to get off on the right foot." Statistically speaking, the primacy effect can be thought of as applying a higher weighting on the first few data points.
In one experiment on the primacy effect, the subject is shown a picture of a person, and is given a list of adjectives describing this person. The order of the adjectives is changed for different subjects. After seeing the picture and word list, the subject is asked to describe the person. The subject's description most often agrees with the first few adjectives on the list.

The recency effect

The second rule to explain memory retention is the recency effect. This states that, for example, the last items on a list of words (the most recently seen items) are also more likely than average to be remembered. In other words, the most recent data points are also more heavily weighted than average. As an example of this, I remember what I had for lunch today, but I can barely remember what I had the day before. If my doctor were to ask me what I normally had for lunch, would my statistics be reliable?

The novelty effect


The third rule states that items or events which are very unusual are apt to be remembered. This is the novelty effect. I once had the pleasure to work in a group with a gentleman who stood 6'5". When he was standing with some other team members who were just over six foot, a remark was made that we certainly had a tall team. In going over the members of this team, I recall four men who were 6'2" or taller. But I also remember a dozen who were an average height of 5'8" to 6', and I recall two others who were around 5'4". The novelty of a man who was seven inches above average, and the image of him standing with other tall men, was enough to substitute for good statistics.
I recall one incident where a group of engineers was just beginning to get an instrument close to specified performance. The first time the instrument performed within spec, we joked that this performance is "typical". The second time the instrument performed within spec (with many trials in between), we upgraded the level of performance to "repeatable". The underlying truth of this joking was the tendency for all of us to only remember those occasions of extremely good performance.

The paradigm effect

A fourth rule which stands as a gatekeeper on our memory is the paradigm effect. This states that we tend to form opinions based on initial data, and that these opinions filter further data which we take in. An example of the paradigm effect will be familiar to anyone who has struggled to debug a computer program, only to realize (after reading through the code countless times) the mistake is a simple typographical error. The brain has a paradigm of what the code is supposed to do. Each time the code is read, the brain will filter the data which comes in (that is, filter the source code) according to the paradigm. If the paradigm says that the index variable is initialized at the beginning, or that a specific line does not have a semi-colon at the end, then it is very difficult to "see" anything else.

The paradigm effect is more pervasive than any objective researcher is willing to admit. I have found myself guilty of paradigms in data collection. I start an experiment with an expectation of what to see. If the experiment delivers this, I record the results and carry on with the next experiment. If the experiment fails to deliver what I expect, then I recheck the apparatus, repeat the calibration, double check my steps, etc. I have tacitly assumed that results falling outside my paradigm must be mistakes, and that data which fits my paradigm is correct. As a result, data which challenges my paradigm is less likely to be admitted for serious analysis.
An engineer by the name of Harold[4] had built up some paradigms about the lottery. He showed me that he had recorded the past few months of lottery numbers in his computer. He showed me that three successive lottery numbers had a pattern. When he noticed this, Harold bought lots of lottery tickets. The pattern unfortunately did not continue into the fourth set of lottery numbers. As Harold explained it to me, "The folks at the lottery noticed the pattern and fixed it."
Harold's paradigm was that there were patterns in the random numbers selected by lottery machines. Harold had two choices when confronted with a pattern which did not continue long enough for him to get rich. He could assume that the pattern was just a coincidence, or he could find an explanation why the pattern changed. In keeping true to his paradigm, Harold chose the latter. When he explained this to me, I realized that it was fruitless to try to argue him out of something he knew to be true. I commented that the folks at the lotto had bigger and faster computers than Harold, just so they could keep ahead of him.
As another example of the paradigm effect, consider an engineer named William[5]. William was a heavy smoker and had his first heart attack in his mid-forties. He was asked once why he kept smoking, when the statistics were so overwhelming that continuing to smoke would kill him. William replied that his heart attack was due to stress. Smoking was his way of dealing with stress. To deprive himself of this stress relief would surely kill him. Furthermore, stopping smoking is stressful in and of itself.
William's paradigm was that he was a smoker. No amount of evidence could convince him that this was a bad idea. Evidently the paradigm is quite strong. In a recent study, roughly half of bypass patients continue to smoke after the surgery. William had six more heart attacks and died after his third stroke.

The primacy effect and the paradigm effect working together

The primacy effect and the paradigm effect often work together to make us all too willing to settle for inadequate data. My own observation is that people often settle for a few data points, and are often surprised to find out how shaky their observation is, statistically speaking.
A case in point is my belief that young boys are more aggressive than young girls. The first young girls I had opportunity to closely observe were my own two daughters, who I would not call aggressive. The first young boy I observed in any detail was the neighbor's, who I would call aggressive.  My conclusion is that young boys are aggressive, and young girls are not.
Note that, if three people are picked at random, it is not terribly unlikely that the first person chosen is aggressive, and the other two are not. In other words, I have no need to appeal to a correlation between gender and aggressiveness to explain the data. The simple explanation of chance would suffice.
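To put a rough number on "not terribly unlikely": suppose, purely as an assumption for illustration, that any one child has probability p of being aggressive, regardless of gender. The short Python sketch below (the function name and the values of p are hypothetical) gives the chance that the first of three children is aggressive and the other two are not.

```python
def chance_of_my_sample(p_aggressive):
    """P(first child aggressive, next two not), with gender playing no role."""
    return p_aggressive * (1 - p_aggressive) ** 2


# Hypothetical values of p; none of these come from real data.
for p in (0.25, 0.5):
    print(f"p = {p}: chance = {chance_of_my_sample(p):.3f}")
```

With p = 0.5 this works out to one chance in eight, which is hardly rare enough to call for an explanation beyond luck.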
The primacy effect says that these three children were the most influential in shaping my initial beliefs. The paradigm effect says that the future data which I "record" will be the data which supports my initial paradigm.
In terms of evolution, one would be tempted to state that an animal with poor statistical abilities would not be as successful as an animal which was capable of more accurate statistical analysis. Surely the hypothetical Homo statistiens would be able to more accurately assess the odds of finding food or avoiding predators.
Consider the hypothetical Homo statistiens' first encounter with a saber-toothed tiger. Assume that he/she was lucky enough to survive the encounter. On the second encounter, Homo statistiens would reason that not enough statistics were collected to determine whether saber-toothed tigers were dangerous. Any good statistician knows better than to draw any conclusions from the first data point. Clearly, there is an evolutionary advantage to Homo sapiens, who jumps to conclusions after the first saber-toothed tiger encounter.

In the words of Desmond Morris,
Traumas... show clearly that the human animal is capable of a rather special kind of learning, a kind that is incredibly rapid, difficult to modify, extremely long-lasting and requires no practice to keep perfect.

The effect of peer pressure


When Richard Feynman was investigating the Challenger disaster, he uncovered another fine example of how poor people are at statistics. He was reading reports and asking questions about the reliability of various components of the Challenger, and found some wild discrepancies in the estimated probabilities of failure. In one meeting at NASA, Feynman asked the three engineers and one manager who were present to write down on a piece of paper the probability of the engine failing. They were not to confer, or to let the others see their estimates. The three engineers gave answers in the range of 1 in 200 to 1 in 300. The manager gave an estimate of 1 in 100,000.
This anecdote illustrates the wide gap in judgment which Feynman found between management and engineers. Which estimate is more reasonable? Feynman dug quite deeply into this question. He talked to people with much experience launching unmanned spacecraft. He reviewed reports which analytically assessed the probability of failure based on the probability of failure of each of the subcomponents, and of each of the subcomponents of the subcomponents, and so on. He concludes:
If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of the originally conservative certification criteria designed to guarantee a very safe vehicle... The shuttle therefore flies in a relatively unsafe condition, with a chance of failure on the order of a percent.
On the other hand, Feynman is particularly candid about the "official" probability of failure:
If a guy tells me the probability of failure is 1 in 10^5, I know he's full of crap.
How can it be that the bureaucratic estimate of failure disagrees so sharply with the engineers' more reasonable estimate? Feynman speculates that the reason for this is that these estimates need to be very small in order to ensure continued funding. Would Congress be willing to invest billions of dollars on a program with a one in a hundred chance of failure? As a result, much lower probabilities are specified, and calculations are made to justify that this level of safety can be reached.
I am reminded of another experiment, which was devised by the psychologist Solomon Asch in 1951. In this experiment, the subject was told that this was an experiment investigating perception. The subject was to sit among four other "subjects", who were actually confederates. The "subjects" were shown a set of lines on a piece of paper (for example) and were asked to state out loud which line was longest. The confederates were called on first, one at a time. They were instructed to give obviously incorrect answers in 12 of 18 trials, but they were to all agree on the incorrect answer.
It was found that 75% of subjects caved in to peer pressure, and agreed with the obviously incorrect answers. When asked about their answers later, away from the immediate effects of peer pressure, the subjects held to their original answers, incorrect or not. As far as can be measured with this psychological experiment, the subjects came to believe that a two inch long line was shorter than a one inch long line.
So it is with NASA's reliability data. The data may never have had any shred of credence whatsoever, but simply by repeating "1 in 100,000" often enough, it became truth.
I have included Feynman's example not to put down NASA, or promote the ever-popular game of "manager bashing", but to illustrate this ever-so-human trait that we are all prone to. We believe what others believe, and we believe what we would like to be true.

Summary

The effects mentioned here together support the statement that people do not make good statisticians. The point which is made is not that "mathematically inept people are poor statisticians", or that "people are incapable of performing good statistics". The point is that the natural tendency is for people to not be good at objectively analyzing data. This goes for high school drop-outs as well as engineers, scientists and managers. In order for people to produce good statistics, they need to rely not on their memory and intuition, but on paper and statistical calculations.

Bibliography

Feynman, Richard P., What do you care what other people think?, 1988, Penguin Books Canada Ltd.
Flanagan, Dennis, Flanagan's Version, 1989, Random House
Krech, David, Crutchfield, Richard S., Livson, Norman, Elements of Psychology, Third Edition, 1974, Alfred Knopf, Inc.
Morris, Desmond, The Human Zoo, 1969, McGraw Hill
Worchel, Stephen, and Cooper, Joel, Understanding Social Psychology, revised edition 1979, Dorsey Press



[1] Not his real name.
[2] Tyche was the Greek goddess of luck.
[3] I met one person whose behavior indicated that he was not a good statistician, therefore all people are not good statisticians... [This demonstrates that I am not a good statistician, since I am content with a sample size of one. There are, therefore, two people who are not good statisticians, and this further proves my point.]
[4] Not his real name, either.
[5] You guessed it. Not his real name.

Wednesday, August 22, 2012

One Beer's law too many

Some people may think that Beer’s law has to do with underage drinking, and that August Beer is what comes before OktoberFest. Beer’s law is, however, one of the coolest laws of photometry, and August Beer is the guy who it is named after. (For a complete discussion of how it got that name, skip to the end of this blog post.)

This blog post is a re-enactment of a seminal experiment that a preeminent researcher reported on back in 1995. This phenomenal scientist has had such a profound influence on the worlds of printing and colorimetry, that I am tirelessly committed to the promulgation of his work. I am speaking, of course, about myself.  
Experimental setup
The picture below details the equipment to be used in this experiment. At left is a constant current power supply, which provides power for the blue Luxeon LED. This LED shines into the optical assembly, which is supported by one of the biggest books I have on color science. At the far right is the sensor for an expensive light meter, with the control unit shown on the expensive black carpet. The observant reader will no doubt be impressed by the huge expense that I must have gone through to dig this pile of junk out of my basement.
Expensive equipment used for this experiment
The lights were turned down and the system calibrated so that the light meter read 100.0 banana units when there was nothing between the light source and the detector, as shown below.
Expensive optical stuff, bored, with nothing to read
Now the party begins. I cracked open a cold one and set it in the beam. Note that the reading has dropped to 90.0 banana units, indicating that 10.0 banana units of light got caught by the amber fluid and never quite made it home. I can definitely identify with these photons.
Same set up, but with one sample cell
As they say, you can’t milk a camel while standing on one leg, so let’s order another one. But before it gets set down on the bar, let’s take a guess at what the light meter will read. Hmmm…. The first sample dropped it by 10.0, so it would make sense that the second one would do the same. My guess at the results: 80.0.
Same set up, only this time with two samples
For those of you who agreed with my guess, it was commendable, but wrong. There was indeed a pattern established, but not the one you were thinking of. Why did it go down to 81.0, instead of 80.0? For every 100 photons that entered the first sample, 10 of them were absorbed, and 90 were transmitted on to the second sample. Upon reaching the second sample, the same probabilities apply. Of the 90 photons that made it to the second sample, 90% of those made it out, so that there were 81.
You now know Beer’s law.
But just to make sure the concepts are all down, let’s take this one step further. How about three samples? 90% X 90% X 90% = 72.9%, as verified by the highly sensitive experimental set up below.
Results for three samples
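For those who would rather not drink their way through the arithmetic, here is a small Python sketch of the same calculation. It assumes every sample passes the same fraction of the light that reaches it (0.9, taken from the single-sample reading above). The function names are simply illustrative.

```python
import math


def reading_after(n_samples, transmittance=0.9, empty_reading=100.0):
    """Meter reading once the beam has passed through n identical samples."""
    return empty_reading * transmittance ** n_samples


def optical_density(n_samples, transmittance=0.9):
    """Optical density, -log10 of the transmitted fraction; it adds per sample."""
    return n_samples * (-math.log10(transmittance))


for n in range(4):
    print(n, round(reading_after(n), 1), round(optical_density(n), 4))
```

The readings fall geometrically (100, 90, 81, 72.9) while the densities climb in equal steps, which is the "densities add" way of stating the law.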
One last thing… Can you guess what kind of beer was used?
Miller Lite – the official beer of color scientists everywhere
Disclaimers – Do not try this experiment at home. I am a trained professional. The mixing of beer and scientific equipment is not recommended. No beer was wasted in the photoshoot for this blog. I cannot say the same for the scientist who performed the experiment.
Who invented Beer’s law, anyway?
Some folks may have just assumed that Beer’s law was named after William Gosset, who was a pioneer in statistics, and worked for Guinness. That would be a good guess, since he was a smart guy. It would have been just like him to have a really cool law of physics named after him, since he invented the t test, which was named after Student, which was actually his pen name. But that’s another interesting story.
The guess is unfortunately wrong, since Beer’s law was named after August Beer. This is yet another in my series of mathematical misnomers.
This law of physics was first discovered by the father of photometry, Pierre Bouguer, in 1729. August Beer didn’t discover this law until over a century later in 1852. Beer worked with Johann Heinrich Lambert on a book (“Introduction to the Higher Optical”) that was published in 1860. So naturally, the law has become known as “Beer’s law”, “Beer-Lambert law”, “Beer-Lambert-Bouguer law”, “Lambert-Bouguer law”, “Lambert’s law”, and “Bob”.
Why is it known in the printing industry as “Beer’s law”? There are two key influences that led to this egregious misnomer. The first was a landmark 1967 book by J.A.C. Yule, “Principles of Color Reproduction”. Any book with the word “reproduction” in the title is apt to move quickly. I just checked with Amazon. They only have two copies left.
The second thing that probably had an even greater effect was the frequent use of the eponym by the eminent applied mathematician, color scientist, mathematics historian, and all around nice looking guy, John “the Math Guy” Seymour. He has made no bones about why he decided on this name among all the potential candidates. I quote here from his paper delivered at the 2007 Technical Association of the Graphic Arts:
Since there seems to be little agreement about who is responsible for which law, I have chosen to refer to the statement that optical densities of filters add as Beer’s law. My decision is not based on historical evidence, but on the gedanken I introduced in a paper given at IS&T (Seymour, 1995). In this, I demonstrated the law by using a varying number of mugs filled with beer. My hope is that my further corruption of already corrupt historical fact will help remember the law!
Brilliant words by a brilliant man, indeed.

Wednesday, August 15, 2012

Deviant ways to compute the deviation

I once found a bug in Excel. This was a long time ago, and they have since fixed it. I also found that same bug in every calculator I could find. I found this ubiquitous bug because I was looking for it. You see, I had found that bug in my own software. Somehow, finding that bug elsewhere made me feel better.
I was computing the standard deviation of all the pixels in an image, so there were umpty-leven thousand data points in the computation[1]. I used the recommended method, which seemed like a good thing. Boy! Was I ever wrong! All of a sudden, I found I was asking the computer to find the square root of a negative number. And it wasn’t liking me.

No imagination allowed
I found the cause of the error, and realized it was bum advice from my college statistics book. Dumb old college, anyway!
Computation of standard deviation
There are two popular formulae for the computation of standard deviation. The first formula is the definition, given below. This formula is moderately understandable and is invariably the first formula given.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Another formula often follows immediately in the footsteps of the first:
$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}{n - 1}}$$
Why this formula should be a measure of deviation from the mean is not immediately obvious, but the leading statistics texts generally devote some ink to showing that the two formulae are equivalent[2]. They will then point out that the second formula is more convenient to use, and requires fewer calculations. This rephrasing of the original equation seems to be the formula of choice.
I took a quick poll of the introductory statistics books at hand. The following eight books were sampled. (If a page number is given in parentheses, this is the page where the book introduces the second form of the standard deviation formula.)
    Bevington (p. 14)
    Freedman, Pisani and Purves (p. 65)
    Langley (p. 58)
    Mandel (not given)
    Mendenhall (p. 39)
    Meyer (not given)
    Miller and Freund (p. 156-7)
    Snedecor and Cochran (p. 32-3)
Two of the texts are clearly in love with this second formula.
[The second formula] “has the dual advantage of requiring less labor, and giving better accuracy.”
Miller, et al.
[The second formula] “tends to give better computational accuracy than the method utilizing the deviations.”
Mendenhall
There seems to be a strong consensus here. This would seem to be fact then, right?
Difficulty with standard formula
The second formula is often chosen because it requires only a single pass through the data, without having to remember the whole array. As you go through the data points, you sum the values (in order to compute the average), and you sum the squares of the values, and you count the number of data points. These three numbers plug into the second formula to give the standard deviation.
On a calculator, this is a very practical concern, especially back in my day when we used to have to make our own calculators with paper clips, Elmer’s Glue, and used Popsicle sticks. It is (or was) considered a luxury to be able to store more than a few numbers in the memory. The second formula manages to get by with storing only three numbers.
So, this would seem to be a recipe for paradise. Six out of eight stats textbooks recommend brushing with the second formula. On top of that, the second formula seems tailor made for a calculator with limited memory.
But, alas, there is trouble in paradise.
I reproduce here a quote from a book which I have found to be very practical-minded, and which has agreed with me every time that I have felt qualified to have an opinion.
[The second formula] “is generally unjustifiable in terms of computing speed and / or roundoff error.”
Press, et al., p. 458
Even more to the point is the following quote
“Novice programmers who calculate the standard deviation of some observations by using the [second] formula...often find themselves taking the square root of a negative number! A much better way to calculate means and standard deviations is to use the recurrence formulas...”
Knuth[3], p. 216
I believe we have some disagreement here. These two texts are clearly in the minority. But since neither one of these books is actually a statistics book, maybe we should just ignore them? Except for the fact that the minority is right. I became a believer because I was once a novice programmer who found himself taking the square root of a negative number[4].
The little difficulty of the second formula comes from a recurring issue that causes numerical analysts to wake up in the middle of the night in a cold sweat: the loss of precision caused by subtracting two numbers that are close in size. If the average of the data is quite large compared to the standard deviation, or if n is quite large, then the second formula will require the difference to be computed between two large and nearly equal numbers. You need a whole lot of precision to pull this off.
To illustrate, consider the following data set of three points: (1,000,001, 1,000,002, and 1,000,003). The average of the data is obviously 1,000,002. We can easily compute the correct standard deviation from the first equation:
$$s = \sqrt{\frac{(1000001 - 1000002)^2 + (1000002 - 1000002)^2 + (1000003 - 1000002)^2}{3 - 1}} = \sqrt{\frac{1 + 0 + 1}{2}} = 1$$
Using the second equation for computing the data set, we first compute the sum of the squares of the data: 3,000,012,000,014. (Gosh, that’s a big number!) Then we compute n times (the average of the data squared): 3,000,012,000,012.
We next take the difference between these two numbers. Provided we have at least 13 digits of precision, we get the correct difference of 2, and subsequently the correct standard deviation of 1. But with fewer than 13 digits of precision, the difference could be negative or positive, but probably isn’t 2. This is what Press et al. and Knuth were warning of.
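The 13-digit threshold can be checked directly. The sketch below uses Python's decimal module to mimic a machine that keeps only a fixed number of significant digits after every operation. The helper name is hypothetical, and a real calculator would round in binary or use its own internal format, so treat this only as an illustration of the effect.

```python
from decimal import Decimal, localcontext


def shortcut_numerator(data, digits):
    """Compute sum(x^2) - n * mean^2, rounding every step to `digits` digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # significant decimal digits kept after each operation
        values = [Decimal(x) for x in data]
        n = len(values)
        mean = sum(values, Decimal(0)) / n
        sum_of_squares = sum((v * v for v in values), Decimal(0))
        return sum_of_squares - n * (mean * mean)


data = [1_000_001, 1_000_002, 1_000_003]
for digits in (13, 12):
    print(digits, "digits:", shortcut_numerator(data, digits))
```

With 13 digits the difference comes out as 2. With 12 it does not, even though every individual step was rounded correctly.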
Two pass algorithm
The most obvious way to avoid this bug is to go back to the original formula. This algorithm requires two passes through the data: the first pass to compute the mean, and the second pass to compute the variance.
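A minimal Python sketch of the two pass approach follows, assuming the sample (n - 1) form of the standard deviation. The function name is only illustrative.

```python
import math


def two_pass_std(data):
    """Sample standard deviation via the definition: mean first, then deviations."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(data) / n                              # first pass
    sum_sq_dev = sum((x - mean) ** 2 for x in data)   # second pass
    return math.sqrt(sum_sq_dev / (n - 1))


print(two_pass_std([1_000_001, 1_000_002, 1_000_003]))
```

Because the mean is subtracted before anything is squared, the numbers being summed stay small, and the troublesome subtraction of two huge, nearly equal totals never happens.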
This method has the disadvantage of requiring the data to be retained for a second pass. You just can’t do that in a calculator with limited memory. Knuth (p. 216) suggested a different option.
The recursive algorithm
Now we are ready for the fun part. Let’s say that we have the average of 171 numbers, and we want to average one more number in. Do we have to start over at the beginning and add up all 172 numbers? Or is there some shortcut? Is there ever an end to rhetorical questions?
The answers are no, yes, and “why do you ask so many questions?”
The really fun part is that the sum of the first 171 numbers is just 171 times the average. To get the average of the 172 numbers, you multiply the average of the first 171 numbers by 171. Then you add the 172nd number, and divide by 172. Diabolically clever, eh?
Put in algebraic form, if the average of the first j numbers is m_j,
$$m_j = \frac{1}{j}\sum_{i=1}^{j} x_i$$
then the average of the first j+1 numbers is given by this formula
$$m_{j+1} = \frac{j\,m_j + x_{j+1}}{j + 1}$$
with m_1 = x_1.
This is known as a recursive formula, one that builds on previous calculations rather than one that starts from scratch. The recursive formula for the average is pretty simple to understand intuitively. There is a bit more algebra in between, but it is possible to calculate the standard deviation in a recursive manner as well. If you know the standard deviation and mean of the first 171 data points, you can calculate the standard deviation of the set of 172 data points from this.
Here is the recursive formula for the variance, which is the square of the standard deviation:
with
This is the formula for the variance. The standard deviation is the square root of this.
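For the curious, here is a sketch of one common form of the recursion, usually credited to Welford and given in Knuth's book. It may be arranged differently from the formula above, and the names mean, s, and recursive_std, along with the n - 1 divisor, are assumptions for illustration.

```python
import math

def recursive_std(values):
    """Sample standard deviation in a single pass (Welford-style update)."""
    count = 0
    mean = 0.0
    s = 0.0   # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean               # deviation from the old mean
        mean += delta / count          # recursive update of the mean
        s += delta * (x - mean)        # old deviation times new deviation
    if count < 2:
        return None                    # not enough data for a spread
    return math.sqrt(s / (count - 1))

print(recursive_std([999999, 1000000, 1000001]))
```

Each step works only with small deviations from the running mean, so it never subtracts one big number from another, and it never needs to see a data point twice.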
Summary
This has been a little lesson in numerical analysis called “don’t subtract one big number from another”. And a little lesson in “don’t believe everything you read”. And lastly, a lesson in “recursion can be your friend”.
References
Bevington, Philip, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, 1969
Freedman, Pisani and Purves, Statistics, W.W. Norton & Co., 1978
Hamming, Richard, Numerical Methods for Scientists and Engineers, McGraw-Hill, 1962
Knuth, Donald E., The Art of Computer Programming, Vol. 2, Seminumerical Algorithms, 2nd edition, Addison-Wesley, 1981
Langley, Russell, Practical Statistics Simply Explained, Dover, 1971
Mandel, John, The Statistical Analysis of Experimental Data, Dover, 1964
Mendenhall, William, Introduction to Probability and Statistics, 4th ed., Duxbury Press, 1975
Meyer, Stuart, Data Analysis for Scientists and Engineers, Wiley and Sons, 1975
Miller, Irwin, and Freund, John, Probability and Statistics for Engineers, 3rd ed., Prentice Hall, 1985
Press, William, Flannery, Brian, Teukolsky, Saul, and Vetterling, William, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 1986
Snedecor, George, and Cochran, William, Statistical Methods, 7th ed., Iowa State University Press, 1980
1) To be honest, these were the days when a big image was 640 by 480, so there weren’t quite that many data points.
2) Since the derivation is so widely available, it will not be reproduced here. Suffice it to say that the squared quantity in the first formula is expanded, and the summation is broken into three summations, some of which can be simplified by replacement with the definition of the average.
3) Donald Knuth was born, by the way, in Milwaukee, WI. I currently live in Milwaukee. His father ran a printing business in Milwaukee. I work for a little mom and pop printing company in Milwaukee. Need I go on?
4) I was once a novice programmer, but I don't spend much time programming anymore. I remain, however, a novice!

Wednesday, August 8, 2012

The laws of planned spontaneity


The phrase “planned spontaneity” sounds at first like an oxymoron. “Planned” and “spontaneous” are polar opposites. I have coined this phrase in an effort to understand why I have seen spontaneity sometimes work and sometimes fail. A consideration of the following vignettes from my life, and an analysis of them, lead to some suggestions for making effective use of spontaneity.

My brief notoriety as a chess player

When I was in Junior High, I experienced some success as a chess player. I joined chess club, and played chess whenever I could. Over a period of two years, I kept track of my wins and losses, and found I won two out of every three games.
In our chess club was a fellow by the name of David H. He was the undisputed master of the chessboard. His father had been grand master for the state of Wisconsin. David had been taught chess in the cradle. He was a walking encyclopedia of chess games. One tactic he used to unnerve his opponents was to reply to your move by saying something like, “Ah, yes! Spassky vs. Hilmer, 1963.” I suspected he made it all up.
Another tactic he would use would be to say “mate in six” right after his move. That meant that in six moves he would checkmate you. Against most players, his predictions were accurate. Against me, his predictions were frequently wrong. That is not to say that I was a better player than he, or than his other competitors. He eventually prevailed in almost every game against me, and my record against others was not spectacular.
My brief notoriety came when I actually won against David H. I was the only person in the history of our chess club to checkmate him.
David was showing off. He was playing against four other people on four other boards. I caught him in “Fool's Mate”. Fool's Mate is a short sequence of moves at the beginning of the game which is easily defended against if noticed, and easily noticed if you are watching. No self-respecting chess master would think of using Fool's Mate. David was certainly not expecting me to pull it on him.
David and I had completely different chess styles. David carefully planned out his strategy. He considered not only his own moves many moves in advance, but considered counter-moves that I might make. He sporadically spent long periods of time staring at the board. I played chess by the seat of my pants. I concentrated on the strategy of getting my pieces out where I might need them. I had a simple approach: I developed key pieces, and I looked for serendipitous opportunities.
I have no delusions about being as good a chess player as David H. I was completely out of his league. I realize that I could not possibly hope to compete with him without radically changing my strategy. Here we come to law #1: In a controlled environment (such as a chess game), there is no strategy as good as careful planning.
Even though David was far more skilled than I, I was often able to temporarily throw him for a loop by doing something unexpected. Because of my unorthodox strategy, I frequently loused up his careful plans by doing something out of the blue. In a tightly structured game like chess, this unorthodox approach cannot be successful for long. But as a game becomes more random, planning is more often thwarted, and the player who keeps an eye open for opportunities is rewarded. Law #1a: In an unpredictable environment, careful planning is not nearly as good as spontaneity.

My favorite hobby: cooking

When my first wife (not her real name) and I were married, I did all the cooking, and initially we did grocery shopping together. She (always the practical one) found this very stressful, since I purchased a lot of impulse items. This was great for me, because I truly enjoyed cooking, and we always had a good supply of weird things to cook. The only drawback for me was that I would often need to make a trip to the store while I was preparing dinner to get one little thing I forgot. My wife's sensibleness, and our budget, soon prevailed and my wife went shopping by herself.
I think I still have a jar of pickled okra which I purchased on impulse over ten years ago. This jar stands as a testament to law #2: Pure spontaneity does not work. Not only do you spend money on things that aren't used, but you often miss some of the things you need.
The ensuing period in our marriage was a gastronomical low. My wife did not enjoy cooking, and seldom cooked. Because of this, she was ill-suited to predict what I might need in making supper. We ate a lot of boring meals. I found that preparing Swanson TV dinners did little to sate my appetite for creativity.
Eventually, we came to a compromise in the form of a grocery list. When I spontaneously came up with an idea for a meal, or used up the last of the mushrooms, all I needed to do was write down what I wanted on a sheet of paper. So long as I managed to get the item on the list before Saturday afternoon, it would appear in the refrigerator by Saturday evening. The idea allowed me to prepare the cupboards for spontaneous cooking, and it allowed my wife to shop frugally. The grocery list also served to filter out the more ludicrous impulse items, such as curried frog eyeballs. [If I didn't see them staring at me out of the jar on the store's shelf, I could spend weeks not thinking about curried frog eyeballs, and they would not make it on the list.]
For me, this was a lesson in planned spontaneity. I had learned that I could only practice spontaneous cooking if a small amount of planning went into stocking the shelves with the right building blocks. Spices, flour, tomato paste, and fresh vegetables are some of the right building blocks for spontaneous cooking. Chef Boyardee ravioli is not.
An important tenet of planned spontaneity is having the resources to be spontaneous. In cooking, this tenet translates to making sure there is a ready stock of staples. In chess, this tenet translates to putting chess-pieces where they might be useful. In war, this translates to Charlemagne's words, “It is smarter to be lucky than it's lucky to be smart.” Law #3: Luck is planned resourcing for spontaneity.

Fixing vacuum cleaners

When my first wife (still not her real name) and I moved (umpteen years ago, now), she carted the old vacuum cleaner to the curb for the garbage men to haul away. The hose had fallen apart, and it did not have a nifty beater bar like our new vacuum. She saw this as useless junk that we would have to both move and find space for in the new house. In her structured and efficient manner, she was making life easier for us.
I saw the vacuum cleaner as an important resource for spontaneity. I pulled it out of the pile of junk and loaded it into the moving van. In my wife's structured and efficient manner, she did not make an issue out of it.
It came to pass in our new house that the motor of our new vacuum cleaner burned out. In the true spirit of planned spontaneity, I saw this as an opportunity. I dug out the old vacuum cleaner with the broken hose. I found that the motors in the two units were identical, so I replaced the motor in our new vacuum cleaner. We got several more years' use out of the vacuum cleaner, and I dutifully saved the electrical cord and switch out of the old one, just in case I need them.
This story illustrates law #4: Structure and efficiency work to destroy the resources for spontaneity.

Review of the laws

In the interest of clarity, I have listed the laws of planned spontaneity below. Note that these are not laws in the “Thou shalt not”, or “Thou shalt” sense. The words “should” and “must” do not appear in the laws. These laws are observations about how things work. We cannot talk about breaking any of these laws any more than we can talk about breaking the law of gravity.
Law #1: In a controlled environment (such as a chess game), there is no strategy as good as careful planning.
Law #1a: In an unpredictable environment, careful planning is not nearly as good as spontaneity.
Law #2: Pure spontaneity does not work.
Law #3: Luck is planned resourcing for spontaneity.
Law #4: Structure and efficiency work to destroy the resources for spontaneity.

Wednesday, August 1, 2012

A witch by any other name


This function appeared as an example in a previous post on regression. Aside from that interesting bit of trivia, this function has an interesting history under a variety of aliases and misnomers. The function has surfaced in several different fields, apparently with little cross-pollination.
Witch of Agnesi
The function first occurred as a solution to a geometric puzzle.
The curve called the Witch of Agnesi is defined as a locus of points based on a circle. The figure below shows how this locus of points is derived. A line is drawn from a point A on the circle so that it intersects first the circle (at B), and then the tangent line (at C). The tangent line touches the circle at the point directly opposite A. A line perpendicular to the tangent line is drawn through C (line CD), and a line parallel to the tangent line is drawn through B (line BD). The intersection of these two lines is one point on the witch of Agnesi. The witch is made up of all such points D as created by all lines that go through A.

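If the circle has radius a, and point A is placed at the origin so that the tangent line is y = 2a, the curve has the formula
$$y = \frac{8a^3}{x^2 + 4a^2}$$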
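The construction can be checked numerically. The sketch below assumes a particular coordinate system (A at the origin, the circle centered at (0, a), the tangent line at y = 2a) and compares the constructed point D against the standard Cartesian form of the witch, y = 8a^3 / (x^2 + 4a^2). The function names are hypothetical.

```python
import math

def witch_point(theta, a):
    """Construct point D for the line through A (the origin) that makes
    an angle theta (between 0 and pi) with the x axis."""
    # B: where the line meets the circle again. Points on the line are
    # t*(cos theta, sin theta), and the circle is hit at t = 2a sin(theta).
    b_y = 2 * a * math.sin(theta) ** 2
    # C: where the line meets the tangent line y = 2a.
    c_x = 2 * a / math.tan(theta)
    # D shares its x coordinate with C and its y coordinate with B.
    return c_x, b_y

def witch_formula(x, a):
    """Standard Cartesian form of the witch for a circle of radius a."""
    return 8 * a ** 3 / (x ** 2 + 4 * a ** 2)

a = 1.5
for theta in (0.3, 1.0, math.pi / 2, 2.5):
    x, y = witch_point(theta, a)
    print(round(x, 4), round(y, 4), round(witch_formula(x, a), 4))
```

In each printed row the last two columns should agree, which is to say that every constructed point D lands on the curve.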

History of the witch
It has been variously reported that Fermat (1601 – 1665) had described this curve well before Agnesi [Maor, Osen, MacTutor on Agnesi, Stigler 1974]. Cajori, however, elaborates that what Fermat had worked on was a related curve:

Note that this function is shaped quite differently from the witch because of the minus sign in the denominator. Fermat’s function has two poles. Thus, there is some question about whether Fermat had actually worked on the witch.
Stigler [1974, and 1999] also reports that Newton had worked on this curve some time before 1718, but that this work was not published until 1779 (posthumously). Stigler does not identify the work, but it could have been “Geometrica Analytica”.
Stigler also added Leibniz and Huygens to the list of early investigators.
A mathematics professor at the University of Pisa by the name of Luigi Guido Grandi offered a construction of the curve in 1703 and 1710 [Cajori, MacTutor on Grandi and Agnesi]. Grandi referred to the curve as the versiera, from the Latin verb for “turn”.
Maria Agnesi is indirectly responsible for the popularization of the name “witch of Agnesi”. She wrote a very popular calculus textbook in 1748. The two volume set was a unified treatment of algebra and the fledgling subject of calculus. In this book, she referred to the curve as the versiera, as had Grandi.
The curve became known as a “witch” due to a mistranslation of Agnesi’s textbook. The British mathematician John Colson translated Agnesi’s work into English sometime before 1760, but this was not published until 1801 [MacTutor on Agnesi]. He learned Italian specifically for this task, so it is understandable that he made some translation errors. He mistook versiera for avversiera, which means “devil woman”, or “witch”. Somehow this mistranslation stuck, and the curve became known as the witch of Agnesi.
Thus, we see that the name “witch of Agnesi” is both a mistranslation and a misnomer. Not only is it not a witch, but it was not invented by Agnesi. It might be more appropriate to refer to it as the “curve of Grandi”.
There are two additional names given to this curve: Cubique d’Agnesi and Agnésienne [Smith, and Wolfram MathWorld].
In a text by Longchamps and subsequently by Basset, there is a description of a different, but similar, derivation of the witch of Agnesi. In this case, the definition is such that the curve lies entirely below the top of the circle. The subsequent equation

is of the same shape as the versiera as defined by Agnesi. Longchamps and Basset both refer to this as the witch of Agnesi. Loria disagrees with the name and states that this curve is not a versiera. He coins the term pseudo-versiera, and hence establishes another name for the function.
Cauchy distribution
The second appearance of this function is in the field of statistics, where it took on the name “Cauchy distribution”. The Cauchy distribution is a probability density function, similar to the Gaussian, or normal distribution. The formula for it, in its standard form, is
$$f(x) = \frac{1}{\pi (1 + x^2)}$$
The Cauchy distribution most commonly makes its appearance as an example of a pathological distribution. Despite its gross similarity to the normal curve (it has wider tails), it is as ill-behaved as Paris Hilton.
Technically speaking, it does not have a mean, since the integral used to compute the mean from a distribution is undefined. This, perhaps, is a technicality, since the distribution is symmetric about x = 0, so the mean could be defined as being 0.
More troublesome is the fact that the standard deviation of the distribution is infinite. Since the standard deviation of a distribution is a measure of its width, the Cauchy distribution paradoxically has an infinite width.
This pathological behavior of the Cauchy distribution makes it a wonderful example of when the central limit theorem does not apply. The central limit theorem states that the distribution of the sum of random numbers tends to look more and more like a Gaussian as more and more random numbers are added together. This applies for random numbers drawn from any distribution, provided that distribution has a finite, non-zero standard deviation.
On the other hand, if two samples from a Cauchy distribution are added, the distribution of the sum is another Cauchy distribution. It follows that the sum of an arbitrary number of Cauchy distributed variables also follows a Cauchy distribution.
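A small simulation makes the point. Because the sum of n standard Cauchy variables is a Cauchy variable n times as wide, the average of n of them is exactly as spread out as a single draw; averaging buys you nothing. The sketch below assumes the standard Cauchy distribution (quartiles at -1 and +1, so an interquartile range of 2), and the function names, trial counts, and seed are arbitrary choices for illustration.

```python
import math
import random

def cauchy_sample(rng):
    """One draw from the standard Cauchy distribution (inverse CDF method)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_averages(n, trials=10000, seed=0):
    """Interquartile range of the average of n Cauchy draws,
    estimated from `trials` repetitions."""
    rng = random.Random(seed)
    averages = sorted(
        sum(cauchy_sample(rng) for _ in range(n)) / n
        for _ in range(trials)
    )
    return averages[(3 * trials) // 4] - averages[trials // 4]

for n in (1, 10, 50):
    print(n, round(iqr_of_averages(n), 2))
```

For a well-behaved distribution the spread of the average would shrink like one over the square root of n. Here the interquartile range should hover near 2 no matter how many values are averaged.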
Interestingly enough, calling this curve the Cauchy distribution is yet another misnomer, as I have reported earlier. According to Stigler, Poisson had published a paper in 1824 where he described how this was an example of a distribution where the central limit theorem did not work. Cauchy did not work with the distribution until 1853. It would then be more accurate to refer to this as the Poisson distribution, but of course that name has already been taken.
Lorentzian
The third place where the witch raised her pointed little hat is in the field of physics. Maor makes the following comment about the witch:
It is somewhat of a mystery why this particular curve, which rarely shows up in applications, has interested mathematicians for so long.
He does comment in a footnote that the witch is identical to the Cauchy distribution.
Maor’s claim about the witch seems to also hold true for the Cauchy distribution. A book by Trivedi is a practical book on statistics. A quick look at the index under the heading “distribution” reveals 24 different distributions, but does not include the Cauchy distribution. It would seem that as a distribution, its only claim to fame is as an example of bad behavior.
But I disagree with Maor’s comment that the witch rarely shows up in applications. People who deal with spectroscopy are familiar with this curve as the Lorentzian.
The IR spectrum of a molecule is used by chemists as a fingerprint to identify and quantify a compound. Each of the bonds in a molecule has a specific resonance, generally in the infrared. Under ideal conditions, these resonances show up as narrow spikes. As the molecules of a rarefied gas come closer together (higher pressure), collisions between the molecules will compress the molecular bonds by varying amounts. In this way, the spectral spikes are broadened into what is called the Lorentzian.
This spectral shape is named after the Nobel Prize-winning Dutch physicist Hendrik Lorentz. The formula also shows up in scattering theory, where it has become known as the Breit-Wigner formula.
A physicist is likely to parameterize the Lorentzian as

Here, the curve is centered at x0 and the “full width at half max” is w. The distance between the half-way points on either side of a peak is a convenient measure of the width. It is all the more convenient, since the width of a Lorentzian cannot be measured by the standard deviation. Also, this measure can be readily estimated from a plot.
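As a sketch, here is one common way to write a Lorentzian with center x0 and full width at half max w. The normalization (unit area) is an assumption, since physicists also use versions scaled to unit height, and the function name is made up.

```python
import math

def lorentzian(x, x0, w):
    """Unit-area Lorentzian centered at x0 with full width at half max w."""
    half = w / 2.0
    return (half / math.pi) / ((x - x0) ** 2 + half ** 2)

x0, w = 10.0, 3.0
peak = lorentzian(x0, x0, w)
print(peak)
print(lorentzian(x0 - w / 2, x0, w) / peak)   # half-way point on the left
print(lorentzian(x0 + w / 2, x0, w) / peak)   # half-way point on the right
```

The two ratios come out to one half, confirming that the half-way points sit a distance w apart. This is the measurement you can read straight off a plot.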
Summary
Here is a list of the names given the curve:
    1) Versoria (Latin)
    2) Versiera (Italian)
    3) Witch of Agnesi
    4) Cubique d’Agnesi
    5) Agnésienne
    6) Pseudo-versiera
    7) Cauchy distribution
    8) Lorentzian
    9) Breit-Wigner formula
It is interesting that I have not found a single reference that mentions all three of the main names (Witch of Agnesi, Cauchy distribution, and Lorentzian). I have only found references that include any two of the three.
Bibliography
Basset, Alfred Barnard, An Elementary Treatise on Cubic and Quartic Curves, Cambridge, 1901
Boyer, Carl, A History of Mathematics, second edition, John Wiley, 1991
de Longchamps, M. G., Essai sur la geometrie de la regle et de l'equerre, Paris, 1890
Loria, Gino, Spezielle algebraische und transscendente ebene kurven, B. G. Teubner, 1902
Maor, Eli, Trigonometric Delights, Princeton University Press, 1998, pp. 108–111
Miller, Jeff, Earliest Known Uses of Some of the Words of Mathematics, http://members.aol.com/jeff570/w.html
Osen, Lynn M., Women in Mathematics, MIT Press, 1974
Singh, Simon, Fermat’s Enigma, Anchor Books, 1998
Smith, History of Mathematics, Vol II, Ginn and Company, 1953, p. 331
Stigler, Stephen M. Studies in the History of Probability and Statistics. XXXIII Cauchy and the Witch of Agnesi: An Historical Note on the Cauchy Distribution, Biometrika, Vol. 61, No. 2 (Aug., 1974), pp. 375-380
Stigler, Stephen M., Statistics on the Table, Harvard Press, 1999
Trivedi, Kishor Shridharbhai, Probability & Statistics with Reliability, Queueing and Computer Science Applications, Prentice Hall, 1982
Wolfram MathWorld, Witch of Agnesi,