Showing posts with label standard deviation. Show all posts
Showing posts with label standard deviation. Show all posts

Wednesday, July 23, 2014

Standard deviation - why the n and n-1?

When some people hear the word "deviant", they think about people who do stuff with Saran Wrap, handcuffs and Camembert cheese. But I'm a Math Guy, so I don't think about those things. I am reminded of statistics, not sadistics.


Which brings me around to a question that was asked of me by Brad:

I was trying to bone up on my stats knowledge the other day. I came across a few mentions of population vs sample. If someone states sigma vs. standard deviation is that dependent on this population vs. sample? 

And how does one determine population vs sample? Could you not consider everything just a sample? Seems like it could be very subjective.

The question leads to the great confusion between n and n-1. As I remember from my Statistics 101 class (back in 1843), there were two different formulae that were introduced for the standard deviation. One of the fomulae had n in the denominator, and the other had n-1. The first formula was called the population standard deviation, and the second was called the sample standard deviation.

(Quick note to the wise, if you wish to sound erudite and wicked smart, then spell the plural of "formula" with an "e" at the end. Spelling it "formulas" is so déclassé. I also recommend spelling "gray" with an e: grey. The Brits just naturally sound smart.)

So, what gives? Why divide by n-1?

Population standard deviation

Below we have the formula for the "population" standard deviation.

Formula for population standard deviation

You subtract the mean from all the samples, and square each difference. Squaring them does two things. First, it makes them all positive. After all, we want to count a negative deviation the same as a positive deviation, right? Second, it gives more weight to the larger deviations. 

Do you really want to give more weight to the larger deviations?  I dunno. Maybe. Depends? Maybe you don't. For some purposes, it might be better to take the absolute value, rather than the square. This leads to a whole 'nother branch of statistics, though. Perfectly valid, but some of of the rules change.

The squares of the deviations from the mean are then added up, and divide by the number of samples. This gives you the average of the squared deviations. That sounds like a useful quantity, but we want to do one more thing. This is an average of the squares, which mean that the units are squared units. If the original data was in meters or cubic millimeters, then the average of the squared deviations is in squared millimeters, or in squared cubic millimeters. So, we take the square root to get us back to the original units.

Sample standard deviation

And then there's the formula for the sample standard deviation. The name "sample" versus "population" gives some indication of the difference between the two types of standard deviation. For a sample standard deviation, you are sampling. You don't have all the data. 

That kinda makes it easy. In the real world, you never have all the data. Well... I guess you could argue that you might have all the data if you did 100% inspection of a production run. Then again, are we looking for the variation in one lot of product, or the variation that the production equipment is capable?  In general, you don't have all the data, so all you can compute is the sample standard deviation.

Formula for the sample standard deviation

Let's look at the other differences. Note that one population formula uses the symbol μ for the mean, and the sample standard deviation uses the symbol x-bar. The first symbol stands for the actual value of the average of all the data. The latter stands for an estimate of the average of all the data.

Estimate of the average?

I have a subtle distinction to make. We are used to thinking that the statistical mean is just a fancy word for "average", but there is a subtle difference. The average (or should I say "an" average) is one estimate of the mean. If I take another collection of data points from the whole set of them (if I sample the population), then I get another estimate of the mean.

One may ask "how good is this estimate? If you take one data point to compute the average (kind of a silly average, since there is only one) then you have no idea how good the average is. But if you have the luxury of taking a bunch of data points, then you have some information about how close the average might be to the mean. I'm, not being very statistical here, but it seems like a good guess that the true mean would lie somewhere between the smallest data point and the largest.

Let's be a bit more precise. If you sample randomly from the population, and if the data doesn't have a RFW distribution, and if you take at least a bunch of points, like ten or twenty, then there is something like a 68% chance that the true mean lies in the interval.

By the way, if you were wondering, the acronym RFW stands for Really GoshDarn Weird.

This is kind of an important result. If you wish to improve the statistical accuracy of your estimate of the mean by, for example, a factor of two, then you need to average four points together.If you want to improve your estimate by a factor of ten, you will need to average 100 data points.

Difference between sample and population standard deviation

Finally, I can state a little more precisely how to decide which formula is correct. It all comes down to how you arrived at your estimate of the mean. If you have the actual mean, then you use the population standard deviation, and divide by n. If you come up with an estimate of the mean based on averaging the data, then you should use the sample standard deviation, and divide by n-1.

Why n-1????  The derivation of that particular number is a bit involved, so I won't explain it. I would of course explain it if I understood it, but it's just too complicated for me. I can, however, motivate the correction a bit.

Let's say you came upon a magic lamp, and got the traditional three wishes. I would guess that most people would use the first couple of wishes on money, power, and sex, or some combination thereof. Or maybe something dumb like good health. But, I am sure most of me readers would opt for something different, like the ability to use a number other than x-bar (the average) in the formula for the sample standard deviation.

You might pick the average, or you might pick a number just a bit smaller, or maybe a lot larger. If you tried a gazillion different numbers, you might find something interesting. That summation thing in the numerator? It is the smallest when you happen to pick the average for x-bar.

This has an important ramification, since x-bar is only an estimate of the true mean. It means that if you estimate the standard deviation using n in the denominator, you are almost guaranteed to have an estimate of the standard deviation that is too low. This means that if we divide by n, we have a bias. The summation will tend to be too low. Dividing by n-1 is just enough to balance out the bias.

So, there is the incomplete answer.

Oh, one more thing... does it make a big difference? If you are computing the standard deviation of 10 points, the standard deviation will be off by around 5%. If you have 100 points, you will be off by 0.5%. When it comes down to it, that error is insignificant.

Wednesday, August 15, 2012

Deviant ways to compute the deviation

I once found a bug in Excel. This was a long time ago, and they have since fixed it. I also found that same bug in every calculator I could find. I found this ubiquitous bug because I was looking for it. You see, I had found that bug in my own software. Somehow, finding that bug elsewhere made me feel better.
I was computing the standard deviation of all the pixels in an image, so there were umpty-leven thousand data points in the computation1. I used the recommended method, which seemed like a good thing. Boy! Was I ever wrong! All of a sudden, I found I was asking the computer to find the square root of a negative number. And it wasn’t liking me.

No imagination allowed
I found the cause of the error, and realized it was bum advice from my college statistics book. Dumb old college, anyway!
Computation of standard deviation
There are two popular formulae for the computation of standard deviation. The first formula is the definition, given below. This formula is moderately understandable and is invariably the first formula given.
Another formula often follows immediately on the footsteps of the first:
Why this formula should be a measure of deviation from the mean is not immediately obvious, but the leading statistics texts generally devote some ink to showing that the two formulae are equivalent2. They will then point out that the second formula is more convenient to use, and requires less calculations. This rephrasing of the original equation seems to be the formula of choice.
I took a quick poll of the introductory statistics books at hand. The following eight books were sampled. (If a page number is given in parentheses, this is the page where the book introduces the second form of the standard deviation formula.)
    Bevington (p. 14)
    Freedman, Pisani and Purves (p. 65)
    Langley (p. 58)
    Mandel (not given)
    Mendenhall (p. 39)
    Meyer (not given)
    Miller and Freund (p. 156-7)
    Snedecor and Cochran (p. 32-3)
Two of the texts are clearly in love with this second formula.
[The second formula] “has the dual advantage of requiring less labor, and giving better accuracy.”
Miller, et al.
[The second formula] “tends to give better computational accuracy than the method utilizing the deviations.”
Mendenhall
There seems to be a strong consensus here. This would seem to be fact then, right?
Difficulty with standard formula
The second formula is often chosen because it requires only a single pass through the data, without having to remember the whole array. As you go through the data points, you sum the values (in order to compute the average), and you sum the squares of the values, and you count the number of data points. These three numbers plug into the second formula to give the standard deviation.
On a calculator, this is a very practical concern, especially back in my day when we used to have to make our own calculators with paper clicks, Elmer’s Glue, and used Popsicle sticks. It is (or was) considered a luxury to be able to store more than a few numbers in the memory. The second formulas managed to get by with storing only three numbers.
So, this would seem to be a recipe for paradise. Six out of eight stats textbooks recommend brushing with the second formula. On top of that, the second formula seems tailor made for a calculator with limited memory.
But, alas, there is trouble in paradise.
I reproduce here a quote from a book which I have found to be very practical-minded, and which has agreed with me every time that I have felt qualified to have an opinion.
[The second formula] “is generally unjustifiable in terms of computing speed and / or roundoff error.”
Press, et al., p. 458
Even more to the point is the following quote
“Novice programmers who calculate the standard deviation of some observations by using the [second] formula...often find themselves taking the square root of a negative number! A much better way to calculate means and standard deviations is to use the recurrence formulas...”
Knuth3, p 216
I believe we have some disagreement here. These two texts are clearly in the minority. But since neither one of these books is actually a statistics books, maybe we should just ignore them? Except for the fact that the minority is right. I became a believer because I was once a novice programmer who found himself taking the square root of a negative number4.
The little difficulty of the second formula comes from a recurring issue that causes numerical analysts to wake up in the middle of the night in a cold sweat: the loss of precision caused by subtracting two numbers that are close in size. If the average of the data is quite large compared to the standard deviation, or if n is quite large, then the second formula will require the difference to be computed between two large and nearly equal numbers. You need a whole lot of precision to pull this off.
To illustrate, consider the following data set of three points: (1,000,001, 1,000,002, and 1,000,003). The average of the data is obviously 1,000,002. We can easily compute the correct standard deviation from the first equation:
errata - the third term should be (1000003 - 1000002)^2
Using the second equation for computing the data set, we first compute the sum of the squares of the data: 3,000,012,000,014. (Gosh, that’s a big number!) Then we compute n times (the average of the data squared): 3,000,012,000,012.
We next take the difference between these two numbers. Provided we have at least 13 digits of precision, we get the correct difference of 2, and subsequently the correct standard deviation of 1. But with less than 13 digits of precision, the difference could be negative or positive, but probably isn’t 2. This is what Press et al., and Knuth were warning of.
Two pass algorithm
The most obvious way to avoid this bug is to go back to the original formula. This algorithm requires two passes through the data: the first pass to compute the mean, and the second pass to compute the variance.
This method has the disadvantage of requiring the data to be retained for a second pass. You just can’t do that in a calculator with limited memory. Knuth (p. 216) suggested a different option.
The recursive algorithm
Now we are ready for the fun part. Let’s say that we have the average of 171 numbers, and we want to average one more number in. Do we have to start over at the beginning and add up all 172 numbers? Or is there some shortcut? Is there ever an end to rhetorical questions?
The answers are no, yes, and “why do you ask so many questions?”
The really fun part is that the sum of the first 171 numbers is just 171 times the average. To get the average of the 172 numbers, you multiply the average of the first 171 numbers by 171. Then you add the 172nd number, and divide by 172. Diabolically clever, eh?
Put in algebraic form, if the average of the first j numbers is mj,
Then the average of the first j+1 numbers is given by this formula
with m1 = x1.
This is known as a recursive formula, one that builds on previous calculations rather than one that starts from scratch. The recursive formula for the average is pretty simple to understand intuitively. There is a bit more algebra in between, but it is possible to calculate the standard deviation in a recursive manner as well. If you know the standard deviation and mean of the first 171 data points, you can calculate the standard deviation of the set of 172 data points from this.
Here is the recursive formula for the variance, which is the square of the standard deviation:
with
This is the formula for the variance. The standard deviation is the square root of this.
Summary
This has been a little lesson in numerical analysis called “don’t subtract one big number from another”. And a little lesson in “don’t believe everything you read”. And lastly, a lesson in “recursion can be your friend”.
References
Bevington, Phillip, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, 1969
Freedman, Pisani and Purves, Statistics, 1978, W.W. Norton & Co.
Hamming, Richard, Numerical Methods for Scientists and Engineers, 1962, McGraw-Hill
Knuth, Donald E., Art of Computer Programming, Vol. 2, SemiNumerical Algorithms, 2nd edition, 1981, Addison Wesley
Langley, Russel, Practical Statistics Simply Explained, 1971, Dover
Mandel, John, The Statistical Analysis of Experimental Data, 1964, Dover
Mendenhall, William, Introduction to Probability and Statistics, 4th ed. 1975, Duxbury Press
Meyer, Stuart, Data Analysis for Scientists and Engineers, 1975, Wiley and Sons
Miller, Irwin, Freund, John, Probability and Statistics for Engineers, 3rd ed. 1985, Prentice Hall
Press, William, Flannery, Brian, Teukolsky, Saul, Vetterling, William, Numerical Recipes, the Art of Scientific Computing, Cambridge University Press, 1986
Snedecor, George, Cochran, William, Statistical Methods, 7th ed. 1980, Iowa State University Press
1) To be honest, these were the days when a big image was 640 by 480, so there weren’t quite that many data points.
2) Since the derivation is so widely available, it will not be reproduced here. Suffice it to say that the squared quantity in the first formula is expanded, and the summation is broken into three summations, some of which can be simplified by replacement with the definition of the average.
3) Donald Knuth, was born, by the way, in Milwaukee, WI. I currently live in Milwaukee. His father ran a printing business in Milwaukee. I work for a little mom and pop printing company in Milwaukee. Need I go on?
4) I was once a novice programmer, but I don't spend much time programming anymore.  I remain, however, a novice!