Wednesday, July 23, 2014

Standard deviation - why the n and n-1?

When some people hear the word "deviant", they think about people who do stuff with Saran Wrap, handcuffs and Camembert cheese. But I'm a Math Guy, so I don't think about those things. I am reminded of statistics, not sadistics.


Which brings me around to a question that was asked of me by Brad:

I was trying to bone up on my stats knowledge the other day. I came across a few mentions of population vs sample. If someone states sigma vs. standard deviation is that dependent on this population vs. sample? 

And how does one determine population vs sample? Could you not consider everything just a sample? Seems like it could be very subjective.

The question leads to the great confusion between n and n-1. As I remember from my Statistics 101 class (back in 1843), there were two different formulae that were introduced for the standard deviation. One of the formulae had n in the denominator, and the other had n-1. The first formula was called the population standard deviation, and the second was called the sample standard deviation.

(Quick note to the wise, if you wish to sound erudite and wicked smart, then spell the plural of "formula" with an "e" at the end. Spelling it "formulas" is so déclassé. I also recommend spelling "gray" with an e: grey. The Brits just naturally sound smart.)

So, what gives? Why divide by n-1?

Population standard deviation

Below we have the formula for the "population" standard deviation.

Formula for population standard deviation
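
In standard notation, with μ for the population mean and N for the number of data points, it reads:

\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \]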

You subtract the mean from all the samples, and square each difference. Squaring them does two things. First, it makes them all positive. After all, we want to count a negative deviation the same as a positive deviation, right? Second, it gives more weight to the larger deviations. 

Do you really want to give more weight to the larger deviations? I dunno. Maybe. Depends? Maybe you don't. For some purposes, it might be better to take the absolute value, rather than the square. This leads to a whole 'nother branch of statistics, though. Perfectly valid, but some of the rules change.

The squares of the deviations from the mean are then added up and divided by the number of samples. This gives you the average of the squared deviations. That sounds like a useful quantity, but we want to do one more thing. This is an average of the squares, which means that the units are squared units. If the original data was in meters or cubic millimeters, then the average of the squared deviations is in squared meters, or in squared cubic millimeters. So, we take the square root to get us back to the original units.
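
If you like to see things spelled out in code, here is a minimal sketch of that procedure in Python, using a handful of made-up numbers:

```python
# Population standard deviation, step by step
data = [4.0, 8.0, 6.0, 5.0, 3.0]                  # made-up data, treated as the whole population

mean = sum(data) / len(data)                      # the population mean
squared_devs = [(x - mean) ** 2 for x in data]    # square each deviation from the mean
variance = sum(squared_devs) / len(data)          # average of the squared deviations (divide by n)
sigma = variance ** 0.5                           # square root gets us back to the original units

print(mean, variance, sigma)                      # 5.2, 2.96, about 1.72
```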

Sample standard deviation

And then there's the formula for the sample standard deviation. The name "sample" versus "population" gives some indication of the difference between the two types of standard deviation. For a sample standard deviation, you are sampling. You don't have all the data. 

That kinda makes it easy. In the real world, you never have all the data. Well... I guess you could argue that you might have all the data if you did 100% inspection of a production run. Then again, are we looking for the variation in one lot of product, or the variation that the production equipment is capable of? In general, you don't have all the data, so all you can compute is the sample standard deviation.

Formula for the sample standard deviation
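
In standard notation, with x-bar for the average of the sample and n for the number of data points in the sample, it reads:

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]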

Let's look at the other differences. Note that the population formula uses the symbol μ for the mean, and the sample standard deviation uses the symbol x-bar. The first symbol stands for the actual value of the average of all the data. The latter stands for an estimate of the average of all the data.

Estimate of the average?

I have a subtle distinction to make. We are used to thinking that the statistical mean is just a fancy word for "average", but there is a subtle difference. The average (or should I say "an" average) is one estimate of the mean. If I take another collection of data points from the whole set of them (if I sample the population), then I get another estimate of the mean.

One may ask "how good is this estimate?" If you take one data point to compute the average (kind of a silly average, since there is only one), then you have no idea how good the average is. But if you have the luxury of taking a bunch of data points, then you have some information about how close the average might be to the mean. I'm not being very statistical here, but it seems like a good guess that the true mean would lie somewhere between the smallest data point and the largest.

Let's be a bit more precise. If you sample randomly from the population, and if the data doesn't have a RFW distribution, and if you take at least a bunch of points, like ten or twenty, then there is something like a 68% chance that the true mean lies within one standard error of your average (that is, within the standard deviation divided by the square root of the number of points).

By the way, if you were wondering, the acronym RFW stands for Really GoshDarn Weird.

This is kind of an important result. The uncertainty in the average shrinks as the square root of the number of data points. If you wish to improve the statistical accuracy of your estimate of the mean by, for example, a factor of two, then you need to average four points together. If you want to improve your estimate by a factor of ten, you will need to average 100 data points.
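
Here is a little simulation sketch (Python, assuming a normal population purely for illustration) that shows the square-root behavior:

```python
import random

# Sketch: the spread of the average shrinks like 1/sqrt(n).
# Assumes a normal population with mean 100 and standard deviation 10, just for illustration.
random.seed(1)

def spread_of_average(n, trials=5000):
    """Estimate, by simulation, how much the average of n points wanders around."""
    averages = []
    for _ in range(trials):
        sample = [random.gauss(100, 10) for _ in range(n)]
        averages.append(sum(sample) / n)
    grand_mean = sum(averages) / trials
    return (sum((a - grand_mean) ** 2 for a in averages) / trials) ** 0.5

for n in (1, 4, 100):
    print(n, round(spread_of_average(n), 2))
# Roughly 10, 5, and 1: four times the points buys a factor of two,
# a hundred times the points buys a factor of ten.
```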

Difference between sample and population standard deviation

Finally, I can state a little more precisely how to decide which formula is correct. It all comes down to how you arrived at your estimate of the mean. If you have the actual mean, then you use the population standard deviation, and divide by n. If you come up with an estimate of the mean based on averaging the data, then you should use the sample standard deviation, and divide by n-1.

Why n-1????  The derivation of that particular number is a bit involved, so I won't explain it. I would of course explain it if I understood it, but it's just too complicated for me. I can, however, motivate the correction a bit.

Let's say you came upon a magic lamp, and got the traditional three wishes. I would guess that most people would use the first couple of wishes on money, power, and sex, or some combination thereof. Or maybe something dumb like good health. But, I am sure most of my readers would opt for something different, like the ability to use a number other than x-bar (the average) in the formula for the sample standard deviation.

You might pick the average, or you might pick a number just a bit smaller, or maybe a lot larger. If you tried a gazillion different numbers, you might find something interesting. That summation thing in the numerator? It is the smallest when you happen to pick the average for x-bar.
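
You don't actually need the magic lamp; a few lines of Python (with made-up numbers) will show the same thing:

```python
# Sketch: of all the numbers you could plug in for x-bar, the plain average of
# the data makes the sum of squared deviations the smallest.
data = [4.0, 8.0, 6.0, 5.0, 3.0]    # made-up data
avg = sum(data) / len(data)         # 5.2

def sum_sq_dev(center):
    return sum((x - center) ** 2 for x in data)

for center in (avg - 1.0, avg - 0.1, avg, avg + 0.1, avg + 1.0):
    print(round(center, 1), round(sum_sq_dev(center), 2))
# The middle line (center = 5.2) gives the smallest sum, 14.8; everything else is bigger.
```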

This has an important ramification, since x-bar is only an estimate of the true mean. The summation in the numerator will tend to be too small, so if you estimate the standard deviation using n in the denominator, you are almost guaranteed to get an estimate of the standard deviation that is too low. In other words, dividing by n gives a biased estimate. Dividing by n-1 is just enough to balance out the bias.
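
To see the bias in action, here is a small simulation sketch (Python, assuming a normal population with a true standard deviation of 10, just for illustration):

```python
import random

# Sketch: draw many small samples, estimate the mean from each sample, and
# compare the divide-by-n and divide-by-(n-1) recipes against the truth (10).
random.seed(2)

n, trials = 5, 20000
sum_var_n = 0.0
sum_var_n1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 10.0) for _ in range(n)]
    xbar = sum(sample) / n                       # estimate of the mean from this sample
    ss = sum((x - xbar) ** 2 for x in sample)    # sum of squared deviations from x-bar
    sum_var_n += ss / n                          # "population" recipe with the estimated mean
    sum_var_n1 += ss / (n - 1)                   # "sample" recipe

print((sum_var_n / trials) ** 0.5)    # roughly 8.9 -- biased low
print((sum_var_n1 / trials) ** 0.5)   # roughly 10  -- the n-1 correction balances the bias
```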

So, there is the incomplete answer.

Oh, one more thing... does it make a big difference? If you are computing the standard deviation of 10 points, the standard deviation will be off by around 5%. If you have 100 points, you will be off by 0.5%. When it comes down to it, that error is insignificant.
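
If you want to check those percentages, they come from the ratio between the two recipes:

\[ \sqrt{\frac{n-1}{n}} \approx 0.95 \ \text{for } n = 10, \qquad \approx 0.995 \ \text{for } n = 100 \]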

9 comments:

  1. I've wondered about this question for years, and this is the best explanation I've ever seen. Totally makes sense now. thank you!

  2. I've wondered about this question for a while, and this is the best explanation I've seen. Thank you!

  3. John Seymour that's the explanation one should give to their students!

  4. Best answer. So much more useful than the gibberish from "professors"

  5. That's a great easy to understand explanation. Thanks.

  6. However, what is the logic of using n-1 as the denominator in case of estimation of mean using a sample ?

  7. Logic of n-1 instead of n ?

  8. But I don't get one thing. The sample standard deviation is an estimator of the population standard deviation, and this estimate can be either greater or smaller than the population's. But in writing the 'n-1' in the denominator we are assuming this estimate to be smaller (hence dividing by a smaller number). I don't seem to be convinced of this assumption.

    1. Good question, Saptashwa. Yes, the equation gives us an estimate of the population standard deviation. And yes, that estimate could be bigger or smaller than the actual population standard deviation, since we are not including all potential data in the computation. But I claim that this estimate of the population standard deviation (when you divide by n) is biased. In the long run, you will get a number slightly smaller than the actual population standard deviation.

      Why? You need the mean in order to compute the standard deviation. Generally speaking, you use the same data to compute the mean and the standard deviation... but you don't need to.

      Averaging the data does not give you the true population mean, but rather an estimate of it. That estimate of the mean could be smaller or larger than the population mean. If you pull another sample of data points from the same population, you will likely get a different estimate of the mean.

      Here is the kicker... of all possible estimates of the mean, the average of your original data set gives the smallest estimate of the standard deviation. The true mean could be bigger or smaller, but we choose the smallest of them to compute the standard deviation? That gives us a bias.
