Wednesday, February 4, 2015

How many samples do I need?

Simple question:  If I am sampling a production run to verify tolerances, how many production pieces do I need to sample?

It's an easily stated question, and also an important one. Millions of dollars may be at stake if a production run has to be scrapped or if the customer has to be reimbursed for a run that was out of tolerance (so-called "makegoods"). On the other side, the manufacturer may need to spend tens or hundreds of thousands of dollars on equipment to perform inspection.

For certain manufactured goods, 100% compliance is required. The cost of delivering a bad Mercedes, pharmaceutical, or lottery ticket is very high, so pretty much every finished good has to be inspected. But in most cases, the cost of a few bad goods is not that great. If a few cereal boxes burst because of a bad glue seal, or a page in the Color Scientist Monthly is smeared, how bad can that be?  It's a calculated risk of product waste versus inspection cost.

Now if the foldout featuring the PlayColor of the Month is smeared, that's another story.

I volunteer to do 100% inspection of the taste of these products!

In the world I live in - printing - contracts often stipulate a percentage of product that must be within a given tolerance. This is reflected in the ISO standards. I have pointed out previously that ISO 12647-2 requires that 68% of the color control patches within a run be within a specified tolerance. The thought is, if we get 68% of the samples within a "pretty good" tolerance, then 95% will be within a "kinda good" tolerance. All that bell curve kinda stuff.

A press run may have tens of thousands or even millions of impressions. Clearly you don't need to sample all of the control patches in the press run in order to establish the 68%, but how many samples are needed to get a good guess?

Maybe three samples?

Keeping things simple, let's assume that I pull three samples from the run, and measure those. There are four possible outcomes: all three might be in compliance, two of the three might be in compliance, only one may be in compliance, or none of the samples might be in compliance. I'm going to cheat just a tiny bit, and pretend that if two or more of the three pass, then I am in compliance. That's 66.7% versus 68%. It's an example. Ok?

I am also going to assume that random sampling is done, or more accurately, that the sampling is done in such a way that the variations in the samples are independent. Note that pulling three samples in a row almost certainly violates this. Sampling at the end of each batch, roll, or work shift probably also violates this. And at the very least, the samples must be staggered through the run. 

Under those assumptions, we can start looking at the likelihood of different outcomes. The table below shows the eight possible outcomes, and the ultimate diagnosis of the production run. 

| Sample 1 | Sample 2 | Sample 3 | Run diagnosis | Probability |
|----------|----------|----------|---------------|-------------|
| Not so good | Not so good | Not so good | Fail | (1-p)³ |
| Not so good | Not so good | Good | Fail | p (1-p)² |
| Not so good | Good | Not so good | Fail | p (1-p)² |
| Not so good | Good | Good | Pass | p² (1-p) |
| Good | Not so good | Not so good | Fail | p (1-p)² |
| Good | Not so good | Good | Pass | p² (1-p) |
| Good | Good | Not so good | Pass | p² (1-p) |
| Good | Good | Good | Pass | p³ |
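Just to double-check the table, here is a little brute-force sketch in Python (my own, not part of the original analysis) that regenerates all eight rows by enumeration:

```python
from itertools import product

# Enumerate the eight possible outcomes of pulling three samples.
# p is the chance that any one sample is good.
for outcome in product(["Not so good", "Good"], repeat=3):
    goods = outcome.count("Good")
    diagnosis = "Pass" if goods >= 2 else "Fail"   # at least 2 of 3 good
    probability = f"p^{goods} (1-p)^{3 - goods}"
    print(*outcome, diagnosis, probability, sep=" | ")
```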

Four of the possibilities show the run passing, and four show it failing, but this is not to say that there is a 50% chance of passing. The possible outcomes are not equally likely. It depends on the probability that any particular sample is good. If, for example, the production run were overwhelmingly in compliance (as one would hope), the probability that all three samples would come up good is very high.

The right-most column helps us quantify this. If the probability of pulling a good sample is p, then the probability of pulling three good samples is p³. From this, we can quantify the likelihood that we will get at least the requisite two good samples out of three to qualify the production run as good.

     Probability of ok-ing the run based on three samples = p² (1-p) + p² (1-p) + p² (1-p) + p³ = 3p² (1-p) + p³
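The formula is easy enough to evaluate by hand, but here is a two-line Python version (a sketch of mine), plugging in the compliance rates that show up below:

```python
def p_pass_three(p):
    """Chance that at least two of three independent samples are good."""
    return 3 * p**2 * (1 - p) + p**3

for p in (0.40, 0.80):
    print(f"true compliance {p:.0%}: chance of OK-ing the run = {p_pass_three(p):.1%}")
# 40% -> 35.2%, 80% -> 89.6% (so a bit over a 10% chance of failing)
```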

Things start going bad

What could possibly go wrong?  We have proper random sampling, and we have a very official looking formula.

Actually, two different things could go wrong. First off, the production run might be perfectly good, but, by luck of the draw, two or three bad samples were drawn. I'm pretty sure the manufacturer wouldn't like that. 

The other thing that could go wrong is that the production run was actually out of tolerance (more than one-third of the pieces were bad), but this time Lady Tyche (the Goddess of Chance) favored the manufacturer. The buyer probably wouldn't like that.

From the formula above, we can plot the outcomes as a function of the true percentage that is in tolerance. The plot conveniently shows the four possibilities: correctly rejected, incorrectly accepted, correctly accepted, and incorrectly rejected.

Outcomes when 3 samples are used to test for conformance

Looking at the plot, we can see that if 40% of the widgets in the whole run were in tolerance, then there is a 35.2% chance that the job will be given the thumbs up, and consequently a 64.8% chance of being given the thumbs down, as it should be. Manufacturers who are substandard will be happy that they still have a fighting chance if the right samples are pulled for testing. This, of course, is liable to be a bit disconcerting for the folks who buy these products.

But the good manufacturers will bemoan the fact that even when they do a stellar job of getting 80% of the widgets widgetting properly, there is still a better than 10% chance that the job will be kicked out.

Just in case you were wondering, the area of the red region (representing incorrect decisions) is 22.84%. That seems like a pretty good way to quantify the efficacy of deciding about the run based on three samples.
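If you want to check that 22.84% yourself, here is a sketch of the calculation. The assumptions baked in: the true compliance rate is equally likely to be anywhere from 0% to 100%, and the run deserves to pass exactly when at least two-thirds of it is in tolerance.

```python
def p_pass_three(p):
    """Chance that at least two of three independent samples are good."""
    return 3 * p**2 * (1 - p) + p**3

def mistake(p):
    # We are wrong if we pass a run that is really below the 2/3 cutoff,
    # or fail a run that is really at or above it.
    return p_pass_three(p) if p < 2 / 3 else 1 - p_pass_three(p)

# Midpoint-rule average of the mistake rate over all compliance rates.
steps = 100_000
red_area = sum(mistake((i + 0.5) / steps) for i in range(steps)) / steps
print(f"Area of the red region: {red_area:.2%}")   # about 22.84%
```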

How about 30 samples?

Three samples does sound a little skimpy -- even for a lazy guy like me. How about 30? The Seymourgraph for 30 samples is shown below. It does look quite a bit better... not quite so much of the bad decision making, especially when it comes to wrongly accepting lousy jobs. Remember the manufacturer who got away with shipping lots that were only 40% in tolerance about one time in three? If he is required to sample 30 products to test for compliance, all of a sudden his chance of getting away with it drops way down to 0.3%. Justice has been served!

Outcomes when 30 samples are used to test for conformance

And at the other end, the stellar manufacturer who is producing 80% of the products in tolerance now has only a 2.6% chance of being unfairly accused of shoddy merchandise. That's better, but if I were a stellar manufacturer, I would prefer not to get called out on the carpet once out of 40 jobs. I would look into doing more sampling so I could demonstrate my prowess.
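For 30 samples the algebra gets tedious, but it is just a binomial tail. Here is a sketch using scipy, assuming the pass rule is "at least two-thirds of the samples (20 of 30) are good":

```python
from scipy.stats import binom

n, k = 30, 20                        # pass the run if at least k of n samples are good

for p in (0.40, 0.80):
    p_pass = binom.sf(k - 1, n, p)   # P(at least k good samples)
    print(f"true compliance {p:.0%}: chance of passing = {p_pass:.2%}")
# The 40% factory now passes only about 0.3% of the time, and the
# 80% factory fails about 2.6% of the time.
```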

The area of the red region is now 6.95%, by the way. I'm not real sure what that means. It kinda means that the mistake rate is about 7%, but you gotta be careful. The mistake rate for a particular factory depends on the percentage of product produced within tolerance. This 7% is the mistake rate for pulling 30 samples, averaged over all possible factories.

I am having a hard time getting my head around that, but it still strikes me that this is a decent way to measure the efficacy of pulling 30 samples.

How about 300 samples?

So... thirty samples feels like a lot of samples, especially for a lazy guy like me. I guess if it was part of my job, I could deal with it. But as we saw in the last section, it's probably not quite enough. Misdiagnosing the run 7% of the time sounds a bit harsh. 

Let's take it up a notch to 300 samples. The graph, shown below, looks pretty decent. The mis-attributions occur only when the true compliance rate is between about 59% and 72%. One could make the case that, if a production facility is cutting it that close, then it might not be so bad for them to be called out on the carpet once in a while.

Outcomes when 300 samples are used to test for conformance

Remember looking at the area of the red part of the graph... the rate of mis-attributions?  The area was 22.84% when we took 3 samples. It went down to 6.95% with 30 samples. With 300 samples, the mis-attribution rate goes down to 2.17%. 
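Here is the earlier integration sketch generalized to any number of samples, with scipy doing the binomial tails. Same assumptions as before: a uniform prior on the true compliance rate, and a pass rule of at least two-thirds good samples.

```python
import numpy as np
from scipy.stats import binom

def red_area(n, k, cutoff=2/3, steps=20_000):
    """Mistake rate averaged over all true compliance rates (uniform prior)."""
    p = (np.arange(steps) + 0.5) / steps      # midpoints spanning 0% to 100%
    p_pass = binom.sf(k - 1, n, p)            # P(at least k of n samples good)
    wrong = np.where(p < cutoff, p_pass, 1.0 - p_pass)
    return wrong.mean()

for n in (3, 30, 300):
    print(f"{n} samples: red area = {red_area(n, 2 * n // 3):.2%}")
# about 22.84%, 6.95%, and 2.17%, matching the three plots above
```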

The astute reader may have noticed that each factor of ten increase in the number of samples decreases the mis-attribution rate by a factor of about three. In general, one would expect the mis-attribution rate to drop with the square root of the number of samples: multiplying the sample count by ten divides the mis-attribution rate by the square root of ten, which is about 3.16.
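A quick sanity check of that rule of thumb, using the red areas quoted above:

```python
# Predict each red area from the previous one using the square-root rule.
areas = {3: 22.84, 30: 6.95, 300: 2.17}      # percent, from the plots above
for n in (3, 30):
    predicted = areas[n] / 10 ** 0.5
    print(f"{n} -> {10 * n} samples: predicted {predicted:.2f}%, actual {areas[10 * n]:.2f}%")
# 3 -> 30: predicted 7.22%, actual 6.95%
# 30 -> 300: predicted 2.20%, actual 2.17%
```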

If our goal is to bring the mis-attribution rate down to 1%, we would need to pull about 1,200 samples. While 300 samples is beyond my attention span, 1,200 samples is way beyond my attention span. Someplace in there, the factory needs to consider investing in some automated inspection equipment.

The Squishy Answer

So, how many samples do we need?

That's kind of a personal question... personal in that it requires a bit more knowledge. If the production plant is pretty darn lousy -- let's say only 20% of the product is within tolerance -- then you don't need many samples to establish the foregone conclusion. Probably more than 3 samples, but the writing is on the wall before 30 samples have been tested. Similarly, if the plant is stellar, and produces product that is in tolerance 99% of the time, then you won't need a whole lot of samples to statistically prove that at least 68% is within tolerance.

Then again, if you actually knew that the plant was producing 20% or 99% of the product in tolerance, then you wouldn't need to do any sampling, anyway. The only reason we are doing sampling is because we don't know.

The question gets a little squishy as you get close to the requested percentage. If your plant is consistently producing 68.1% of the product in tolerance, you would need to do a great deal of sampling to prove to a statistician that the plant was actually meeting the 68% in tolerance quota.

So... you kinda have to consider all possibilities. Go in without any expectations about the goodness of the production plant. Assume that the actual compliance rate could be anything.

The Moral of the Story 

If I start with the assumption that the production run could produce anywhere between 0% and 100% of the product in tolerance, and that each of these is equally likely, then if I take around 1,200 samples, I have about a 99% chance of correctly determining if 68% of the run is in tolerance. 

If you find yourself balking at that amount of hand checking, then it's high time you looked into some automated testing.

3 comments:

  1. This is all bafflegab intended to keep statisticians employed.

    Seriously, there are standards out there: ASTM E1345 and an analogous one from SAE. They might not be answering the precise question you are though (re-read my first sentence).

  2. I am a proud contributor to the gabbleflablocutions that keep statisticians employed.

    I admit that I have not read ASTM E1345. I will look into it, thank you.

  3. Oh... and speaking of gabfabble, I did read the captivating novel "Department of Defense Test Method Standard", MIL-STD-1916, April 1996. I am reluctant to admit that I was well into my second six-pack of Heineken before I started understanding it. (I will need to take another trip to the liquor store to get back to that level of understanding.)
