Wednesday, April 15, 2015

It's in the Rock Stars

I wrote a blog post two years ago on the topic of astrology. I looked at the prevalence of the signs of the zodiac for 189 mathematicians born between 1800 and 1840. Now, I would think that mathematicians tend to share certain personality traits. If there is anything to the standard zodiacal theory (that your sign of the zodiac determines your personality), then mathematicians should cluster around one astrological sign.

My research on mathematicians showed that there was no predominant sign of the zodiac associated with this group of mathematicians. They were pretty much evenly distributed across the zodiac. Note to self: Someone's birth sign can be used to predict their personality, except if they happen to be a mathematician born between 1800 and 1840.

One of the readers of this blog threw down the gauntlet, offering up one group of people who do fit their astrological predictions. Here is the comment:

Hi
I have noticed that famous musicians that are just as famous for their politics as they are for their music are overwhelmingly Libras.
The only exception is the guy from U2.
Here are some:
John Lennon
Thom Yorke
Bruce Springsteen
There also seem to be an almost total lack of Scorpio musicians unless you count vanilla ice.
Aries musicians tend to make exciting music.

I missed a couple of example:
Sting -Mr Rainforest
Bob Geldof - Feed the world

Most musicians who are famous for their politics seem to be libra. But not all libra musicians are publicly political.

<Posted by Chunkations>

On reading this list, my first thought was ... what about the other Beatles? John Lennon wasn't the only one of them making a peace sign. Then, since I am a child of that era, a long list of other names came to mind: Bob Dylan, Pete Seeger, Peter, Paul, and Mary...

The Experiment

I decided to test Chunkations' hypothesis. I sent a Facebook message to each of my 16 friends, all of whom are into music. Note that I copied the words from Chunkations' comment directly. I did not try to spin my own interpretation of Chunkations' words, and I did not tell people what I planned on doing with the list.

I'm asking people for a little help with a future blog post. I am compiling a list of famous musicians that are just as famous for their politics as they are for their music. Could you nominate the five who you think are most qualified?

Thirteen of the friends gave me their lists. (To be perfectly honest, one of my friends was me. Another was my wife, who sometimes is my friend.) The combined list included 38 different musicians.

I had to do a little editing. Some people offered more than five. People just can't follow directions, you know? I went back and asked them to pick five from the list, or I just took the first five from their list.

Many people offered up music groups. (Again, people just can't follow directions.) Rage Against the Machine got three votes! Not sure what to do, I went to Wikipedia and used the name of the first group member listed. I tabulated the results both with and without these.

Here is a list of all the musicians, a count of the number of people who mentioned them, their birthday (according to Wikipedia), and their sign (according to the zodiac table on Wikipedia).

Artist                                         Count   Birthday   Sign
Country Joe McDonald                           1       1-Jan      Capricorn
Joan Baez                                      4       9-Jan      Capricorn
Zack de la Rocha (Rage Against the Machine)    3       12-Jan     Capricorn
Bob Marley                                     2       6-Feb      Aquarius
Sheryl Crow                                    1       11-Feb     Aquarius
Buffy Sainte-Marie                             1       20-Feb     Pisces
Harry Belafonte                                1       1-Mar      Pisces
Miriam Makeba                                  1       4-Mar      Pisces
Dee Snider                                     1       15-Mar     Pisces
Brian Warfield (The Wolfe Tones)               1       2-Apr      Aries
Al Green                                       1       13-Apr     Aries
Barbra Streisand                               2       24-Apr     Taurus
Willie Nelson                                  1       29-Apr     Taurus
Pete Seeger                                    4       3-May      Taurus
Richie Furay (Buffalo Springfield)             1       9-May      Taurus
Bono                                           5       10-May     Taurus
Bob Dylan                                      4       24-May     Gemini
Bruce Cockburn                                 1       27-May     Gemini
Peter Yarrow                                   1       31-May     Gemini
Audra McDonald                                 1       3-Jul      Cancer
Arlo Guthrie                                   1       10-Jul     Cancer
Woody Guthrie                                  2       14-Jul     Cancer
Serj Tankian                                   1       21-Aug     Leo
Beyoncé                                        1       4-Sep      Virgo
Chrissie Hynde                                 1       7-Sep      Virgo
Ani DiFranco                                   1       23-Sep     Libra
Bruce Springsteen                              2       23-Sep     Libra
Sting                                          1       2-Oct      Libra
John Mellencamp                                1       7-Oct      Libra
John Lennon                                    4       9-Oct      Libra
Martie Maguire (Dixie Chicks)                  1       12-Oct     Libra
Nadezhda Tolokonnikova (Pussy Riot)            1       7-Nov      Scorpio
Neil Young                                     1       12-Nov     Scorpio
Aaron Copland                                  1       14-Nov     Scorpio
Kevin Gilbert                                  1       20-Nov     Scorpio
Ted Nugent                                     4       13-Dec     Sagittarius
Eddie Vedder (Pearl Jam)                       1       23-Dec     Capricorn
Odetta                                         1       31-Dec     Capricorn

Here is the count for each of the signs. The first column is the sign. The second column is the count including the music groups, and the third column is the count leaving out Pussy Riot and the other groups.

Sign          With groups   Without groups
Capricorn     5             3
Aquarius      2             2
Pisces        4             4
Aries         2             1
Taurus        5             4
Gemini        3             3
Cancer        3             3
Leo           1             1
Virgo         2             2
Libra         6             5
Scorpio       4             3
Sagittarius   1             1

Are the folks in this list "overwhelmingly" Libras? Somehow I picture "overwhelmingly" to mean something like "most of". Like if 30 of the 38 were Libras, then I would be overwhelmed.


How about the softer claim at the end of the blog comment? "Most musicians who are famous for their politics seem to be libra." I think "most" would mean at least 20 of the 38. Hmmm... there were only 6.

There are more Libras (6) than any other sign, but statistically, that's not saying much. There are 5 each of Capricorns and Tauruses. Even if we exclude the music groups (where I might have erred), the story is the same.
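If you want to put a number on "not saying much," here is a quick sanity check: a chi-square goodness-of-fit test asking whether these counts are consistent with the twelve signs being equally likely. This is my own back-of-the-envelope addition, not Chunkations' math, and with only 38 musicians the expected count per sign is barely over 3, which makes the chi-square approximation a little shaky. Still:

    from scipy.stats import chisquare

    # Counts per sign, groups included, Capricorn through Sagittarius
    counts = [5, 2, 4, 2, 5, 3, 3, 1, 2, 6, 4, 1]

    # Null hypothesis: all twelve signs equally likely (expected = 38/12 each)
    stat, p_value = chisquare(counts)
    print(f"chi-square = {stat:.2f}, p = {p_value:.2f}")

When I run this, the p-value comes out around 0.6 - the Libra "excess" is exactly the size of bump that random chance hands you all the time.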

So, the experiment failed to support even a much weaker version of Chunkations' hypothesis.

Is the data skewed by non-famous people? Hmmm... There are more than a few people on the list that I have to admit I had never heard of. How can they be famous?

I came up with an A-list of those people who were nominated by more than one person. I also left out those who were mentioned only by their band name. Ten people were left: Baez, Marley, Streisand, Seeger, Bono, Dylan, Guthrie, Springsteen, Lennon, and Nugent. I count two Libras in this group - or just one, if you put Springsteen in the Virgo camp (more on that later). There are three Tauruses, though.

I think this pretty well busts that myth.

Thank you Anne, Betsy, Dave, Doug, Gringo, John, Madelaine, Mike, Paul, Rachel, Steve, Toby, and Tom for volunteering your lists of famous musicians and activists.

In conclusion

What does this say for the theory that your sign of the zodiac predicts stuff about you and your life?

Someone may well post on this blog that 93% of all left-handed cab drivers in Queens are Pisces. (Or maybe they are just pissed?)  I imagine that such a post might end with something like "How can you explain that, Mr. Smarty Math Guy Pants!!??!??"

A theory is useful if it can make predictions. I have given two examples where the birth sign makes absolutely lousy predictions, so it is not useful, and perhaps even harmful. For the theory to have any hope of proving itself useful, it must come with an instruction manual that tells me how to avoid making these two bad predictions. Until I have that manual, I have to assume that the whole theory is not useful.

But there is an odd thing about Springsteen...

I noticed one interesting thing comparing Chunkations' list with my own. He/she listed Springsteen as a Libra. I initially counted my pal Bruce as a Virgo. I double-checked Wikipedia's entry on the zodiac for the date ranges of the signs, and saw that I didn't mess up. It lists Virgo as August 23rd to September 23rd, and the Boss was born on September 23rd. By that definition, I was correct.


But then I looked further... The Wikipedia page for Virgo tells a slightly different story: "[T]he Sun transits this area on average between August 23 and September 22". By this definition, Springsteen falls into the Libra camp along with Mellencamp.

So I started furiously googling. I found 21 websites that gave a definition of the date range for Libra, and among them there were four different date ranges. There was one website (I won't tell which one) that contradicted itself about the date range for Libra. On the same page.

Why the confusion? The transitions between the signs of the zodiac are defined based on the position of the Sun and Earth with respect to one another and the stars, and also on what part of the globe you were born on. Springsteen and Ani DiFranco (both born September 23rd) were born on a cusp, the edge between two birth signs. They could go either way.

Incidentally, when I did my analysis before, the count of six Libras included Springsteen and DiFranco. With a slightly different definition of Libra, the Libra count would have been only four.

The hidden meaning of "What's your sine?"

Time for a little math history trivia. I thought my little pun on "What's your sine?" was cute. But it later occurred to me that there was another little twist.

Long ago, people invented astrology as a way to help explain the world and make predictions about whether to go to war or just throw a big block party. This perceived need spurred early astronomers to take detailed notes on the positions of the Moon, the stars, and the planets.

The astronomers had all this data, but they needed something to do with it. They needed to develop math to be able to predict the future positions of the celestial bodies. Consequently, trigonometry was invented. And it was invented almost a millennium before algebra. So there is another connection between the words "sign" and "sine".

Wednesday, April 1, 2015

Homeopathic exercise

The Australian National Health and Medical Research Council recently released a report on the efficacy of homeopathy. Their findings were pretty simple and straightforward. There ain't no efficacy. They reviewed all of the studies that have been done, and came to this conclusion:

Based on the assessment of the evidence of effectiveness of homeopathy, NHMRC concludes that there are no health conditions for which there is reliable evidence that homeopathy is effective.


I am disappointed that the report does not reference my own scholarly meta-analysis of the research on homeopathy, but... my research is fairly recent, and so far the homeopaths on the board of PubMed have successfully blocked the inclusion of my blog post. I am heartened, however, that the Australian study concurs with my own analysis.

In the wake of the total annihilation of homeopathy, I think the time is ripe for an alternative to alternative medicine; something preventative, that will prevent the need for the use of alternative medicine. 

Homeopathic exercise.

It even sounds impressive, doesn't it? 

Homeopathy is based on the premise that a little bit of that which hurts you is able to cure you. If a poison gives you a headache, then a tiny tiny tiny tiny amount of it will cure your headache, regardless of its cause. In some cases, the poison/medicine is diluted to the point where not one single molecule remains - just a memory of that molecule.

The really good stuff is diluted like this 30 times
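For the quantitatively inclined, here is the back-of-the-envelope arithmetic on what a 30C dilution (thirty successive 1:100 dilutions) leaves behind. The starting quantity is my own generous assumption:

    # How many molecules survive thirty successive 1:100 dilutions?
    AVOGADRO = 6.022e23     # molecules per mole
    starting_moles = 1.0    # a full mole of active ingredient (generous assumption)

    molecules_left = starting_moles * AVOGADRO * (1 / 100) ** 30
    print(f"{molecules_left:.1e} molecules remain")    # about 6e-37, i.e., none

Six times ten to the minus thirty-seventh molecules. Just the memory remains.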

In hindsight, the application to exercising is obvious. Exercise hurts, doesn't it? So obviously, we can avoid all the adverse effects of exercise (muscle pain, nausea, shortness of breath, profuse sweating, and body odor) by prescribing minute amounts of exercise.

You can even do homeopathic exercise at work!

I'm talking stuff like opening and closing eyelids, yawning a few times, scratching your butt, and maybe even scratching someone else's butt. Sitting in front of the TV is good, especially when accompanied by seven-ounce curls - twelve-ounce curls for the advanced homeopathic exerciser. A micro-brew would be perfect for this. A nano-brew or femto-brew would be excellent, but who can find them?

Reruns of Star Trek (the one with Captain Kirk?) are a great way to get the old heart pumping. Personally, I am avoiding any episodes with Yeoman Rand. In laboratory tests, I have seen that just imagining being in the transporter room with her can raise my pulse by a few tenths of a beat per minute. This is just too much. Could be very dangerous. And don't even think about an I Dream of Jeannie episode!

Careful not to overdose!

Lately, I have been experimenting by treatment with just a memory of exercise. Yesterday, I tried thinking about sixth grade gym class, where we were required to do sit-ups and pull-ups and throw-ups. I felt better immediately.


I'm still working the kinks out on my regimen. Let me know how it works for you. Happy April 1st!  

Friday, February 27, 2015

What color is the dress??!??!?!?

Enough people have asked me to adjudicate this question that I really have to do an emergency post.

Unless you are living in a dark cave you have probably heard the furor. What color is the dress????


It seems that 75% see it as white and gold and 25% see it as blue and black.


I have learned to avoid this type of question. I have learned in my marriage that if I say “Wow! Look at that really gorgeous woman in the turquoise dress! She’s really cute, and I would like to ask her out – maybe take her to Bermuda for the weekend.” this will start an argument with my wife about the color of the dress!

But clearly I need to weigh in on this. It comes down to how we define color.

What is "color"?

Definition #1: Color is something computed from the spectrum of light that comes off an object.

If this were true, a piece of copy paper would be brilliant white in the sunlight and a very, very dark brown under incandescent light. The reason it isn't is that the cones are auto-ranging; they adjust their sensitivity to the available light.

Definition #2: Color is defined by the signals collected by the cones after this auto-ranging.

This optical illusion proves that to be wrong: http://en.wikipedia.org/wiki/Checker_shadow_illusion. The cones are collecting the same signals for A and B in that illusion, but the neurons in the eye are comparing each pixel with its neighbors: "This pixel is bluer than the one next to it, so I am going to call it a little bit blue."

Definition #3: Color is defined by the signals leaving the eye.

This doesn't quite explain the checker illusion. The lower part of the brain pulls a lot of tricks in an attempt to interpret a visual scene. In the checker example, the lower brain interprets each of the squares of the checkerboard as an object, and simplifies things by saying that the variation in signal intensity is due to shading, and not due to any properties of the squares themselves.

Definition #4: Color is defined by the interpretation provided by the lower brain, after segmenting the scene into distinct objects.

Still not there yet. The thing that is (likely) the point of confusion in the dress image is that the brain actively seeks a white point from which to judge color. Look at a newspaper. What color is it? White. Now lay a piece of ultra-bright white copy paper next to it. What color is the newspaper now? It turned dingy. Your brain first used the newspaper as a white point, and made its color assessment based on that. When the copy paper was introduced, your brain picked up a different definition of white to compare things to.

In the dress picture, you will note that the upper right portion is saturated. This is confusing to the brain, since the auto-ranging in the eyeball doesn't normally allow things to saturate. Had we been looking at the actual scene, our eye would have scaled things so that we could see the bright area, and we would not have been able to tell the color of the dress.

How does the brain interpret the saturation that happened in the camera? Does it see that as the white point and assess from there? Or does it come up with another brilliant (pun intended) explanation, and set its white point to something beyond (255, 255, 255)? This is a guess, but I think that different brains might set different white points.
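To make the white point idea concrete, here is a crude numerical sketch. Real chromatic adaptation happens in cone space rather than RGB, and the pixel and white point values below are made up for illustration. But it shows how one pixel can read as white-ish or blue-ish depending on which white point the brain settles on:

    import numpy as np

    # A bluish-gray pixel, hypothetical but in the spirit of the dress fabric
    pixel = np.array([130.0, 120.0, 160.0])

    # Two candidate white points: brain A takes the saturated region as white;
    # brain B decides the scene is lit by a bluish light (values invented)
    candidate_whites = {
        "brain A": np.array([255.0, 255.0, 255.0]),
        "brain B": np.array([180.0, 170.0, 230.0]),
    }

    for name, white in candidate_whites.items():
        balanced = pixel / white                    # von Kries-style scaling
        balanced = balanced / balanced.max() * 255  # renormalize for display
        print(name, np.round(balanced))

Brain A ends up at roughly (207, 191, 255), distinctly bluish; brain B lands at about (255, 249, 246), essentially white. Same pixel, two white points, two dresses.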


Objects don't have an inherent color. Color is a subtle interplay between the light hitting an object, the light reflected from the object, the spectral response and auto-ranging of the cones, the low-level segmentation into distinct objects, and the interpretation of white point.

So what's my answer? The question is a silly question. Dresses do not have any inherent color. The better question is "what color do you see when you look at the dress?" That question apparently depends on the viewer. Color is in the eye, er, brain of the beholder.


Wednesday, February 18, 2015

How many samples do I need? (with lots of assumptions)

In an earlier post, I looked at the question "How many samples of a production run do I need to assure that 68% are within tolerance?" I concluded with "at least 300, and preferably 1,200." In that first pass, I made only one assumption - that the sampling of parts to test was done at random. I answered the question with no information about the metric or about the process.

For my answer, I reduced the question down to a simple one. Suppose that a green M&M is put into a bucket/barrel/swimming pool whenever a good part is produced, and that a red M&M is put in whenever a bad part is produced. For the testing of this production run, M&Ms are pulled out at random, and noted as being either good or bad. After tallying the color of each M&M, they are replaced into the bucket/barrel/swimming pool. 


Note the assumptions. I assume that the production run is sampled at random with replacement. And that's about it.
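In code, the M&M protocol is just sampling with replacement. A minimal sketch, with a hypothetical 68/32 split in the bucket:

    import random

    # A production run as a bucket of M&Ms: 68% good, 32% bad (hypothetical)
    bucket = ["good"] * 68 + ["bad"] * 32

    # random.choices() samples WITH replacement - the same M&M can be pulled twice
    pulls = random.choices(bucket, k=300)
    print(f"{pulls.count('good') / len(pulls):.1%} of the pulls were good")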

Statistics of color difference values

Today I answer the question with a whole lot of additional assumptions. Today I assume that the metric being measured and graded is the color difference value, in units of ΔE. And I make some assumptions about the statistical nature of color differences.

I assembled real-world data to create an archetypal cumulative probability density function (CPDF) of color difference data from a collection of 262 color difference data sets, each with 300 to 1,600 data points. In total, my result is a distillation of 317,667 color differences from 201 different print devices, including web offset, coldset, flexo, and ink jet printers. So, a lot of data was reduced to a set of 101 percentile points shown in the image below. Note that this curve has been normalized to have a median value of 1.0 ΔE, on the assumption that all the curves have the same shape, but differ in scale.

Archetypal cumulative probability density function for color difference data (ΔEab)

For my analysis, it is assumed that all color difference data has this same shape. Note that if one has a data set of color difference data, it can be transformed to relate to this archetype by dividing all the color difference values by the median of the data set. In my analysis of the 262 data sets, this may not have been an excellent assumption, but then again, it was not a bad assumption.

The archetypal curve is based on data from printing test targets, each with hundreds of CMYK values, and not from production runs of 10,000 copies of a single CMYK patch. For this analysis, I make the assumption that run-time color differences behave kinda the same. I've seen data from a couple three press runs. I dunno, might not be such a good assumption.

Let's see... are there any other assumptions that I am making today? Oh yeah... I have based the archetypal CPDF on color difference data computed with the original 1976 ΔE formula, and not the 2000 version. Today, I don't know how much of a difference this makes. Some day, I might know.

Monte Carlo simulation of press runs

I did some Monte Carlo simulations with all the aforementioned assumptions. I was asking a variation on the question asked in the previous blog. Instead of asking how many samples were needed to make a reliable pass/fail call, I asked how many samples were needed to get a reliable estimate of the 68th percentile. Subtle difference, but that's the nature of statistics.

As in the previous blog, I will start with the example of the printer who pulls only three samples and from these three, determines the 68th percentile. I'm not sure just how you get a 68th percentile from only three samples, but somehow when I use the PERCENTILE function in Excel or the Quantile function in Mathematica, they give me a number. I assume that the number means something reasonable.

Now for a couple more assumptions. I will assume that the tolerance threshold is 4 ΔE (in other words, 68% must be less than 4 ΔE), and that the printer is doing a pretty decent job of holding this - 68% of the samples are below 3.5 ΔE. One would hope that the printer gets the thumbs up on the job almost all the time, right?

Gosh, that would be nice, but my Monte Carlo simulation says that this just ain't gonna happen. I ran the test 10,000 times. Each time, I drew three random samples from the archetypal CPDF shown above. From those, I calculated a 68th percentile. The histogram below shows the distribution of the 68th percentiles determined this way. Nearly 55% of the press runs were declared out of tolerance.

Distribution of estimates for the 68th percentile, determined from 3 random samples
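For the curious, here is a sketch of the sort of Monte Carlo simulation I ran. Since I am not publishing the actual 101-point archetype, the code below fakes a plausibly shaped CPDF with a lognormal distribution, so the percentage it prints will be in the neighborhood of my 55% figure rather than exactly on it:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for the archetypal CPDF: 101 percentile points, median 1.0 ΔE.
    # (The real curve was distilled from 317,667 measurements; this lognormal
    # just has roughly the right skewed shape.)
    q = np.linspace(0.0, 1.0, 101)
    archetype = np.quantile(rng.lognormal(mean=0.0, sigma=0.6, size=100_000), q)
    archetype /= np.quantile(archetype, 0.50)

    # Scale so that the press run's true 68th percentile is 3.5 ΔE
    scale = 3.5 / np.interp(0.68, q, archetype)

    n, tolerance, trials = 3, 4.0, 10_000
    fails = 0
    for _ in range(trials):
        # Inverse-transform sampling: uniform percentiles -> ΔE values
        samples = scale * np.interp(rng.uniform(size=n), q, archetype)
        # np.percentile interpolates the same way as Excel's PERCENTILE
        if np.percentile(samples, 68) > tolerance:
            fails += 1

    print(f"{100 * fails / trials:.1f}% of good press runs declared out of tolerance")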

There is something just a tad confusing here. The assumption was that the press runs had a 68th percentile of 3.5 ΔE. Wouldn't you expect that at least 50% of the runs were in tolerance? Yes, I think you might, but note two things: First, the distribution above is not symmetrical. Second, as I said before, determining the 68th percentile of a set of three data points is a bit of a slippery animal.

When this printer saw how many were failing, he asked for my advice. I pointed him to my previous blog, and he said "1200?!?!?  Are you kidding me!?!?  I can't even afford to measure 300 samples!" He ignored me, and never paid me my $10,000 consulting fee, but I heard through the grapevine that he did start pulling 30 samples. That's why I get paid the big bucks. So people can ignore my advice. 

Distribution of estimates for the 68th percentile, determined from 30 random samples

The image above shows what happened when he started measuring the color error on 30 samples per press run. Much better. Now only about 13% of the press runs are erroneously labelled "bad product". What happened after that depended on how sharp the teeth were in the contract between the printer and the print buyer. Maybe the print buyer just shrugged it off when one out of every eight print runs was declared out of tolerance? Maybe there's a lawsuit pending? I don't know. That particular printer never called me up with a status report.

What if the printer had heeded my advice and started pulling 300 samples to determine the 68th percentile? The results from one last Monte Carlo experiment are shown below. Here the printer pulled all 300 samples that I asked for. At the end of 10,000 press runs, the printer had only three examples where a good press run was called "bad". 

Distribution of estimates for the 68th percentile, determined from 300 random samples

Print buyer's perspective

The previous examples were from the printer's perspective, where the printer responds with self-righteous indignation when the sadistical process control has the gall to say that a good run is bad. We now turn this around and look at the print buyer's perspective.

Let's say that a printer is doing work that is not up to snuff... I dunno... let's say that the 68th percentile is at 4.5 ΔE. If the print buyer is a forgiving sort, then maybe this is OK by him. But then again, maybe his wife might tell him to stop being such a doormat? (I am married to a woman who tells her spouse that all the time, especially when it comes to clients not paying.) We can't simulate what this print buyer's wife will tell him, but we can simulate how often statistical process control will erroneously tell him that a 4.5 ΔE run was good.

The results are similar, as I guess we would expect. If your vision of "statistical process control" means three samples, then 21.1% of the bad jobs will be given the rubber stamp of approval. The printer may like that, but I don't think the print buyer's spouse will stand for it.

If you up the sampling to 10 samples, quite paradoxically, the rate of mis-attribution goes up to 35.7%. That darn skewed distribution.

Pulling thirty samples doesn't help a great deal either. With 30 samples, the erroneous use of the "approved" stamp goes down only to 15.7%. If the count is increased to 100, then about 4.7% of the bad runs are called "good". But when 300 samples are pulled, the number drops way down to 0.06%.

Conclusions

I ran the simulation with a number of different sample sizes and a number of different underlying levels of "quality of production run".  The results are below. The percentages are the probability of making a wrong decision. In the first three lines of the table (3.0 ΔE to 3.75 ΔE), this is the chance that a good job will be called bad. In the next three lines of the table, this is the chance that a bad job will be called good.

Actual 68th    N = 3    N = 10    N = 30    N = 100    N = 300
3.0 ΔE         37.0%    4.0%      0.6%      0.0%       0.0%
3.5 ΔE         54.6%    18.1%     12.9%     1.5%       0.0%
3.75 ΔE        61.1%    29.0%     30.2%     13.2%      4.8%
4.25 ΔE        25.9%    47.5%     29.7%     20.9%      5.1%
4.5 ΔE         21.1%    35.7%     15.7%     4.7%       0.1%
5.0 ΔE         13.1%    19.6%     2.9%      0.0%       0.0%

Calculation of this table is a job for an applied math guy. Interpreting the table is a job for a statistician, which is at the edge of my competence. Deciding how to use this table is beyond my pay grade. It depends on how comfortable you are with the various outcomes. If, as a printer, you are confident that your process has a 68th percentile of 3.0 ΔE or less, then 30 samples should prove that point. And if your process slips a bit to the 3.5 ΔE level, and you are cool with having one out of eight of these jobs recalled, then don't let no one talk you into more than 30 samples. If you don't want those jobs recalled, though...

If, as a print buyer, you really have no intention of cracking down on a printer until they hit the 5 ΔE mark, then you may be content with 30 samples. But if you want to have some teeth in the contract when a printer goes over 4.5 ΔE, then you need to demand at least 100 samples.

First addendum

You will note that my answer was a little different than the previous blog post, where I made minimal assumptions. If I make all the assumptions that are in this analysis, then the number of samples required (to demonstrate that 68% of the colors are within a threshold color difference) is smaller than the previous blog might have suggested. Then again, that one word ("assume", and its derivatives) appears on this page 22 times...

Second addendum

In the first section, I mentioned "sampling with replacement", which means that you might sample a given product twice. Kind of a waste of time, really. Especially for small production runs, where the likelihood of duplicated effort is larger. Taken to the extreme, my conclusion was clearly absurd. Do I really need to pull 300 samples for my run of 50 units?!!?!?!

Well, no. Clearly one would sample a production run without replacement. But in my world, a production run of 10,000 units is on the small side, so I admit to a bit of myopic vision. For the purposes of this discussion, if the production run is over 10,000, it doesn't matter a whole lot whether a few of the 1,200 samples are measured twice.
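In Python terms, the fix is to trade random.choices() for random.sample(), which draws without replacement - and which, fittingly, flat-out refuses to pull 300 samples from a run of 50:

    import random

    run_of_50 = list(range(50))     # a small production run

    # random.sample() draws WITHOUT replacement: every pulled unit is distinct
    pulled = random.sample(run_of_50, k=30)
    print(len(set(pulled)), "distinct units pulled")

    # random.sample(run_of_50, k=300) would raise:
    # ValueError: Sample larger than population or is negative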

Wednesday, February 4, 2015

How many samples do I need?

Simple question:  If I am sampling a production run to verify tolerances, how many production pieces do I need to sample?

It's an easily stated question, and also an important one. Millions of dollars may be at stake if a production run has to be scrapped or if the customer has to be reimbursed for a run that was out of tolerance (so-called "makegoods"). On the other side, the manufacturer may need to spend tens or hundreds of thousands of dollars on equipment to perform inspection.

For certain manufactured goods, 100% compliance is required. The cost of delivering a bad Mercedes, pharmaceutical, or lottery ticket is very high, so pretty much every finished good has to be inspected. But in most cases, the cost of a few bad goods is not that great. If a few cereal boxes burst because of a bad glue seal, or if a page in the Color Scientist Monthly is smeared, how bad can that be? It's a calculated risk of product waste versus inspection cost.

Now if the foldout featuring the PlayColor of the Month is smeared, that's another story.

I volunteer to do 100% inspection of the taste of these products!

In the world I live in - printing - contracts often stipulate a percentage of product that must be within a given tolerance. This is reflected in the ISO standards. I have pointed out previously that ISO 12647-2 requires 68% of the color control patches within a run be within a specified tolerance. The thought is, if we get 68% of the samples within a "pretty good" tolerance, then 95% will be within a "kinda good" tolerance. All that bell curve kinda stuff.

A press run may have tens of thousands or even millions of impressions. Clearly you don't need to sample all of the control patches in the press run in order to establish the 68%, but how many samples are needed to get a good guess?

Maybe three samples?

Keeping things simple, let's assume that I pull three samples from the run, and measure those. There are four possible outcomes: all three might be in compliance, two of the three might be in compliance, only one may be in compliance, or none of the samples might be in compliance. I'm going to cheat just a tiny bit, and pretend that if two or more of the three pass, then I am in compliance. That's 66.7% versus 68%. It's an example. Ok?

I am also going to assume that random sampling is done, or more accurately, that the sampling is done in such a way that the variations in the samples are independent. Note that pulling three samples in a row almost certainly violates this. Sampling at the end of each batch, roll, or work shift probably also violates this. And at the very least, the samples must be staggered through the run. 

Under those assumptions, we can start looking at the likelihood of different outcomes. The table below shows the eight possible outcomes, and the ultimate diagnosis of the production run. 

Sample 1       Sample 2       Sample 3       Run diagnosis   Probability
Not so good    Not so good    Not so good    Fail            (1-p)³
Not so good    Not so good    Good           Fail            p(1-p)²
Not so good    Good           Not so good    Fail            p(1-p)²
Not so good    Good           Good           Pass            p²(1-p)
Good           Not so good    Not so good    Fail            p(1-p)²
Good           Not so good    Good           Pass            p²(1-p)
Good           Good           Not so good    Pass            p²(1-p)
Good           Good           Good           Pass            p³

Four of the possibilities show that the run was passed, and four show it failing, but this is not to say that there is a 50% chance of passing. The possible outcomes are not equally likely. It depends on the probability that any particular sample is good. If, for example, the production run were overwhelmingly in compliance (as one would hope), the probability that all three samples would come up good is very high.

The right-most column helps us quantify this. If the probability of pulling a good sample is p, then the probability of pulling three good samples is p³. From this, we can quantify the likelihood that we will get at least the requisite two good samples out of three to qualify the production run as good.

     Probability of OK-ing the run based on three samples = p²(1-p) + p²(1-p) + p²(1-p) + p³ = 3p²(1-p) + p³
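The same bookkeeping in Python, generalized to "at least k good samples out of n". As a check, it reproduces the 35.2% figure quoted below for a run that is only 40% in tolerance:

    from math import comb

    # P(at least `needed` of `n` samples are good), each good with probability p
    def prob_pass(p: float, n: int = 3, needed: int = 2) -> float:
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(needed, n + 1))

    for p in (0.40, 0.68, 0.80):
        print(f"p = {p:.2f}: chance of OK-ing the run = {prob_pass(p):.1%}")
    # p = 0.40 gives 35.2%; p = 0.80 gives 89.6%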

Things start going bad

What could possibly go wrong?  We have proper random sampling, and we have a very official looking formula.

Actually, two different things could go wrong. First off, the production run might be perfectly good, but, by luck of the draw, two or three bad samples were drawn. I'm pretty sure the manufacturer wouldn't like that. 

The other thing that could go wrong is that the production run was actually out of tolerance (more than one-third of the pieces were bad), but this time Lady Tyche (the Goddess of Chance) favored the manufacturer. The buyer probably wouldn't like that.

From the formula above, we can plot the outcomes as a function of the true percentage that were in tolerance. The plot conveniently shows the four possibilities: correctly rejected, incorrectly rejected, correctly accepted, and incorrectly accepted.

Outcomes when 3 samples are used to test for conformance

Looking at the plot, we can see that if 40% of the widgets in the whole run were in tolerance, then there is a 35.2% chance that the job will be given the thumbs up, and consequently a 64.8% chance of being given the thumbs down, as it should. The manufacturers who are substandard will be happy that they still have a fighting chance if the right samples are pulled for testing. This, of course, is liable to be a bit disconcerting for the folks who buy these products.

But, the good manufacturers will bemoan the fact that even when they do a stellar job of getting 80% of the widgets widgetting properly, there is still a chance of more than 10% that the job will be kicked out.

Just in case you were wondering, the area of the red region (representing incorrect decisions) is 22.84%. That seems like a pretty good way to quantify the efficacy of deciding about the run based on three samples.
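Here is a sketch of how that area can be computed: integrate the probability of a wrong call over all possible true compliance rates, taken as uniform (the same assumption spelled out in "The Moral of the Story" below). One caveat: for larger n I am guessing that the pass rule is "at least two-thirds of the samples are good", extending the 2-of-3 rule used here, so my numbers for 30 and 300 samples may land slightly off the figures quoted later.

    import numpy as np
    from scipy.stats import binom

    def misattribution_area(n: int, cutoff: float = 2 / 3) -> float:
        # Average probability of a wrong call over true compliance rates p,
        # with p assumed uniform on [0, 1]
        needed = int(np.ceil(cutoff * n))        # samples that must be good to pass
        p = np.linspace(0.0, 1.0, 100_001)
        prob_pass = binom.sf(needed - 1, n, p)   # P(at least `needed` good samples)
        # A wrong call: passing a truly bad run (p < cutoff), failing a good one
        wrong = np.where(p < cutoff, prob_pass, 1.0 - prob_pass)
        return float(wrong.mean())               # mean over a fine grid ~ integral

    for n in (3, 30, 300):
        print(f"n = {n:3d}: misattribution area = {misattribution_area(n):.2%}")
    # n = 3 reproduces the 22.84% figure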

How about 30 samples?

Three samples does sound a little skimpy -- even for a lazy guy like me. How about 30? The Seymourgraph for 30 samples is shown below. It does look quite a bit better... not quite so much of the bad decision making, especially when it comes to wrongly accepting lousy jobs. Remember the manufacturer who got away with shipping lots that were only 40% in tolerance one in three times? If he is required to sample 30 products to test for compliance, all of a sudden his chance of getting away with this drops way down to 0.3%. Justice has been served!

Outcomes when 30 samples are used to test for conformance

And at the other end, the stellar manufacturer who is producing 80% of the products in tolerance now has only a 2.6% chance of being unfairly accused of shoddy merchandise. That's better, but if I were a stellar manufacturer, I would prefer not to get called out on the carpet once out of 40 jobs. I would look into doing more sampling so I could demonstrate my prowess.

The area of the red curve is now 6.95%, by the way. I'm not real sure what that means. It kinda means that the mistake rate is about 7%, but you gotta be careful. The mistake rate for a particular factory depends on the percentage that are produced within tolerance. This 7% is the mistake rate for pulling 30 samples, averaged over all possible factories.

I am having a hard time getting my head around that, but it still strikes me that this is a decent way to measure the efficacy of pulling 30 samples.

How about 300 samples?

So... thirty samples feels like a lot of samples, especially for a lazy guy like me. I guess if it was part of my job, I could deal with it. But as we saw in the last section, it's probably not quite enough. Misdiagnosing the run 7% of the time sounds a bit harsh. 

Let's take it up a notch to 300 samples. The graph, shown below, looks pretty decent. The mis-attributions occur only between about 59% and 72%. One could make the case that, if a production facility is cutting it that close, then it might not be so bad for them to be called out on the carpet once in a while.

Outcomes when 300 samples are used to test for conformance

Remember looking at the area of the red part of the graph... the rate of mis-attributions?  The area was 22.84% when we took 3 samples. It went down to 6.95% with 30 samples. With 300 samples, the mis-attribution rate goes down to 2.17%. 

The astute reader may have noticed that each factor of ten increase in the number of samples decreased the mis-attribution rate by roughly a factor of three. In general, one would expect the mis-attribution rate to drop with the square root of the number of samples: multiplying the sampling rate by ten divides the mis-attribution rate by the square root of ten, which is about 3.16. Check it: 22.84% / 3.16 ≈ 7.2%, close to the 6.95% we got with 30 samples, and 6.95% / 3.16 ≈ 2.2%, close to the 2.17% we got with 300.

If our goal is to bring the mis-attribution rate down to 1%, we would need to pull about 1,200 samples. While 300 samples is beyond my attention span, 1,200 samples is way beyond my attention span. Someplace in there, the factory needs to consider investing in some automated inspection equipment.

The Squishy Answer

So, how many samples do we need?

That's kind of a personal question... personal in that it requires a bit more knowledge. If the production plant is pretty darn lousy - let's say only 20% of the product is within tolerance - then you don't need many samples to establish the foregone conclusion. Probably more than 3 samples, but the writing is on the wall before 30 samples have been tested. Similarly, if the plant is stellar, and produces product that is in tolerance 99% of the time, then you won't need a whole lot of samples to statistically prove that at least 68% are within tolerance.

Then again, if you actually knew that the plant was producing 20% or 99% of the product in tolerance, then you wouldn't need to do any sampling, anyway. The only reason we are doing sampling is because we don't know.

The question gets a little squishy as you get close to the requested percentage. If your plant is consistently producing 68.1% of the product in tolerance, you would need to do a great deal of sampling to prove to a statistician that the plant was actually meeting the 68% in tolerance quota.

So... you kinda have to consider all possibilities. Go in without any expectations about the goodness of the production plant. Assume that the actual compliance rate could be anything.

The Moral of the Story 

If I start with the assumption that the production run could produce anywhere between 0% and 100% of the product in tolerance, and that each of these is equally likely, then if I take around 1,200 samples, I have about a 99% chance of correctly determining if 68% of the run is in tolerance. 

If you find yourself balking at that amount of hand checking, then it's high time you looked into some automated testing.