
Wednesday, February 4, 2015

How many samples do I need?

Simple question:  If I am sampling a production run to verify tolerances, how many production pieces do I need to sample?

It's an easily stated question, and also an important one. Millions of dollars may be at stake if a production run has to be scrapped or if the customer has to be reimbursed for a run that was out of tolerance (so-called "makegoods"). On the other side, the manufacturer may need to spend tens or hundreds of thousands of dollars on equipment to perform inspection.

For certain manufactured goods, 100% compliance is required. The cost of delivering a bad Mercedes, pharmaceutical, or lottery ticket is very high, so pretty much every finished good has to be inspected. But in most cases, the cost of a few bad goods is not that great. If a few cereal boxes burst because of a bad glue seal, or if a page in the Color Scientist Monthly is smeared, how bad can that be? It's a calculated risk of product waste versus inspection cost.

Now if the foldout featuring the PlayColor of the Month is smeared, that's another story.

I volunteer to do 100% inspection of the taste of these products!

In the world I live in - printing - contracts often stipulate a percentage of product that must be within a given tolerance. This is reflected in the ISO standards. I have pointed out previously that ISO 12647-2 requires 68% of the color control patches within a run be within a specified tolerance. The thought is, if we get 68% of the samples within a "pretty good" tolerance, then 95% will be within a "kinda good" tolerance. All that bell curve kinda stuff.

A press run may have tens of thousands or even millions of impressions. Clearly you don't need to sample all of the control patches in the press run in order to establish the 68%, but how many samples are needed to get a good guess?

Maybe three samples?

Keeping things simple, let's assume that I pull three samples from the run, and measure those. There are four possible outcomes: all three might be in compliance, two of the three might be in compliance, only one may be in compliance, or none of the samples might be in compliance. I'm going to cheat just a tiny bit, and pretend that if two or more of the three pass, then I am in compliance. That's 66.7% versus 68%. It's an example. Ok?

I am also going to assume that random sampling is done, or more accurately, that the sampling is done in such a way that the variations in the samples are independent. Note that pulling three samples in a row almost certainly violates this. Sampling at the end of each batch, roll, or work shift probably also violates this. And at the very least, the samples must be staggered through the run. 

Under those assumptions, we can start looking at the likelihood of different outcomes. The table below shows the eight possible outcomes, and the ultimate diagnosis of the production run. 

Sample 1      Sample 2      Sample 3      Run diagnosis   Probability
Not so good   Not so good   Not so good   Fail            (1-p)³
Not so good   Not so good   Good          Fail            p(1-p)²
Not so good   Good          Not so good   Fail            p(1-p)²
Not so good   Good          Good          Pass            p²(1-p)
Good          Not so good   Not so good   Fail            p(1-p)²
Good          Not so good   Good          Pass            p²(1-p)
Good          Good          Not so good   Pass            p²(1-p)
Good          Good          Good          Pass            p³

Four of the possibilities show that the run was passed, and four show it failing, but this is not to say that there is a 50% chance of passing. The possible outcomes are not equally likely. It depends on the probability that any particular sample is good. If, for example, the production run were to be overwhelmingly in compliance (as one would hope), the probability that all three samples would come up good is very high.

The right-most column helps us quantify this. If the probability of pulling a good sample is p, then the probability of pulling three good samples is p³. From this, we can quantify the likelihood that we will get at least the requisite two good samples out of three to qualify the production run as good.

     Probability of ok-ing the run based on three samples = p²(1-p) + p²(1-p) + p²(1-p) + p³ = 3p²(1-p) + p³
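For those who would rather let a computer do the arithmetic, here is a minimal sketch (my own illustration, not from the original post) of that formula in Python, using the 2-of-3 passing rule described above:

    # Probability that the 3-sample check OKs the run, i.e. at least 2 of the
    # 3 samples are in tolerance, when each sample independently has
    # probability p of being in tolerance.
    def prob_pass_three(p):
        return 3 * p**2 * (1 - p) + p**3

    for p in (0.4, 0.8):
        print(f"p = {p:.2f}: chance the run is OK'd = {prob_pass_three(p):.1%}")
    # p = 0.40: chance the run is OK'd = 35.2%
    # p = 0.80: chance the run is OK'd = 89.6%

Those two values show up again below: the 35.2% chance for a 40%-in-tolerance run, and the roughly 10% rejection rate for an 80%-in-tolerance run.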

Things start going bad

What could possibly go wrong?  We have proper random sampling, and we have a very official looking formula.

Actually, two different things could go wrong. First off, the production run might be perfectly good, but, by luck of the draw, two or three bad samples were drawn. I'm pretty sure the manufacturer wouldn't like that. 

The other thing that could go wrong is that the production run was actually out of tolerance (more than one-third of the pieces were bad), but this time Lady Tyche (the Goddess of Chance) favored the manufacturer. The buyer probably wouldn't like that.

From the formula above, we can plot the outcomes as a function of the true percentage that were in tolerance. The plot conveniently shows the four possibilities: correctly rejected, incorrectly accepted, correctly accepted, and incorrectly rejected. 

Outcomes when 3 samples are used to test for conformance

Looking at the plot, we can see that if 40% of the widgets in the whole run were in tolerance, then there is a 35.2% chance that the job will be given the thumbs up, and consequently a 64.8% chance of being given the thumbs down, as it should be. The manufacturers who are substandard will be happy that they still have a fighting chance if the right samples are pulled for testing. This of course is liable to be a bit disconcerting for the folks that buy these products.

But, the good manufacturers will bemoan the fact that even when they do a stellar job of getting 80% of the widgets widgetting properly, there is still a chance of more than 10% that the job will be kicked out.

Just in case you were wondering, the area of the red (representing incorrect decisions) is 22.84%. That seems like a pretty good way to quantify the efficacy of deciding about the run based on three samples.

How about 30 samples?

Three samples does sound a little skimpy -- even for a lazy guy like me. How about 30? The Seymourgraph for 30 samples is shown below. It does look quite a bit better... not quite so much of the bad decision making, especially when it comes to wrongly accepting lousy jobs. Remember the manufacturer who got away with shipping lots that were only 40% in tolerance one in three times? If he is required to sample 30 products to test for compliance, all of a sudden his chance of getting away with this drops way down to 0.3%. Justice has been served!

Outcomes when 30 samples are used to test for conformance

And at the other end, the stellar manufacturer who is producing 80% of the products in tolerance now has only a 2.6% chance of being unfairly accused of shoddy merchandise. That's better, but if I were a stellar manufacturer, I would prefer not to get called out on the carpet once out of 40 jobs. I would look into doing more sampling so I could demonstrate my prowess.
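Here is a hedged sketch of how those two numbers can be checked with the binomial distribution. The post does not spell out the pass threshold for 30 samples; I am assuming the 2-of-3 rule scales up to "at least two-thirds of the samples in tolerance" (20 of 30), which lands close to the 0.3% and 2.6% quoted above.

    from scipy.stats import binom

    def prob_pass(p, n, needed):
        """P(at least `needed` of n independent samples are in tolerance)."""
        return binom.sf(needed - 1, n, p)

    # A 40%-in-tolerance run slipping past a 30-sample check (assumed threshold: 20 of 30)
    print(f"{prob_pass(0.40, 30, 20):.2%}")      # about 0.3%

    # An 80%-in-tolerance run being wrongly rejected
    print(f"{1 - prob_pass(0.80, 30, 20):.2%}")  # about 2.6%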

The area of the red curve is now 6.95%, by the way. I'm not real sure what that means. It kinda means that the mistake rate is about 7%, but you gotta be careful. The mistake rate for a particular factory depends on the percentage that are produced to within a tolerance. This 7% mistake rate has to do with the mistake rate for pulling 30 samples over all possible factories. 

I am having a hard time getting my head around that, but it still strikes me that this is a decent way to measure the efficacy of pulling 30 samples.

How about 300 samples?

So... thirty samples feels like a lot of samples, especially for a lazy guy like me. I guess if it was part of my job, I could deal with it. But as we saw in the last section, it's probably not quite enough. Misdiagnosing the run 7% of the time sounds a bit harsh. 

Let's take it up a notch to 300 samples. The graph, shown below, looks pretty decent. The mis-attributions occur only between about 59% and 72%. One could make the case that, if the condition of the production facility is cutting it that close, then it might not be so bad for them to be called out on the carpet once in a while.

Outcomes when 300 samples are used to test for conformance

Remember looking at the area of the red part of the graph... the rate of mis-attributions?  The area was 22.84% when we took 3 samples. It went down to 6.95% with 30 samples. With 300 samples, the mis-attribution rate goes down to 2.17%. 

The astute reader may have noticed that each factor of ten increase in the number of samples decreases the mis-attribution rate by about a factor of three. In general, one would expect the mis-attribution rate to drop with the square root of the number of samples. Multiplying the sampling rate by ten will decrease the mis-attribution rate by the square root of ten, which is about 3.16.

If our goal is to bring the mis-attribution rate down to 1%, we would need to pull about 1,200 samples. While 300 samples is beyond my attention span, 1,200 samples is way beyond my attention span. Someplace in there, the factory needs to consider investing in some automated inspection equipment.
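Here is a sketch, under stated assumptions, of how the area of the red could be computed: the chance of a wrong call, averaged over a uniform prior on the true in-tolerance fraction p, with the quality boundary and the pass threshold both set at two-thirds to match the 2-of-3 rule. With n = 3 this reproduces the 22.84% quoted above, and it comes out close to the 6.95% and 2.17% for 30 and 300 samples.

    import numpy as np
    from scipy.stats import binom

    def misattribution_rate(n, boundary=2/3, steps=20_000):
        """Average probability of a wrong pass/fail call over a uniform prior on p."""
        needed = int(np.ceil(boundary * n))       # samples that must be in tolerance
        p = (np.arange(steps) + 0.5) / steps      # midpoints of slices of the true rate
        p_pass = binom.sf(needed - 1, n, p)       # chance the check passes the run
        wrong = np.where(p < boundary, p_pass, 1.0 - p_pass)
        return wrong.mean()

    for n in (3, 30, 300):
        print(n, f"{misattribution_rate(n):.2%}")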

The Squishy Answer

So, how many samples do we need?

That's kind of a personal question... personal in that it requires a bit more knowledge. If the production plant is pretty darn lousy -- let's say only 20% of the product is within tolerance -- then you don't need many samples to establish the foregone conclusion. Probably more than 3 samples, but the writing is on the wall before 30 samples have been tested. Similarly, if the plant is stellar, and produces product that is in tolerance 99% of the time, then you won't need a whole lot of samples to statistically prove that at least 68% are within tolerance.

Then again, if you actually knew that the plant was producing 20% or 99% of the product in tolerance, then you wouldn't need to do any sampling, anyway. The only reason we are doing sampling is because we don't know.

The question gets a little squishy as you get close to the requested percentage. If your plant is consistently producing 68.1% of the product in tolerance, you would need to do a great deal of sampling to prove to a statistician that the plant was actually meeting the 68% in tolerance quota.

So... you kinda have to consider all possibilities. Go in without any expectations about the goodness of the production plant. Assume that the actual compliance rate could be anything.

The Moral of the Story 

If I start with the assumption that the production run could produce anywhere between 0% and 100% of the product in tolerance, and that each of these is equally likely, then if I take around 1,200 samples, I have about a 99% chance of correctly determining if 68% of the run is in tolerance. 

If you find yourself balking at that amount of hand checking, then it's high time you looked into some automated testing.

Wednesday, November 14, 2012

Assessing color difference data

The punch line

For those who are too impatient to wade through the convoluted perambulations of a slightly senile math guy, and for those who already understand the setting of this problem, I will cut to the punch line. When looking at a collection of color difference data (ΔE values), it makes no difference whether you look at the median, the 68th, 90th, 95th, or 99th percentile. You can do a pretty darn good job of describing the statistical distribution with just one number. The maximum color difference, on the other hand, is in a class by itself.

Cool looking function that has something to do with this blog post, copied,
without so much as even dropping an email, from my friend Steve Viggiano 

In the next section of this techno-blog, I explain what that all means to those not familiar with statistical analysis of color data. I give permission to those who are print and color savvy to skip to the final section. In this aforementioned final section, I describe an experiment that provides rather compelling evidence for the punch line that I started out with.

Hypothetical situation

Suppose for the moment that you are in charge of QC for a printing plant, or that you are a print buyer who is interested in making sure the proper color is delivered. Given my readership, I would expect that this might not be all that hard for some of you to imagine.

If you are in either of those positions, you are probably familiar with the phrase "ΔE", pronounced "delta E". You probably understand that this is a measurement of the difference between two colors, and that 1 ΔE is pretty small, and that 10 ΔE is kinda big. If you happen to be a color scientist, you probably understand that ΔE is a measurement of the difference between two colors, and that 1 ΔE is (usually) pretty small, and that 10 ΔE is (usually) kinda big [1].

Color difference example copied,
without so much as even dropping him an email, from my friend Dimitri Pluomidis

When a printer tries valiantly to prove his or her printing prowess to the print buyer, they will often print a special test form called a "test target". This test target will have some big number of color patches that span the gamut of colors that can be printed. There might be 1,617 patches, or maybe 928... it depends on the test target. Each of these patches in the test target has a target color value [2], so each of these printed patches has a color error that can be ascribed to it, each color error (ΔE) describing just how close the printed color is to reaching the target color.

An IT8 target

This test target serves to demonstrate that the printer is capable of producing the required colors, at least once. For day-to-day work, the printer may use a much smaller collection of patches (somewhere between 8 and 30) to demonstrate continued compliance to the target colors. These can be measured through the run. For an 8 hour shift, there might be on the order of 100,000 measurements. Each of these measurements could have a ΔE associated with it.

If the printer and the print buyer have a huge amount of time on their hands because they don't have Twitter accounts [3], they might well fancy having a look at all the thousands of numbers, just to make sure that everything is copacetic. But I would guess that  if the printers and print buyers have that kind of time on their hands, they might prefer watching reruns of Andy Griffith on YouTube, doing shots of tequila whenever Opie calls his father "paw".

But I think that both the printer and the print buyer would prefer to agree on a way to distill that big set of color error data down to a very small set of numbers (ideally a single number) that could be used as a tolerance. Below that number is acceptable, above that number is unacceptable.

It's all about distillation of data

But what number to settle on? When there is a lot at stake (as in bank notes, lottery tickets and pharmaceutical labels) the statistic of choice might be the maximum. For these, getting the correct print is vitally important. For cereal boxes and high class lingerie catalogs (you know which ones I am talking about), the print buyer might ask for the 95th percentile - 95% of the colors must be within a specified color difference ΔE. The printer might push for the average ΔE, since this number sounds less demanding. A stats person might go for the 68th percentile, purely for sentimental reasons.

How to decide? I had a hunch that it really didn't matter which statistic was chosen, so I devised a little experiment with big data to prove it.

The distribution of color difference data

Some people collect dishwasher parts, and others collect ex-wives. Me? I collect data sets [4]. For this blog post I drew together measurements from 176 test targets. Some of these were printed on a lot of different newspaper presses, some were from a lot of ink jet proofers, and some were printed with flexography. For each, I found a reasonable set of aim color values [5], and I computed a few metric tons of color difference values in ΔE00 [6].

Let's look at one set of color difference data. The graph represents the color errors from one test target with 1,617 different colors. The 1,617 color differences were then collected in a spreadsheet to make this CPDF (cumulative probability density function). CPDFs are not that hard to compute in a spreadsheet. Plunk the data into the first column, and then sort it from small to large. If you like, you can get the percentages on the graph by adding a second column to the spreadsheet that goes from 0 to 1. If you have this second column to the right, then the plot will come out correctly oriented.

Example of the CPDF of color difference data

This plot makes it rather easy from the chart to read off any percentile. In red, I have shown the 50th percentile - something over 1.4  ΔE00. If you are snooty, you might want to call this the median. In green, I have shown the 95th percentile - 3.0  ΔE00. If you are snooty, you might want to call this the 95th percentile.
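If spreadsheets aren't your thing, the same exercise takes a few lines of Python. This is just a sketch; the ΔE values below are made-up stand-ins for the 1,617 measurements.

    import numpy as np

    # Stand-in color differences; in practice this would be the measured dE00 values.
    dE = np.array([1.2, 0.8, 2.4, 1.6, 3.1, 0.9, 1.4, 2.0, 1.1, 2.7])

    x = np.sort(dE)                            # x-axis of the CPDF
    y = np.arange(1, len(dE) + 1) / len(dE)    # y-axis: cumulative fraction, 0 to 1

    print("median (50th percentile):", np.percentile(dE, 50))
    print("95th percentile:         ", np.percentile(dE, 95))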

Now that we understand how a CPDF plot works, let's have a look at some of the 176 CPDF plots that I have at my beck and call. I have 9 of them below.

Sampling of real CPDFs

One thing that I hope is apparent is that, aside from the rightmost two of them, they all have more or less the same shape. This is a good thing. It suggests that maybe our quest might not be for naught. If they are all alike, then I could just compute (for example) the median of my particular data set, and then just select the CPDF from the curves above which has the same median. This would then give me a decent estimate of any percentile that I wanted.

How good would that estimate be? Here is another look at some CPDFs from that same data set. I chose all the ones that had a median somewhere close to 2.4 ΔE00

Sampling of real CPDFs with median near 2.4

How good is this for an estimate? This says that if my median were 2.4 ΔE00, then the 90th percentile (at the extreme) might be anywhere from 3.4 to 4.6 ΔE00, but would likely be about 4.0 ΔE00.

I have another way of showing that data. The graph below shows the relationship between the median and 90th percentile values for all 176 data sets. The straight line on the graph is a regression line that goes through zero. It says that   90th percentile = 1.64 * median. I may be an overly optimistic geek, but I think this is pretty darn cool. Whenever I see an r-squared value of 0.9468, I get pretty excited. 

Ignore this caption and look at the title on the graph
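For the curious, a zero-intercept fit like the one in the graph takes only a few lines. The numbers below are hypothetical placeholders; the real fit used the median and 90th percentile of each of the 176 data sets.

    import numpy as np

    # Hypothetical (median, 90th percentile) pairs, one per data set.
    medians = np.array([1.4, 1.8, 2.1, 2.4, 3.0])
    p90s    = np.array([2.3, 2.9, 3.5, 4.0, 4.9])

    slope = np.sum(medians * p90s) / np.sum(medians**2)   # least squares, intercept forced to zero
    predicted = slope * medians
    r_squared = 1 - np.sum((p90s - predicted)**2) / np.sum((p90s - p90s.mean())**2)

    print(f"90th percentile = {slope:.2f} * median,  r-squared = {r_squared:.3f}")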

Ok... I anticipate a question here. "What about the 95th percentile? Surely that can't be all that good!" Just in case someone asks, I have provided the graph below. The scatter of points is broader, but the r-squared value (0.9029) is still not so bad. Note that the formula for this is 95th percentile = 1.84 * median.

Ignore this one, too

Naturally, someone will ask if we can take this to the extreme. If I know the median, how well can I predict the maximum color difference? The graph below should answer that question. One would estimate the maximum as being 2.8 times the median, but look at the r-squared value: 0.378. This is not the sort of r-squared value that gets me all hot and bothered.

Max does not play well with others

I am not surprised by this. The maximum of a data set is a very unstable metric. Unless there is a strong reason for using this as a descriptive statistic, it is not a good way to assess the "quality" of a production run. This sounds like the sort of thing I may elaborate on in a future blog post.

The table below tells how to estimate each of the deciles (and a few other delectable values) from the median of a set of color difference data. This table was generated strictly empirically, based on 176 data sets at my disposal. For example, the 10th percentile can be estimated by multiplying the median by 0.467.  This table, as I have said, is based on color differences between measured and aim values on a test target [7].

P-tile   Multiplier   r-squared
10       0.467        0.939
20       0.631        0.974
30       0.762        0.988
40       0.883        0.997
50       1.000        1.000
60       1.121        0.997
68       1.224        0.993
70       1.251        0.991
80       1.410        0.979
90       1.643        0.947
95       1.840        0.903
99       2.226        0.752
Max      2.816        0.378
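As a small illustration (my own sketch) of how the table might be used, with the multipliers copied from above:

    # Estimated percentile = multiplier * median (from the table above).
    MULTIPLIER = {
        10: 0.467, 20: 0.631, 30: 0.762, 40: 0.883, 50: 1.000, 60: 1.121,
        68: 1.224, 70: 1.251, 80: 1.410, 90: 1.643, 95: 1.840, 99: 2.226,
    }

    def estimate_percentile(median_dE00, ptile):
        """Rough estimate of a percentile of a dE00 distribution from its median."""
        return median_dE00 * MULTIPLIER[ptile]

    # A median of 2.4 dE00 suggests a 90th percentile of about 3.9 dE00,
    # in line with the CPDF plots shown earlier.
    print(estimate_percentile(2.4, 90))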

Caveats and acknowledgements 

There has not been a great deal of work on this, but I have run into three papers.

Fred Dolezalek [8] posited in a 1994 TAGA paper that the CRF of ΔE variations of printed samples can be characterized by a single number. His reasoning was based on the statement that the distribution “should” be chi-squared with three degrees of freedom. He had test data from 19 press runs with an average of 20 to 30 sheet pulls. It’s not clear how many CMYK combinations he looked at, but it sounds like a few thousand data points, which is pretty impressive for the time for someone with an SPM 100 handheld spectrophotometer!

Steve Viggiano [9] considered the issue in an unpublished 1999 paper. He pointed out that the chi-squared distribution with three degrees of freedom can be derived from the assumptions that ΔL*, Δa*, and Δb* values are normally distributed, have zero mean, have the same standard deviation, and are uncorrelated. He pointed out that these assumptions are not likely to be met with real data. I'm inclined to agree with Steve, since I hardly understand anything of what he tells me.
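Here is a quick Monte Carlo sketch (mine, not Viggiano's) of what those assumptions imply: if ΔL*, Δa*, and Δb* are independent, zero-mean normals with a common standard deviation σ, then (ΔE/σ)² follows a chi-squared distribution with three degrees of freedom (equivalently, ΔE/σ follows a chi distribution).

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    sigma = 1.0
    dL, da, db = rng.normal(0.0, sigma, size=(3, 100_000))   # iid, zero-mean, equal sigma
    dE = np.sqrt(dL**2 + da**2 + db**2)

    # The simulated 95th percentile of (dE/sigma)^2 should sit near the
    # chi-squared(3) value of about 7.81.
    print(np.percentile((dE / sigma)**2, 95), chi2.ppf(0.95, df=3))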

David McDowell [10] looked at statistical distributions of color errors of a large number of Kodak QC-60 Color Input Targets and came to the conclusion that this set of color errors could be modeled as a chi-squared function.

Clearly, the distribution of color errors could be anything it wants to be. It all depends on where the data came from. This point was not lost on Dolezalek. In his analysis, he found that the distribution only looked like a chi-squared distribution when the press was running stable. 

Future research

What research paper is complete without a section devoted to "clearly further research is warranted"? This is research lingo for "this is why my project deserves to continue being funded."

I have not investigated whether the chi-squared function is the ideal function to fit all these distributions. Certainly it would be a good guess. I am glad to have a database that I can use to test this. While the chi-squared function makes sense, it is certainly not the only game in town. There are the logistic function, the Weibull function, all those silly beta functions... Need I go on? The names are as familiar to me as to everyone. Clearly further research is warranted.

Although I have access to lots of run time data, I have not investigated the statistical distributions of this data. Clearly further research is warranted.

Perhaps the chi-squared-ness of the statistical distribution of color errors is a measure of color stability? If there was a quick way to rate the degree that any particular data set fit the chi-squared function, maybe this could be used as an early warning sign that something is amiss. Clearly further research is warranted.

I have not attempted to perform Monte Carlo analysis on this, even though I know how to use random numbers to simulate physical phenomena, and even though I plan on writing a blog on Monte Carlo methods some time soon. Clearly further research is warranted.

I welcome additional data sets that anyone would care to send. Send me an email without attachment first, and wait for my response so that your precious data does not go into my spam folder: john@JohnTheMathGuy.com. With your help, further research will indeed be warranted.

Conclusion

My conclusion from this experiment is that the statistical distribution of color difference data, at least that from printing of test targets, can be summarized fairly well with a single data point. I have provided a table to facilitate conversion from the median to any of the more popular quantiles.

----------------------
[1] And if you are a color scientist, you are probably wondering when I am going to break into explaining the differences between deltaE ab, CMC, 94, and 2000 difference formulas, along with the DIN 99, and Labmg color spaces. Well, I'm not. At least not in this blog.

[2] For the uninitiated, color values are a set of three numbers (called CIELAB values) that uniquely defines a color by identifying the lightness of the color, the hue angle, and the degree of saturation.

[3] I have a Twitter account, so I have very little free time. Just in case you are taking a break from Twitter to read my blog, you can find me at @John_TheMathGuy when you get back to your real life of tweeting.

[4] Sometimes I just bring data up on the computer to look at. It's more entertaining than Drop Dead Diva, although my wife might disagree.

[5] How to decide the "correct" CIELAB value for a given CMYK value? If you happen to have a big collection of data that should all be similar (such as test targets that were all printed on a newspaper press) you can just average to get the target. I appreciate the comment from Dave McDowell that the statistical distribution of CIELAB values around the average CIELAB value will be different from the distribution around any other target value. I have not incorporated his comment into my analysis yet.

[6] Here is a pet peeve of mine. Someone might be tempted to say that they computed a bunch of ΔE00 values. This is not correct grammar, since "ΔE" is a unit of measurement, just like metric ton and inch. You wouldn't say "measured the pieces of wood and computed the inch values," would you?

[7] No warranties are implied. Use this chart at your own risk. Data from this chart has not been evaluated for color difference data from sources other than those described. The chart represents color difference data in  ΔE00, which may have a different CPDF than other color difference formulas.

[8] Dolezalek, Friedrich, Appraisal of Production Run Fluctuations from Color Measurements in the Image, TAGA 1995

[9] Viggiano, J A Stephen, Statistical Distribution of CIELAB Color Difference, Unpublished, 1999

[10] McDowell, David (presumed author), KODAK Q-60 Color Input Targets, KODAK technical paper, June 2003

Wednesday, August 29, 2012

People do not make good statisticians


The lottery and gambling

The United States government has come to the realization that Japan is leading us in mathematical literacy. The government's approach to this, as with cigarettes and alcohol, is to attempt to change our behavior by putting a tax on what they don't like, in this case mathematical illiteracy. They call this tax the lottery.
Paraphrase of comedian Emo Phillips
Every American should learn enough statistics to realize that "One-in-25,000,000" is so close to "ZERO-in-25,000,000" that not buying a lottery ticket gives you almost virtually the same chance of winning as when you do buy one!
Mike Snider in MAD magazine, Super Special December 1995, p.48

I was in college when McDonald's started their sweepstakes. Finding the correct gamepiece was going to make someone a millionaire. I had a friend named Peter[1] with a hunch. He was going to win.
I was, on the other hand, a math major. I considered my odds of being that one person in the United States who would be made incomprehensibly rich. There were a hundred million people trying to find that one lucky gamepiece. My chances were one in one hundred million of winning a million dollars. In my book, my long-run expectation was winnings of about a penny. Despite the fact that I was a poor student, scrounging to find tuition and rent, the prospect of winning (on average) one cent did not excite me. I was not about to go out of my way to earn this penny.
I was familiar with the Reader's Digest Sweepstakes. I had sat down and calculated the expected winnings in the sweepstakes. I expected to win something less than the price of the postage stamp I would need to invest in order to submit my entry, so I chose not to enter.
Peter was not a math major. Peter knew that if he was to win, he needed to put forth effort to appease the goddess Tyche[2]. Whenever we went out, whenever we passed the golden arches, he took us through the drive-through to pick up a gamepiece. Since future millionaires should not look like tightwads, he would order a little something. He would buy a soda and maybe an order of fries.
I took Peter to task for his silly behavior. I explained to him calmly the fundamentals of probability and expectation. I explained to him excitedly that he was being manipulated, being duped into spending much more money at McDonald's than he would have normally. He told me that he would laugh when he received his one million dollars.
Did he win? No. In college, I took this as vindication that I was right. This event validated for me a pet theory: people are not good statisticians[3].
Our state (Wisconsin) has instituted Emo Phillips' tax on mathematical illiteracy. By not participating, I am a winner in the lottery. Profits from the lottery go to offset my property taxes. It is with mixed emotion that I receive this rebate each year. Like anyone else, I appreciate saving money. I even take a small amount of smug satisfaction that I win several hundred dollars a year from the lottery, and I have never purchased a lottery ticket. And I have made this money from people like Peter, who do not understand statistics.

One newsclip caught me in my smugness. The report characterized the typical buyer of a lotto ticket as surviving somewhere near the poverty level. I believe that we all have a right to decide where to spend our money. I don't think that the government should only sell lotto tickets to people who can prove that their income is above a certain level. I am, however, troubled by the image of my taxes being subsidized by an old woman who is just barely scratching out a living on a pension.
This image was enough for me to reconsider my mandate that people should base all their decisions on rational enumeration of the possible outcomes, assignation of probabilities, and computation of the expectation. What if I were the pensioner who never had enough money to buy a balanced diet after rent was paid? In the words of the song, "If you ain't got nothin', you got nothin' to lose."  Is the pensioner buying a lottery ticket because he or she is not capable of rationally considering the options? Or are all options "bad", so that the remote chance of making things significantly different is worth the risk? Not too long ago, I would have blamed the popularity of the lottery on mathematical illiteracy. Today I am not so sure.

Psychological perspective

Where observation is concerned, chance favors only the prepared mind.
Louis Pasteur
Aristotle maintained that women have fewer teeth than men; although he was twice married, it never occurred to him to verify this statement by examining his wives' mouths.
Bertrand Russell, The Impact of Science on Society
As engineers and scientists, we like to consider ourselves to be unbiased in our observations of the world. Worchel and Cooper (authors of the psychology text I learned from) lend support for our self-evaluation:
[Studies] demonstrate that if people are given the relevant information, they are capable of combining it in a logical way...
If we read on, we are given a different perspective of the ability of the human brain to tabulate statistics:
But will they? ... We know from studies of memory processes and related cognitive phenomena that information is not always processed in a way that gives each bit of information equal access and usefulness.
Worchel and Cooper go on to describe experimental evidence of people not weighing all data equally. Furthermore, we tend to be biased in our judgment of an event when we are involved in that event, our placement of blame in an accident depends on the extent of damages, and we generally weight a person's behavior higher than we weight the particular situation the person is in.

The primacy effect

Several other rules can be invoked to explain our faulty data collection. The first rule to explain what information is retained is the primacy effect. This states that the initial items are more likely to be remembered. This fits well with folklore like, "You never get a second chance to make a first impression," and "It is important to get off on the right foot." Statistically speaking, the primacy effect can be thought of as applying a higher weighting on the first few data points.
In one experiment of the primacy effect, the subject is shown a picture of a person, and is given a list of adjectives describing this person. The order of the adjectives is changed for different subjects. After seeing the picture and word list, the subject is asked to describe the person. The subject's description most often agrees with the first few adjectives on the list.

The recency effect

The second rule to explain memory retention is the recency effect. This states that, for example, the last items on a list of words (the most recently seen items) are also more likely than average to be remembered. In other words, the most recent data points are also more heavily weighted than average. As an example of this, I remember what I had for lunch today, but I can barely remember what I had the day before. If my doctor were to ask me what I normally had for lunch, would my statistics be reliable?

The novelty effect


The third rule states that items or events which are very unusual are apt to be remembered. This is the novelty effect. I once had the pleasure to work in a group with a gentleman who stood 6'5". When he was standing with some other team members who were just over six foot, a remark was made that we certainly had a tall team. In going over the members of this team, I recall four men who were 6'2" or taller. But I also remember a dozen who were an average height of 5'8" to 6', and I recall two others who were around 5'4". The novelty of a man who was seven inches above average, and the image of him standing with other tall men, was enough to substitute for good statistics.
I recall one incident where a group of engineers was just beginning to get an instrument close to specified performance. The first time the instrument performed within spec, we joked that this performance is "typical". The second time the instrument performed within spec (with many trials in between), we upgraded the level of performance to "repeatable". The underlying truth of this joking was the tendency for all of us to only remember those occasions of extremely good performance.

The paradigm effect

A fourth rule which stands as a gatekeeper on our memory is the paradigm effect. This states that we tend to form opinions based on initial data, and that these opinions filter further data which we take in. An example of the paradigm effect will be familiar to anyone who has struggled to debug a computer program, only to realize (after reading through the code countless times) the mistake is a simple typographical error. The brain has a paradigm of what the code is supposed to do. Each time the code is read, the brain will filter the data which comes in (that is, filter the source code) according to the paradigm. If the paradigm says that the index variable is initialized at the beginning, or that a specific line does not have a semi-colon at the end, then it is very difficult to "see" anything else.

The paradigm effect is more pervasive than any objective researcher is willing to admit. I have found myself guilty of paradigms in data collection. I start an experiment with an expectation of what to see. If the experiment delivers this, I record the results and carry on with the next experiment. If the experiment fails to deliver what I expect, then I recheck the apparatus, repeat the calibration, double check my steps, etc. I have tacitly assumed that results falling out of my paradigm must be mistakes, and that data which fits my paradigm is correct. As a result, data which challenges my paradigm is less likely to be admitted for serious analysis.
An engineer by the name of Harold[4] had built up some paradigms about the lottery. He showed me that he had recorded the past few months of lottery numbers in his computer. He showed me that three successive lottery numbers had a pattern. When he noticed this, Harold bought lots of lottery tickets. The pattern unfortunately did not continue into the fourth set of lottery numbers. As Harold explained it to me, "The folks at the lottery noticed the pattern and fixed it."
Harold's paradigm was that there were patterns in the random numbers selected by lottery machines. Harold had two choices when confronted with a pattern which did not continue long enough for him to get rich. He could assume that the pattern was just a coincidence, or he could find an explanation why the pattern changed. In keeping true to his paradigm, Harold chose the latter. When he explained this to me, I realized that it was fruitless to try to argue him out of something he knew to be true. I commented that the folks at the lotto had bigger and faster computers than Harold, just so they could keep ahead of him.
As another example of the paradigm effect, consider an engineer named William[5]. William was a heavy smoker and had his first heart attack in his mid-forties. He was asked once why he kept smoking, when the statistics were so overwhelming that continuing to smoke would kill him. William replied that his heart attack was due to stress. Smoking was his way of dealing with stress. To deprive himself of this stress relief would surely kill him. Furthermore, stopping smoking is stressful in and of itself.
William's paradigm was that he was a smoker. No amount of evidence could convince him that this was a bad idea. Evidently the paradigm is quite strong. In a recent study, roughly half of bypass patients continue to smoke after the surgery. William had six more heart attacks and died after his third stroke.

The primacy effect and the paradigm effect working together

The primacy effect and the paradigm effect often work together to make us all too willing to settle for inadequate data. My own observation is that people often settle for a few data points, and are often surprised to find out how shaky their observation is, statistically speaking.
A case in point is my belief that young boys are more aggressive than young girls. The first young girls I had opportunity to closely observe were my own two daughters, who I would not call aggressive. The first young boy I observed in any detail was the neighbor's, who I would call aggressive.  My conclusion is that young boys are aggressive, and young girls are not.
Note that, if three people are picked at random, it is not terribly unlikely that the first person chosen is aggressive, and the other two are not. In other words, I have no need to appeal to a correlation between gender and aggressiveness to explain the data. The simple explanation of chance would suffice.
The primacy effect says that these three children were the most influential in shaping my initial beliefs. The paradigm effect says that the future data which I "record" will be the data which supports my initial paradigm.
In terms of evolution, one would be tempted to state that an animal with poor statistical abilities would not be as successful as an animal which was capable of more accurate statistical analysis. Surely the hypothetical Homo statistiens would be able to more accurately assess the odds of finding food or avoiding predators.
Consider the hypothetical Homo statistiens' first encounter with a saber toothed tiger. Assume that he/she was lucky enough to survive the encounter. On the second encounter, Homo statistiens would reason that not enough statistics were collected to determine whether saber toothed tigers were dangerous. Any good statistician knows better than to draw any conclusions from the first data point. Clearly, there is an evolutionary advantage to Homo sapiens, who jumps to conclusions after the first saber toothed tiger encounter.

In the words of Desmond Morris,
Traumas... show clearly that the human animal is capable of a rather special kind of learning, a kind that is incredibly rapid, difficult to modify, extremely long-lasting and requires no practice to keep perfect.

The effect of peer pressure


When Richard Feynman was investigating the Challenger disaster, he uncovered another fine example of how poor people are at statistics. He was reading reports and asking questions about the reliability of various components of the Challenger, and found some wild discrepancies in the estimated probabilities of failure. In one meeting at NASA, Feynman asked the three engineers and one manager who were present to write down on a piece of paper the probability of the engine failing. They were not to confer, or to let the others see their estimates. The three engineers gave answers in the range of 1 in 200 to 1 in 300. The manager gave an estimate of 1 in 100,000.
This anecdote illustrates the wide gap in judgment which Feynman found between management and engineers. Which estimate is more reasonable? Feynman dug quite deeply into this question. He talked to people with much experience launching unmanned spacecraft. He reviewed reports which analytically assessed the probability of failure based on the probability of failure of each of the subcomponents, and of each of the subcomponents of the subcomponents, and so on. He concludes:
If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of the originally conservative certification criteria designed to guarantee a very safe vehicle... The shuttle therefore flies in a relatively unsafe condition, with a chance of failure on the order of a percent.
On the other hand, Feynman is particularly candid about the "official" probability of failure:
If a guy tells me the probability of failure is 1 in 10⁵, I know he's full of crap.
How can it be that the bureaucratic estimate of failure disagrees so sharply with the more reasonable engineer's estimate? Feynman speculates that the reason for this is that these estimates need to be very small in order to ensure continued funding. Would congress be willing to invest billions of dollars on a program with a one in a hundred chance of failure? As a result, much lower probabilities are specified, and calculations are made to justify that this level of safety can be reached.
I am reminded of another experiment which was devised by the psychologist Solomon Asch in 1951. In this experiment, the subject was told that this was an experiment investigating perception. The subject was to sit among four other "subjects", who were actually confederates. The "subjects" were shown a set of lines on a piece of paper (for example) and were asked to state out loud which line was longest. The actors were called on first, one at a time. They were instructed to give obviously incorrect answers in 12 of 18 trials, but they were to all agree on the incorrect answer.
It was found that 75% of subjects caved in to peer pressure, and agreed with the obviously incorrect answers. When asked about their answers later, away from the immediate effects of peer pressure, the subjects held to their original answers, incorrect or not. As far as can be measured with this psychological experiment, the subjects came to believe that a two inch long line was shorter than a one inch long line.
So it is with NASA's reliability data. The data may never have had any shred of credence whatsoever, but simply by repeating "1 in 100,000" often enough, it became truth.
I have included Feynman's example not to put down NASA, or promote the ever-popular game of "manager bashing", but to illustrate this ever-so-human trait that we are all prone to. We believe what others believe, and we believe what we would like to be true.

Summary

The effects mentioned here together support the statement that people do not make good statisticians. The point which is made is not that "mathematically inept people are poor statisticians", or that "people are incapable of performing good statistics". The point is that the natural tendency is for people to not be good at objectively analyzing data. This goes for high school drop-outs as well as engineers, scientists and managers. In order for people to produce good statistics, they need to rely not on their memory and intuition, but on paper and statistical calculations.

Bibliography

Feynman, Richard P., What do you care what other people think?, 1988, Penguin Books Canada Ltd.
Flanagan, Dennis, Flanagan's Version, 1989, Random House
Krech, David, Crutchfield, Richard S., and Livson, Norman, Elements of Psychology, Third Edition, 1974, Alfred A. Knopf, Inc.
Morris, Desmond, The Human Zoo, 1969, McGraw Hill
Worchel, Stephen, and Cooper, Joel, Understanding Social Psychology, revised edition, 1979, Dorsey Press



[1] Not his real name.
[2] Tyche was the Greek goddess of luck.
[3] I met one person whose behavior indicated that he was not a good statistician, therefore all people are not good statisticians... [This demonstrates that I am not a good statistician, since I am content with a sample size of one. There are, therefore, two people who are not good statisticians, and this further proves my point.]
[4] Not his real name, either.
[5] You guessed it. Not his real name.