Last week, some stark raving mad heretic grabbed my blogging pen, spouting out some blasphemy about how the classical approach to process control is doomed to fail for color difference data. Asteroids laying waste to heavily populated areas, cats sleeping with dogs, my local Starbucks being out of chai... all that doomsday stuff.
Well, perhaps the guy who was using my blogging pen wasn't stark raving mad. Maybe he was just stark raving "mildly annoyed"? And maybe the heretic wasn't just some other guy? I don't want to point the finger, but it might have been me who wrote the blog post. So, perhaps I need to take his contentious assertion seriously?
Here are the sacrilegious assertions from last week's blog post:
Part 1 - Color difference data does not fit a Normal Distribution.
Part 2 - Classical SPC is largely based on the assumption of normality, so much of it does not work well for color difference data.
I submit the chart below as evidence for the first assertion.
I need to give some provenance for this data.
In 2006, the SNAP committee (Specifications for Newspaper Advertising Production) took on a large project to come to some consensus about what color you get when you mix specific quantities of CMYK ink on newsprint. A total of 102 newspapers printed a test form on its presses. The test form had 928 color patches. All of the test forms were measured by one very busy spectrophotometer. The data was averaged by patch type, and it became known as CGATS TR 002.
For this blog post, I had a close look at the original data. For each of the 928 patches and for each of the 102 printers, I compared the average L*a*b* value against the measured L*a*b* value. As a result, I had just short of 100K color difference values (in ΔE00).
Of the 94,656 color differences, there were 1,392 that were between 0.0 ΔE00 and 0.5 ΔE00. There were 7,095 between 0.5 ΔE00 and 1.0 ΔE00. And so on. The blue bars in the above chart are a histogram of this color difference data.
I computed the mean and standard deviation of the color difference data: 2.93, and 1.78, respectively. The orange line in the above chart is a normal distribution with those values. Now, we all like to think our data is normal. We all like to think that our data doesn't skew to the right or to the left. The bad news for this election season is that our color difference data is not normal. It is decidedly skewed to the left. (I provide no comment on whether other data in this election season is skewed either to the right or to the left.)
The coefficient of skewness of this distribution is about 1.0, which is about 125 times the skewness that one might expect from a normal distribution. "The data is skewed, Jim!"
Ok. So Bones tells us the data is skewed? Someone may argue that I have committed the statistical equivalent of a venial sin. True. I combined apples and oranges. When I computed the color differences, I was comparing apples to apples, but then I piled all the apple differences and all the orange differences into one big pile. Is there some reason to put the variation of solid cyan patches in the same box as the variation of 50% magenta patches?
Just to check that, I pulled out the patches individually, and did the skewness test on each of the 928 sets of data. Sorry, nit pickers. Same results. "The data is still skewed, Jim!"
Yeah, but who cares? The whole classical process control thing will still work out, right? Well.... maybe. Kinda maybe. Or, kinda maybe probably not.
I looked once again at the data set. For each of the 928 patches, I computed the 3 sigma upper limit for color difference data. Then I counted outliers. Before I go on, I will come up with a prediction of how many outliers we expect to see.
One would think that the folks doing these 102 press runs were reasonably diligent in the operation of the press for these press runs. The companies all volunteered their time, press time, and materials to this endeavor, so presumably they cared about getting good results. I think it is reasonable to assume that on the whole, they upped their game, if only a little bit just to humor the boss.
Further, back in 2006, several people (myself included) blessed the data. No one could come up with any strong reason to remove any of the individual data points.
So, I am going to state that the variation in the data set should be almost entirely "common cause" variation. This is the inevitable variation that we will see out of any process. Now, let's review the blog post of an extremely gifted and bashful applied mathematician and color scientist. Last week, I wrote the following:
If the process produces normal data, and if nothing changes in our process, then 99.74% of the time, the part will be within those control limits. And once every 400 parts, we will find a part that is nothing more than an unavoidable statistical anomaly.
I submit the chart below as evidence for the first assertion.
This is not normal data!
I need to give some provenance for this data.
In 2006, the SNAP committee (Specifications for Newspaper Advertising Production) took on a large project to come to some consensus about what color you get when you mix specific quantities of CMYK ink on newsprint. A total of 102 newspapers printed a test form on its presses. The test form had 928 color patches. All of the test forms were measured by one very busy spectrophotometer. The data was averaged by patch type, and it became known as CGATS TR 002.
For this blog post, I had a close look at the original data. For each of the 928 patches and for each of the 102 printers, I compared the average L*a*b* value against the measured L*a*b* value. As a result, I had just short of 100K color difference values (in ΔE00).
Of the 94,656 color differences, there were 1,392 that were between 0.0 ΔE00 and 0.5 ΔE00. There were 7,095 between 0.5 ΔE00 and 1.0 ΔE00. And so on. The blue bars in the above chart are a histogram of this color difference data.
I computed the mean and standard deviation of the color difference data: 2.93, and 1.78, respectively. The orange line in the above chart is a normal distribution with those values. Now, we all like to think our data is normal. We all like to think that our data doesn't skew to the right or to the left. The bad news for this election season is that our color difference data is not normal. It is decidedly skewed to the left. (I provide no comment on whether other data in this election season is skewed either to the right or to the left.)
The coefficient of skewness of this distribution is about 1.0, which is about 125 times the skewness that one might expect from a normal distribution. "The data is skewed, Jim!"
The data is skewed, Jim!
Ok. So Bones tells us the data is skewed? Someone may argue that I have committed the statistical equivalent of a venial sin. True. I combined apples and oranges. When I computed the color differences, I was comparing apples to apples, but then I piled all the apple differences and all the orange differences into one big pile. Is there some reason to put the variation of solid cyan patches in the same box as the variation of 50% magenta patches?
Just to check that, I pulled out the patches individually, and did the skewness test on each of the 928 sets of data. Sorry, nit pickers. Same results. "The data is still skewed, Jim!"
The data is still skewed, Jim!
Yeah, but who cares? The whole classical process control thing will still work out, right? Well.... maybe. Kinda maybe. Or, kinda maybe probably not.
I looked once again at the data set. For each of the 928 patches, I computed the 3 sigma upper limit for color difference data. Then I counted outliers. Before I go on, I will come up with a prediction of how many outliers we expect to see.
One would think that the folks doing these 102 press runs were reasonably diligent in the operation of the press for these press runs. The companies all volunteered their time, press time, and materials to this endeavor, so presumably they cared about getting good results. I think it is reasonable to assume that on the whole, they upped their game, if only a little bit just to humor the boss.
Further, back in 2006, several people (myself included) blessed the data. No one could come up with any strong reason to remove any of the individual data points.
So, I am going to state that the variation in the data set should be almost entirely "common cause" variation. This is the inevitable variation that we will see out of any process. Now, let's review the blog post of an extremely gifted and bashful applied mathematician and color scientist. Last week, I wrote the following:
If the process produces normal data, and if nothing changes in our process, then 99.74% of the time, the part will be within those control limits. And once every 400 parts, we will find a part that is nothing more than an unavoidable statistical anomaly.
There were 94,656 data points, and we expect 0.26% outliers... that would put the expectation at about 249 outliers in the whole bunch. Drum roll, please... I found 938! For this data set, I found four times as many outliers as expected.
To put this in practical terms, if a plant were to have followed traditional statistical process control methods on this set of color difference data, they would be shutting down the presses to check it's operation four times as often as they really should. This is a waste of time and money, and as Deming would tell us, stopping the presses and futzing with them just causes additional variation.
Traditional statistical process control of color difference data is dead, Jim!
I should remark that this factor of four is based on one data set. I think it is a good data set, since it is very much real world. But perhaps it includes additional variation because there were 102 printing plants involved? Perhaps there is some idiosyncrasy in newspaper presses? Perhaps there is an idiosyncrasy involved in using the average of all 102 to determine the target color?
I would caution against trying to read too much into the magic factor of four that I arrived at for this data set. But, I will hold my ground and say that the basic principle is sound. Color difference data is not normally distributed, so the basic assumptions about statistical process control are suspect.
In next week's installment of this exciting series, I will investigate the theoretical basis for non-normality of color difference data.
Move on to Part 3
Move on to Part 3