## Tuesday, October 25, 2016

### Statistical process control of color difference data, part 2

Last week, some stark raving mad heretic grabbed my blogging pen, spouting out some blasphemy about how the classical approach to process control is doomed to fail for color difference data. Asteroids laying waste to heavily populated areas, cats sleeping with dogs, my local Starbucks being out of chai... all that doomsday stuff.

Well, perhaps the guy who was using my blogging pen wasn't stark raving mad. Maybe he was just stark raving "mildly annoyed"? And maybe the heretic wasn't just some other guy? I don't want to point the finger, but it might have been me who wrote the blog post. So, perhaps I need to take his contentious assertion seriously?

Here are the sacrilegious assertions from last week's blog post:

Part 1 - Color difference data does not fit a Normal Distribution.
Part 2 - Classical SPC is largely based on the assumption of normality, so much of it does not work well for color difference data.

I submit the chart below as evidence for the first assertion.

This is not normal data!

I need to give some provenance for this data.

In 2006, the SNAP committee (Specifications for Newspaper Advertising Production) took on a large project to come to some consensus about what color you get when you mix specific quantities of CMYK ink on newsprint. A total of 102 newspapers printed a test form on their presses. The test form had 928 color patches. All of the test forms were measured by one very busy spectrophotometer. The data was averaged by patch type, and it became known as CGATS TR 002.

For this blog post, I had a close look at the original data. For each of the 928 patches and for each of the 102 printers, I compared the average L*a*b* value against the measured L*a*b* value. As a result, I had just short of 100K color difference values (in ΔE00).

Of the 94,656 color differences, there were 1,392 that were between 0.0 ΔE00 and 0.5 ΔE00. There were 7,095 between 0.5 ΔE00 and 1.0 ΔE00. And so on. The blue bars in the above chart are a histogram of this color difference data.

I computed the mean and standard deviation of the color difference data: 2.93 and 1.78, respectively. The orange line in the above chart is a normal distribution with those values. Now, we all like to think our data is normal. We all like to think that our data doesn't skew to the right or to the left. The bad news for this election season is that our color difference data is not normal. It is decidedly skewed to the right: the peak sits to the left of the mean, and a long tail stretches out toward the large color differences. (I provide no comment on whether other data in this election season is skewed either to the right or to the left.)
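For anyone who wants to check the orange line against the blue bars without the chart, here is a minimal sketch. It assumes only the numbers quoted above (the mean, the standard deviation, the total count, and the first two bin counts). The function names `normal_cdf` and `expected_count` are made up for this illustration and are not from the original analysis.

```python
import math

N_TOTAL = 94_656   # number of color differences quoted in the post
MEAN = 2.93        # mean dE00 quoted in the post
SD = 1.78          # standard deviation quoted in the post

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def expected_count(lo, hi, mean=MEAN, sd=SD, n=N_TOTAL):
    """Number of values a normal distribution would put in [lo, hi)."""
    return n * (normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd))

# Observed counts for the first two histogram bins, as quoted in the post.
observed = {(0.0, 0.5): 1392, (0.5, 1.0): 7095}

for (lo, hi), count in observed.items():
    predicted = expected_count(lo, hi)
    print(f"{lo:.1f} to {hi:.1f} dE00: observed {count}, normal predicts {predicted:.0f}")

# Fraction of the fitted normal curve that lies below zero,
# a region where no color difference can exist.
below_zero = normal_cdf(0.0, MEAN, SD)
print(f"Normal curve mass below 0 dE00: {below_zero:.1%}")
```

The fitted normal curve puts roughly 5% of its area below zero, which is one more hint that a normal distribution is the wrong shape for data that can never go negative.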

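The coefficient of skewness of this distribution is about 1.0. A truly normal distribution has zero skewness, and for a normal sample of this size the skewness estimate should wander from zero by only about 0.008 (its standard error, roughly √(6/n)). So the skewness I measured is about 125 times what one might expect from normal data. "The data is skewed, Jim!"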
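The factor of 125 can be reproduced with a back-of-the-envelope calculation. This is a sketch of one plausible reading, assuming the comparison is against the large-sample standard error of skewness for normal data, √(6/n). The helper names are hypothetical, and the post does not say which skewness estimator was actually used.

```python
import math

def sample_skewness(values):
    """Moment coefficient of skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def skewness_standard_error(n):
    """Approximate standard error of skewness for a normal sample of size n."""
    return math.sqrt(6.0 / n)

n = 94_656                 # number of color differences in the data set
observed_skewness = 1.0    # approximate value quoted in the post

se = skewness_standard_error(n)
ratio = observed_skewness / se
print(f"standard error of skewness: {se:.4f}")
print(f"observed skewness is about {ratio:.0f} standard errors from zero")
```

A positive result from `sample_skewness` means the long tail is on the high side, which is what you get when values pile up near zero and occasionally run large.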

The data is skewed, Jim!

Ok. So Bones tells us the data is skewed?  Someone may argue that I have committed the statistical equivalent of a venial sin. True. I combined apples and oranges. When I computed the color differences, I was comparing apples to apples, but then I piled all the apple differences and all the orange differences into one big pile. Is there some reason to put the variation of solid cyan patches in the same box as the variation of 50% magenta patches?

Just to check that, I pulled out the patches individually, and did the skewness test on each of the 928 sets of data. Sorry, nit pickers. Same results. "The data is still skewed, Jim!"

The data is still skewed, Jim!

Yeah, but who cares?  The whole classical process control thing will still work out, right? Well.... maybe. Kinda maybe. Or, kinda maybe probably not.

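I looked once again at the data set. For each of the 928 patches, I computed the 3 sigma upper limit for color difference data (the mean plus three standard deviations of that patch's 102 color differences). Then I counted outliers, meaning the color differences that landed above that limit. Before I go on, I will come up with a prediction of how many outliers we expect to see.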

One would think that the folks doing these 102 press runs were reasonably diligent in the operation of the press for these press runs. The companies all volunteered their time, press time, and materials to this endeavor, so presumably they cared about getting good results. I think it is reasonable to assume that on the whole, they upped their game, if only a little bit just to humor the boss.

Further, back in 2006, several people (myself included) blessed the data. No one could come up with any strong reason to remove any of the individual data points.

So, I am going to state that the variation in the data set should be almost entirely "common cause" variation. This is the inevitable variation that we will see out of any process. Now, let's review the blog post of an extremely gifted and bashful applied mathematician and color scientist. Last week, I wrote the following:

If the process produces normal data, and if nothing changes in our process, then 99.74% of the time, the part will be within those control limits. And once every 400 parts, we will find a part that is nothing more than an unavoidable statistical anomaly.

There were 94,656 data points, and we expect 0.26% outliers... that would put the expectation at about 249 outliers in the whole bunch. Drum roll, please... I found 938! For this data set, I found nearly four times as many outliers as expected.
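The counting procedure is simple enough to sketch. The code below is an illustration under stated assumptions, not the original analysis: it does not use the SNAP data, the function names are invented, and the "color difference" stand-in is just the length of a vector of three independent normal values (a Euclidean difference, not ΔE00).

```python
import random
import statistics

def upper_control_limit(values):
    """Mean plus three sample standard deviations."""
    return statistics.mean(values) + 3.0 * statistics.stdev(values)

def count_upper_outliers(values):
    """Count values that exceed the 3 sigma upper limit of their own data."""
    limit = upper_control_limit(values)
    return sum(1 for v in values if v > limit)

# Arithmetic from the post: outliers found versus outliers expected.
FOUND, EXPECTED = 938, 249
ratio_found_to_expected = FOUND / EXPECTED
print(f"found / expected = {ratio_found_to_expected:.2f}")

# Synthetic stand-ins (NOT the SNAP data), equal sizes for a fair comparison.
rng = random.Random(2016)
N = 50_000

# Plain normal data, the case classical SPC has in mind.
normal_like = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Assumed stand-in for color differences: the length of a vector whose
# three components are independent normal values. It can never be negative.
delta_e_like = [
    (rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2) ** 0.5
    for _ in range(N)
]

normal_outliers = count_upper_outliers(normal_like)
delta_e_outliers = count_upper_outliers(delta_e_like)
print(f"normal data above upper limit:       {normal_outliers}")
print(f"difference-like data above the limit: {delta_e_outliers}")
```

With the same rule applied to both synthetic sets, the difference-like values should trip the upper limit noticeably more often than the normal values do. The size of that gap depends on the assumptions baked into the stand-in, so it should not be compared directly to the factor found in the SNAP data.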

To put this in practical terms, if a plant were to have followed traditional statistical process control methods on this set of color difference data, it would be shutting down the presses to check their operation about four times as often as it really should. This is a waste of time and money, and as Deming would tell us, stopping the presses and futzing with them just causes additional variation.

I should remark that this factor of four is based on one data set. I think it is a good data set, since it is very much real world. But perhaps it includes additional variation because there were 102 printing plants involved? Perhaps there is some idiosyncrasy in newspaper presses? Perhaps there is an idiosyncrasy involved in using the average of all 102 to determine the target color?

I would caution against trying to read too much into the magic factor of four that I arrived at for this data set. But, I will hold my ground and say that the basic principle is sound. Color difference data is not normally distributed, so the basic assumptions about statistical process control are suspect.

In next week's installment of this exciting series, I will investigate the theoretical basis for non-normality of color difference data.


1. It would be interesting to see plots of the component differences for this dataset. Conventional wisdom says they will be normally distributed. If that is so, what might you do with this feature? Dealing with multivariate process control specifications is a pain, but folks do it.

2. Stay tuned for the next in the series, Dave! I will, indeed, look at the individual distributions for L*, for a*, and for b*.

3. You cannot get a normal distribution because DeltaEs cannot be negative. Isn't this a chi-squared distribution?

4. Max, here is a teaser from my next part of this series:
"For a true normal distribution, there is always a chance - perhaps an incredibly tiny chance - that the values could be negative. But color difference data never goes negative."

As for your chi-squared conjecture, I can't quote from it, since I haven't written about it yet! But I intend to mention papers from Fred Dolezalek, David McDowell, and Steve Viggiano about deltaE and chi-squared.

1. There is also a nice 2011 CR&A article from Nadal, Miller and Fairman discussing "Statistical Methods for Analyzing Color Difference Distributions."

2. Not chi-squared, but chi, and only under several unrealistic assumptions, such as L*, a*, and b* being independently and identically normally distributed. The assumption of zero mean is also made to obtain the chi distribution with three degrees of freedom, but this is less unrealistic for a process operating close to its aim.

3. Thank you for the clarification, Steve. You are correct that it is not deltaE which is chi-squared, but rather deltaE squared. A very smart guy once wrote that another very smart guy told him that the equivalent distribution for deltaE would be the Rayleigh distribution. If I recall correctly, the first smart guy is someone you have lunch with every single day that you have lunch. And the second smart guy is Ed Granger.

As for the unrealistic assumptions, that is the main topic for the next blog post in this series.

5. Nice articles! I'm wondering if the distributions you are observing are related only to it being color difference data or if they also have something to do with the printing process. What happens if you apply this logic to camera capture process control, such as you see with FADGI and Metamorfoze?

6. Brian, thanks for sharing the article. I had not seen it before.

Anonymous, printing certainly has its idiosyncrasies. Certainly some of this shows up in the data I have above. And I will admit that most of the data sets that I have examined are from print.

But, I think the upcoming installments will show that there is an idiosyncrasy with color difference data that is fundamental and based on the underlying math.

7. You might be interested in this post from the excellent 'SPC for Excel' site:
https://www.spcforexcel.com/knowledge/basic-statistics/deciding-which-distribution-fits-your-data-best

8. John, I'm using Dagum I. It fits well in most cases. Just asking: what does an income distribution have to do with delta E?
Are high delta E values as rare as high incomes?

9. I am guessing that income distribution has much longer tails than deltaE data. Typically, the 99th percentile of deltaE is a little more than twice the size of the median.

Are you using the Dagum distribution for deltaE data, or for income data?