In a previous post, I looked at how the Zc statistic can be used to isolate individual color measurements that are icky-poopy. Today I look at a slightly broader question: How can I tell if the whole production run is wonky?
I think something went wrong in this production run
I deliberately use the word "wonky" because it has a squishy meaning, which is helpful, since I'm not sure what I want it to mean just yet! So, bear with me while I fumble around in the darkness, not quite knowing what I am doing.
Here is the premise: Color, in a well-controlled production run, should vary according to some specific type of statistical distribution. (Math mumbo-jumbo alert) I will take a guess that the cloud of points in that ellipsoid of L*a*b* values is a three-dimensional Gaussian distribution, with the axes appropriately tilted and elongated. If this is the case, then the distribution of Zc will be chi with three degrees of freedom. (End math mumbo-jumbo alert.)
If you are subscribed to the blog post reader that automatically removes sections of math mumbo-jumbo, then I will recap the last paragraph in a non-mumbo-jumbo way. In stats, we make the cautious assumption of the normal distribution. Since I am inventing this three-dimensional stats thing, I will cautiously assume the three-dimensional equivalent. But, since this is virgin territory, I will start by testing this assumption.
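For the mumbo-jumbo-tolerant, here is a minimal sketch of what I mean, in Python rather than the Mathematica I actually used. I'm taking Zc to be the distance of a color from the mean of its cloud, measured in the cloud's own covariance-scaled units (the Mahalanobis-style distance from the previous post). If the cloud really is 3D normal, those Zc values follow a chi distribution with three degrees of freedom. The target color and covariance below are made up just so the sketch runs.

```python
import numpy as np
from scipy import stats

def zc_values(lab):
    """Zc for each row of an n x 3 array of L*a*b* values: the distance
    from the mean of the cloud, measured in covariance-scaled units
    (the Mahalanobis distance)."""
    mu = lab.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(lab, rowvar=False))
    d = lab - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', d, cov_inv, d))

# A made-up tilted, elongated 3D Gaussian cloud of L*a*b* values
rng = np.random.default_rng(0)
mean = [50.0, 20.0, -10.0]          # hypothetical target color
cov = [[1.0, 0.4, 0.2],             # hypothetical covariance:
       [0.4, 2.0, 0.6],             # tilted and elongated
       [0.2, 0.6, 0.5]]
lab = rng.multivariate_normal(mean, cov, size=20000)

zc = zc_values(lab)

# If the cloud is 3D normal, Zc follows a chi distribution with 3 d.o.f.
chi3 = stats.chi(3)
print("fraction with Zc > 2:", (zc > 2).mean())   # empirical
print("chi(3) says:         ", chi3.sf(2))        # theoretical
```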
A quick note about CIELAB target values and DE
This blog post is not about CIELAB target values and DE. Today, I'm not talking about assessing conformance, so DE is not part of the discussion. I am talking about whether the process is stable, not whether it's correct.
A look at some real good data
Kodak produced a photographic test target, known as the Q60 target, which was used to calibrate scanners. The target would be read by a scanner, and the RGB values that came off it were compared against the L*a*b* values for that batch of targets in order to calibrate the scanner. When the scanner encountered that same type of film, this calibration would be used to convert RGB values into moderately accurate L*a*b* values. Hundreds of thousands of these test targets were manufactured between 1993 and 2008.
I think the lady peeking out on the right is sweet on me
We know that these test targets were produced under stringent process control. They were expensive, and expensive always means better. More importantly, they were produced under the direction of Dave McDowell. I have worked with him for many years in standards groups, and I know they don't come more persnickety about getting the details right than him!
Dave provided me with data on 76 production runs of Ektachrome: the average L*a*b* values of each of the 264 patches from each run, for a total of about 20K data points. So, I had a big pile of data, collected from production runs that were about as well regulated as you can get.
I applied my magic slide rule to each of the 264 sets of 76 color values. Note that I pooled the data by individual patch color, rather than lumping everything together. General rule in stats: you don't wanna be putting stuff in the same bucket that belongs in different buckets. They will have different distributions.
Within each of the 264 buckets, I computed Zc values. Twenty thousand of them. I hope you're appreciative of all the work that I did for this blog post. Well... all the work that Mathematica did.
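In case you want to play along at home without a copy of Mathematica, here's roughly what that bucketing step looks like. This is a sketch, not my actual code: runs_by_patch is a made-up stand-in for the real data, and zc_values() is the helper from the sketch above.

```python
import numpy as np

# runs_by_patch is a made-up stand-in for the real data: it maps each of
# the 264 patch names to a 76 x 3 array of L*a*b* values, one row per
# production run. Here it is faked with random noise so the sketch runs.
rng = np.random.default_rng(1)
runs_by_patch = {f"patch_{i:03d}": 50.0 + rng.normal(size=(76, 3))
                 for i in range(264)}

all_zc = []
for patch_name, lab in runs_by_patch.items():
    zc = zc_values(lab)          # zc_values() from the earlier sketch;
    all_zc.append(zc)            # each bucket uses its own mean and
                                 # covariance, so buckets stay separate
all_zc = np.concatenate(all_zc)  # about 20,000 dimensionless Zc values
```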
Now, I could have looked at them all individually, but the goal here is to test my 3D normal assumption. I'm gonna use a trick that I learned from Dave McDowell, which is called the CPDF.
Note on the terminology: CPDF stands for cumulative probability density function. At least that's the name that it was given in the stats class that I flunked out of in college. It is also called CPD (cumulative probability distribution), CDF (cumulative distribution function), and in some circles it's affectionately known as Clyde. In the graphic arts standards committee clique, it has gone by the name of CRF (cumulative relative frequency).
Here is the CPDF of the Ektachrome data set. I threw all the Zc values into one bucket. In this case, I can do that. They belong in the same bucket, since they are all dimensionless numbers... normalized to the same criteria. The solid blue line is the actual data. If you look real close, you can see a dotted line. That dotted line is the theoretical distribution for Zc that 3D normal would imply. Not just one particular distribution that happens to fit; the only one it could be.
20,000 color measurements gave their lives for this plot
Rarely do I see real world data that comes this close to fitting a theoretical model. It is clear that L*a*b* data can be 3D normal.
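If you want to draw your own CPDF plot, here's a sketch of the recipe: sort the Zc values, plot the cumulative fraction against Zc, and overlay the chi(3) CDF. Python again, with matplotlib; the plot_cpdf name is my own invention, and all_zc comes from the bucketing sketch above.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plot_cpdf(zc, title):
    """Empirical CPDF of the Zc values (solid) against the chi(3) CDF (dotted)."""
    zc = np.sort(zc)
    cum_fraction = np.arange(1, len(zc) + 1) / len(zc)
    x = np.linspace(0, zc.max(), 400)
    plt.plot(zc, cum_fraction, label="measured Zc")
    plt.plot(x, stats.chi(3).cdf(x), linestyle=":", label="chi, 3 d.o.f.")
    plt.xlabel("Zc")
    plt.ylabel("cumulative fraction")
    plt.title(title)
    plt.legend()
    plt.show()

plot_cpdf(all_zc, "CPDF of Zc, all patches pooled")
```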
More real world data
I have been collecting data. Lots of it. I currently have large color data sets from seven sources, encompassing 1,245 same-color data sets, and totaling 325K data points. When I can't sleep at night, I get up and play with my data.
[Contact me if you have some data that you would like to share. I promise to keep it anonymous. If you have a serious question that you want to interrogate your data with, all the better. Contact me. We can work something out.]
I now present some data from Company B, which is one of my anonymous sources. I know you're thinking this, but no. This is not where the boogie-woogie bugle boy came from. This complete data set includes 14 different printed patches, sampled from production runs over a full year. Each set has about 3,700 data points.
I first look at the data from the 50% magenta patch, since it is the most well-behaved. The images below are scatterplots of the L*a*b* data projected onto the a*b* plane, the a*L* plane, and the b*L* plane. The dashed ellipses are the 3.75 Zc ellipses. One might expect one out of 354 data points to be outside of those ellipses.
Three views of the M 50 data from Company B
Just in case you wanted to see a runtime chart, I provide one below. The red line is the 3.75 Zc cutoff. There were 24 data points where Zc > 3.75. This compares to the expectation of 10.5. This is the expectation under the assumption that the distribution is perfectly 3D normal. I am not concerned about this difference; it is my expectation that real life data will normally exceed the normal expectations by a little bit.
Another view of the M 50 data - Zc runtime plot
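By the way, those two numbers, the one out of 354 and the expectation of 10.5 or so, both fall straight out of the chi distribution with three degrees of freedom. Here's a quick sanity check, as a sketch:

```python
from scipy import stats

p_outside = stats.chi(3).sf(3.75)  # P(Zc > 3.75) if the cloud is 3D normal
print(1 / p_outside)               # roughly 354: one point in 354
print(3700 * p_outside)            # expected count beyond the cutoff in ~3,700 points
```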
So far, everything looks decent. No big warning flags. Let's have a look at the CPDF. Pardon my French, but this looks pretty gosh-darn spiffy. The match to the theoretical curve (the dotted line) is not quite as good as the Ektachrome data, but it's still a real good approximation.
Another great match
Conclusion so far: the variation in color data really can be 3D normal!
Still more real world data
I show below the CPDF of Zc for another data set from that same source, Company B. This particular data set is a solid cyan patch. The difference between the real data and the theoretical distribution is kinda bad.
A poor showing for the solid cyan patch
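If you would like a number to go with "kinda bad", one option (my choice here, not anything official) is the Kolmogorov-Smirnov statistic: the biggest vertical gap between the measured CPDF and the theoretical chi(3) curve. A sketch, with made-up stand-in data where the real cyan Zc values would go:

```python
import numpy as np
from scipy import stats

# zc_cyan is a stand-in for the Zc values of the solid cyan patch; here
# it is faked with deliberately misbehaved data just so the sketch runs.
rng = np.random.default_rng(0)
zc_cyan = stats.chi(3).rvs(size=3700, random_state=rng) * 1.15

# The KS statistic is the largest vertical gap between the empirical CPDF
# and the chi(3) CDF: near zero for a good fit, bigger for wonkier data.
result = stats.kstest(zc_cyan, stats.chi(3).cdf)
print(result.statistic)
```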
So, either there is something funky about this data set, or my assumption is wrong. Maybe 3D normal isn't necessarily normal? Let's zoom in a bit on this data set. First, we look at the runtime chart. (Note that this chart is scaled a bit differently than the previous one. This one tops out at Zc = 8, whereas the other goes up to 5.5.)
A runtime chart with some aberrant behavior
that will not go unpunished!
There are clearly some problems with this data. I have highlighted (red ellipse) two short periods where the color was just plain wonky. Some of the other outliers are a bit clustered as well. Below I have an a*b* scatter plot of that data (on the left), and a zoomed-in portion of that plot which shows some undeniable wonk.
Look at all the pretty dots that aren't in the corral where they belong
I'm gonna say that the reason the variation in this data set does not fit the 3D normal model is that this particular process is not in control. The case gets stronger that color variation is 3D normal when the process is under control.
Are you tired of looking at data yet?
We have looked at data from Company K and Company B. How about two data sets from Company R? These two data sets are also printed colors, but they are not the standard process colors. There are about 1,000 measurements of a pink spot color, and 600 measurements of a brown spot color. One new thing in this set... these are measurements from an inline system, so they are all from the same print run.
First we look at the CPDF for the pink data. Yes! I won't show the scatterplots in L*a*b*, but trust me. They look good. Another case of "3D normal" and "color process in good control" going hand-in-hand.
Yet another boring plot that corroborates my assumptions
Next we see the CPDF of Zc for the brown data. It's not as good as the pink data, or the Kodak, or the M 50 CPDF plots, but not quite as bad as the C 100. So, we might think that the process for brown is in moderate control?
Brown might not be so much in control?
The runtime chart of Zc looks pretty much like all the others (I could plop the image in here, but it wouldn't tell us much). The scatter plots of L*a*b* values also look reasonable... well, kinda. Let's have a look.
Halley's comet? Or a scatterplot of variation in brown?
This data doesn't look fully symmetric. It looks like it's a little skewed toward the lower left. And that is why the CPDF plot of brown looks a bit funky. Once again, we see that the CPDF of Zc values for a set of color variation is a decent way to quickly assess whether there is something wrong with the process.
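Eyeballs are a fine instrument, but the skew can also be put to a number. Here's a sketch, assuming the brown measurements sit in an n x 3 array (the array below is just a placeholder); a skewness noticeably different from zero on any axis backs up the visual impression.

```python
import numpy as np
from scipy import stats

# lab_brown stands in for the ~600 x 3 array of brown L*a*b* measurements;
# the placeholder below is random data so the sketch runs.
rng = np.random.default_rng(0)
lab_brown = rng.multivariate_normal([40.0, 20.0, 15.0], np.eye(3), size=600)

# A 3D normal cloud would have skewness near zero along every axis.
print(stats.skew(lab_brown, axis=0))  # one value each for L*, a*, b*
```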
But why is the brown plot skewed? I know the answer, but we're gonna have to wait for the full exposition in another blog post.
For the time being, let me state the thrilling conclusion of this blog post.
The thrilling conclusion of this blog post
When a color-producing process is "in control" (whatever that means), the variation in L*a*b* will be 3D normal. This means that we can look at the CPDF of Zc as a quick way to tell if we have taken the exit ramp to Wonkyville.