Tuesday, November 21, 2017

Statistics of multi-dimensional data, example

In the previous blog post, Statistics of multi-dimensional data, theory, I introduced a generalization of the standard deviation to three-dimensional data. I called it ellipsification. In this blog post I am going to apply this ellipsification thing to real data to demonstrate the application to statistical process control of color.

I posted this cuz there just aren't enough trolls on the internet

Is the data normal?

In traditional SPC, the assumption is almost always made that the underlying variation is normally distributed. (This assumption is rarely challenged, so we blithely use the hammers that are conveniently in our toolbox -- standard SPC tools -- to drive in screws. But that's another rant.)

The question of normalcy is worth addressing. First off, since I am at least pretending to be a math guy, I should at least pay lip service to stuff that has to do with math. Second, we are venturing into uncharted territory, so it pays to be cautious. Third, we already have a warning that deltaE color difference is not normal. Ok, maybe a bunch of warnings. Mostly from me.

I demonstrate in the next section that my selected data set can be transformed into another data set with components that are uncorrelated, have zero mean and standard deviation of 1.0, and which give every indication of being normal. So, one could use this transform on the color data and apply traditional SPC techniques on the individual components, but you will see that I take this one step further.

    Original data

I use the solid magenta data from the data set that I describe in the section below called "Provenance for this data". I picked magenta because it is well known that it has a "hook". In other words, as you increase pigment level or ink film thickness, it changes hue. The thicker the magenta ink, the redder it goes. Note that this can be seen in the far left graph as a tilt to the ellipsoid.

I show three views of the data below. The black ellipses are slices through the middle of the ellipsification in the a*b* plane, the L*a* plane, and the L*b* plane, respectively.

View from above

View from the b* axis

View from the a* axis

    Standardized data

Recall for the moment when you were in Stats 201. I know that probably brings up memories of that cute guy or girl that sat in the third row, but that's not what I am talking about. I am talking about standardizing the data to create a Z score. You subtracted the mean and then divided by the standard deviation so that the standardized data set has zero mean, and standard deviation of 1.0.

I will do the same standardization, but generalized to multiple dimensions. One change, though. I need an extra step to rotate the axes of the ellipsoid so that all the axes are aligned with the coordinate axes. The cool thing is that the new scores (call them Z1, Z2, and Z3, if you like) are now all uncorrelated.

Geometrically, the operations are as follows: subtract the mean, rotate the ellipsoid, and then squish or expand the individual axes to make the standard deviations all equal to 1.0. The plot below shows three views of the data after standardization. (Don't ask me which axes are L*, a*, and b*, by the way. These are not L*, a*, or b*.)

Standardized version of the L*, a*, and b* variation charts

Not much to look at -- some circular blobs with perhaps a tighter pattern nearer the origin. That's what I would hope to see. 
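For anyone who wants to try the subtract, rotate, and squish recipe at home, here is a minimal sketch in plain Python. It is not the code I used for the plots. The Jacobi eigenvalue routine, the function names, and the made-up data from make_fake_lab are all assumptions for illustration, and I assume the covariance is computed by dividing by N rather than N - 1.

```python
import math
import random

def mean_and_covariance(points):
    """Mean vector and covariance matrix (dividing by N, an assumption)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(dim)] for i in range(dim)]
    return mean, cov

def jacobi_eigen(matrix, sweeps=50):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Pick the small rotation angle that zeroes a[p][q].
                x, y = a[q][q] - a[p][p], 2.0 * a[p][q]
                if x < 0:
                    x, y = -x, -y
                theta = 0.5 * math.atan2(y, x)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):        # rotate columns p and q
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):      # rotate rows p and q of a
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v

def standardize(points):
    """Subtract the mean, rotate onto the ellipsoid axes, scale to unit sigma."""
    mean, cov = mean_and_covariance(points)
    eigvals, v = jacobi_eigen(cov)
    dim = len(mean)
    return [[sum((p[i] - mean[i]) * v[i][j] for i in range(dim))
             / math.sqrt(eigvals[j]) for j in range(dim)] for p in points]

def make_fake_lab(n=204, seed=1):
    """Made-up correlated 3D points. NOT the SNAP magenta data."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        u, w, t = rng.gauss(0, 2.8), rng.gauss(0, 1.1), rng.gauss(0, 1.0)
        pts.append([50 + 0.3 * u + w, 70 + 0.9 * u - 0.2 * w + t, -5 + 0.4 * u + 0.5 * t])
    return pts

z_scores = standardize(make_fake_lab())
z_mean, z_cov = mean_and_covariance(z_scores)
```

If the arithmetic is right, z_mean comes out as zeros and z_cov as the identity matrix, to within rounding. Any rotation that lines up with the ellipsoid axes will do, so the order and sign of Z1, Z2, and Z3 are arbitrary.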

Here are the stats on this data:

      Mean   Stdev    Skew  Kurtosis
Z1   0.000   1.000  -0.282    -0.064
Z2   0.000   1.000   0.291     0.163
Z3   0.000   1.000  -0.092    -0.658

The mean and standard deviation are exactly 0.000 and 1.000. This is reassuring, but not a surprise. It just means that I did the arithmetic correctly. I designed the technique to do this! Another thing that happened by design is that the correlations between Z1 and Z2, and between Z1 and Z3 are both exactly 0.000. Again, not a surprise. Driving those correlations to zero was the whole point of rotating the ellipsoid, which I don't mind saying was no easy feat.

The skew and kurtosis are more interesting. For an ideal normal distribution, these two values will be zero. Are they close enough to zero? None of these numbers are big enough to raise a red flag. (In the section below entitled "Range for skew and kurtosis", I give some numbers to go by to scale our expectation of skew and kurtosis.)
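In case it helps, here is a small sketch of how skew and kurtosis can be computed so that a normal distribution scores zero on both. I am assuming the simple moment-based formulas, with kurtosis reported as "excess" kurtosis (3 subtracted). Other software may apply small-sample corrections, so its numbers can differ a little. The function name is my own invention.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skew and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A symmetric, flat-topped toy data set: zero skew, negative kurtosis.
skew, kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
```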

In the typical doublespeak of a statistician, I can say that there is no evidence that the standardized color variation is not normal. Of course, that's not to say that the standardized color variation actually is normal, but a statement like that would be asking too much from a statistician. Suffice it to say that it walks like normally distributed data and quacks like normally distributed data.

Dr. Bunsen Honeydew lectures on proper statistical grammar

This is an important finding. At least for this one data set, we know that the standardized scores Z1, Z2, and Z3 can be treated independently as normally distributed variables. Or, as we shall see in the next section, we can combine them into one number that has a known distribution.

Can we expect that all color variation data behaves this nicely when it is standardized by ellipsification? Certainly not. If the data is slowly drifting, the standardization might yield something more like a uniform distribution. If the color is bouncing back and forth between two different colors, then we expect the standardized distributions to be bi-modal. But I intend to look at a lot of color to try to see if 3D normal distribution is the norm for processes that are in control.

In the words of every great research paper ever written, "clearly more research is called for".

The Zc statistic

I propose a statistic for SPC of color, which I call Zc. This is a generalization of the Z statistic that we all know and love. This new statistic could be applied to any multi-dimensional data that we like, but I am reserving the name to apply to three-dimensional data, in particular, to color data. (The c stands for "color". If you have trouble remembering that, then note that c is the first letter of my middle name.)

Zc is determined by first ellipsifying the data set. The data set is then standardized, and then each data point is reduced to a single number (a scalar), as described in the plot below. The red points are a standardization of the data set we have been working with. I have added circles at Zc of 1, 2, 3, 4. Any data points on one of these circles will have a Zc score of the corresponding circle. Points in between will have intermediate values, which are the distance from the origin. Algebraically, Zc is the sum in quadrature of the individual three components, that is to say, the square root of the sum of the squares of the three individual components.

A two-dimensional view of the Z scores

Now that we have standardized our data into three uncorrelated random variables that are (presumably) Gaussian with zero mean and unit standard deviation, we can build on some established statistics. The sum of the squares of our standardized variable will follow a chi-squared distribution, and the square root of the sums of the squares will follow a chi distribution. Note that this quantity is the distance from the data point to the origin.
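Here is a quick way to convince yourself of the chi claim without any theory: draw three independent standard normal numbers, compute the distance to the origin, and repeat many times. This is a simulation sketch with made-up random data, not the magenta measurements. The function names, the seed, and the sample size are arbitrary choices of mine.

```python
import math
import random

def zc(z1, z2, z3):
    """Distance from the origin of a standardized point."""
    return math.sqrt(z1 * z1 + z2 * z2 + z3 * z3)

def fraction_below(threshold, n=100_000, seed=0):
    """Fraction of simulated Zc scores that fall below the threshold."""
    rng = random.Random(seed)
    hits = sum(
        zc(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) < threshold
        for _ in range(n)
    )
    return hits / n

share_below_3 = fraction_below(3.0)
```

With this many draws, share_below_3 should land close to 0.97, which is the figure that the chi distribution with three degrees of freedom gives for Zc < 3.0.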

Chi is the Greek version of our letter X. It is pronounced with the hard K sound, although I have heard neophytes embarrass themselves by pronouncing it with the ch sound. To make things even more confusing, there is a Hebrew letter chai which is pronounced kinda like hi, only with that rasping thing in the back of your throat. Even more confusing is the fact that the Hebrew chai looks a lot like the Greek letter pi, which is the mathematical symbol for all things circular like pie and cups for chai tea. But the Greek letter chi has nothing to do with either chai tea, or its Spoonerism tai chi.

Whew. Glad I got that outa the way.

Why is it important that we can put a name on the distribution? This gives us a yardstick from which to gauge the probability that any given data point belongs to the set of typical data. The table below gives some probabilities for the Zc distribution. Here is an example that will explain the table a bit. The fifth row of the table says that 97% of the data points that represent typical behavior will have Zc scores of less than 3.0. Thus the chance that a given data point will have a Zc score larger than that is 1 in 34.

Levels of significance of Zc

Zc    P(Zc)     Chance (1 in)
1.0   0.19875       1
1.5   0.47783       2
2.0   0.73854       4
2.5   0.89994      10
3.0   0.97071      34
3.5   0.99343     152
4.0   0.99887     882
4.5   0.99985    6623
5.0   0.99999   66667
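If you would like to regenerate this table, the chi distribution with three degrees of freedom has a closed-form cumulative distribution that needs nothing fancier than the error function. This is a sketch, and chi3_cdf is my own name for it. The last couple of rows may differ slightly from the table above, since the chance column is very sensitive to rounding of P when P is that close to 1.

```python
import math

def chi3_cdf(z):
    """P(Zc < z) for a chi distribution with three degrees of freedom."""
    return (math.erf(z / math.sqrt(2))
            - math.sqrt(2 / math.pi) * z * math.exp(-z * z / 2))

for z in (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0):
    p = chi3_cdf(z)
    # "Chance" is 1 / P(Zc > z), that is, one point in this many lands beyond z.
    print(f"{z:3.1f}  {p:.5f}  1 in {1 / (1 - p):.0f}")
```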

The graph below is a run time chart of the Zc scores for the 204 data points that we have been dealing with. The largest score is about 3.5. We would be hard pressed to call this an aberrant point, since the table above says that there is a 1 in 152 chance of such data happening at random. By the way, we had 204 data points, so we should expect about 1 data point above 3.5. A further test: I count eight data points where the Zc score is above 3.0. Based on the table, I expect about 6.

My conclusion is that there is nothing funky about this data.

Runtime chart for Zc of the solid magenta patches
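The expected counts quoted above are just the number of data points times the tail probability from the table. The sketch below does that arithmetic, and then goes one step beyond what I did in the text: it asks how surprising eight points above 3.0 would be if about six are expected. For that step I assume the count of exceedances is roughly Poisson distributed, which is my assumption for illustration and not something established above.

```python
import math

n_points = 204
p_below = {3.0: 0.97071, 3.5: 0.99343}   # P(Zc) values from the table

# Expected number of points above each threshold.
expected = {z: n_points * (1 - p) for z, p in p_below.items()}

def poisson_tail(k, lam):
    """P(X >= k) for a Poisson count with mean lam (assumed model)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below

chance_of_8_or_more = poisson_tail(8, expected[3.0])
```

Under that assumption, seeing eight or more points above 3.0 is not a rare event, which is in keeping with the conclusion that there is nothing funky about this data.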

Where do we draw the line between common cause and special cause variation? In traditional SPC, we use Z > 3 as the test for individual points. Note that for a normal distribution, the probability of Z < 3 is 0.99865, or one chance in 741 of Z > 3.0. This is pretty close to the probability of Zc < 4 for a chi distribution. In other words, if you are using Z > 3 as a threshold for QC with normally distributed data, then you should use Zc > 4 when using my proposed Zc statistic for color data. Four is the new three.
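To see how close "four is the new three" really is, the sketch below finds the Zc value whose tail probability exactly matches the one-sided Z > 3 tail of a normal distribution. It uses plain bisection. The function names and the search range of 0 to 10 are my own choices.

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def chi3_cdf(z):
    """P(Zc < z) for a chi distribution with three degrees of freedom."""
    return (math.erf(z / math.sqrt(2))
            - math.sqrt(2 / math.pi) * z * math.exp(-z * z / 2))

def matching_zc_threshold(z=3.0):
    """Zc value with the same tail probability as a one-sided normal Z."""
    target = normal_cdf(z)
    lo, hi = 0.0, 10.0
    for _ in range(60):             # bisection; chi3_cdf is increasing
        mid = (lo + hi) / 2
        if chi3_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

threshold = matching_zc_threshold(3.0)
```

The exact match comes out a little under 4, so rounding up to Zc > 4 errs slightly on the side of fewer false alarms.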

Provenance for this data

In 2006, the SNAP committee (Specifications for Newspaper Advertising Production) took on a large project to come to some consensus about what color you get when you mix specific quantities of CMYK ink on newsprint. A total of 102 newspapers printed a test form on their presses. The test form had 928 color patches. All of the test forms were measured by one very busy spectrophotometer. The data was averaged by patch type, and it became known as CGATS TR 002.

Some of the patches were duplicated on the sheet for quality control. In particular all of the solids were duplicated. Thus, in the blog post, I was dealing with 204 measurements of a magenta solid patch from 102 different newspaper printing presses.

Range for skew and kurtosis

How do we decide when a value of skew or kurtosis is indicative of a non-normal distribution? Skew should be 0.0 for normal variation, but can it be 0.01 and still be normal? Or 0.1? Where is the cutoff?

Consider this: the values for skew and kurtosis that we compute from a data set are just estimates of some metaphysical skew and kurtosis. If we asked all the same printers to submit another data set the following day, we would likely have somewhat different values for all the statistics. If we had the leisure of collecting a Gillian or a Brilliant or even a vermillion measurements, we would have a more accurate estimate of these statistical measures.

Luckily some math guy figgered out a simple formula that allows us to put a reliability on the estimates of skew and kurtosis that we compute.

Our estimate of skew has a standard deviation of sqrt (6 / N). For N = 204 (as in our case) this works out to 0.171. So, an estimate of skew that is outside of the range from -0.342 to 0.342 is suspect, and outside the range of -0.513 to 0.513 is very suspect.

For kurtosis, the standard deviation of the estimate is sqrt (24/N), which gives us a range of +/- 0.686 for suspicious and +/- 1.029 for very suspicious.
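Here is that arithmetic as a few lines of Python, applied to the skew and kurtosis values from the table of Z1, Z2, and Z3 earlier in this post. The two and three sigma cutoffs for "suspect" and "very suspect" are the ones described above. The verdict function is just a hypothetical helper of mine.

```python
import math

n = 204
se_skew = math.sqrt(6 / n)    # standard deviation of the skew estimate
se_kurt = math.sqrt(24 / n)   # standard deviation of the kurtosis estimate

# (skew, kurtosis) for each standardized component, from the stats table.
stats = {"Z1": (-0.282, -0.064), "Z2": (0.291, 0.163), "Z3": (-0.092, -0.658)}

def verdict(value, se):
    """Classify an estimate by how many standard deviations it is from zero."""
    if abs(value) > 3 * se:
        return "very suspect"
    if abs(value) > 2 * se:
        return "suspect"
    return "ok"

for name, (skew, kurt) in stats.items():
    print(name, "skew:", verdict(skew, se_skew), "kurtosis:", verdict(kurt, se_kurt))
```

All six values fall inside the two sigma ranges, although the kurtosis of Z3 is not far from the edge.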

Tuesday, November 14, 2017

Statistics of multi-dimensional data, theory

This blog post is the culmination of a long series of blog posts on the statistics of color difference data. Most of them just basically said "yeah, normal stats don't work". Lotta help that is, eh? Several blog posts alluded to the fact that I did indeed have a solution. The most recent one alluded to a method that works in the very title of the blog post: Statistical process control of color, approaching a method that works.

Now it's time to unveil the method.

Generalization of the standard deviation

One way of describing the technique is to call it a generalization of the standard deviation to multiple dimensions -- three dimensions if we are dealing with color data. That's a rather abstract concept, so I will explain.

     One dimensional standard deviation

We can think of our good friends, the standard deviation and mean, as describing a line segment on the number line, as illustrated below. If the data is normally distributed (also called Gaussian, or bell curve), then you would expect that about 68% of the data will fall on the line segment within one standard deviation unit (one sigma) of the mean, 95.45% of the data will fall within two sigma of the mean, and 99.73% of the data will be within three sigma of the mean.

As an aside, note that not all data is normally distributed. This holds true for color difference data, which is the issue that got me started down this path!

So, a one-dimensional standard deviation can be thought of as a line segment that is 2 sigma long, and centered on the mean of the data. It is a one-dimensional summary of all the underlying data.

     Two-dimensional standard deviation

Naturally, a two-dimensional standard deviation is a two-dimensional summary of the underlying two-dimensional data. But instead of a (one-dimensional) line segment, we get an ellipse in two dimensions.

In the simplest case, the two-dimensional standard deviation is a circle (shown in orange below) which is centered on the average of the data points. The circle has a radius of one sigma. If you want to get all mathematical about this, the circle represents a portion of a two-dimensional Gaussian distribution with 39% of the data falling inside the circle, and 61% falling outside.

Two dimensional histogram of a simple set of two dimensional data
The orange line encompasses 39% of the volume.

I slipped a number into that last paragraph that deserves to be underlined: 39%. Back when we were dealing with one-dimensional data, +/- one sigma would encompass 68% of normally distributed data. The number for two-dimensional data is 39%. Toto, I have a feeling we're not in one-dimensional-normal-distribution-ville anymore.
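Where do 68% and 39% come from? For a normal distribution, the fraction of data within k sigma has a tidy formula in one dimension (the error function) and an even tidier one in two dimensions. The sketch below just evaluates those two formulas. The function names are my own.

```python
import math

def coverage_1d(k):
    """Fraction of 1D normal data within k sigma of the mean."""
    return math.erf(k / math.sqrt(2))

def coverage_2d(k):
    """Fraction of 2D normal data inside the k sigma ellipse."""
    return 1 - math.exp(-k * k / 2)

for k in (1, 2, 3):
    print(f"{k} sigma: 1D {coverage_1d(k):.4f}   2D {coverage_2d(k):.4f}")
```

At one sigma this gives about 0.68 in one dimension and about 0.39 in two.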

Of course, not all two-dimensional standard deviations are circular like the one in the drawing above. More generally, they will be ellipses. The lengths of the semi-major and semi-minor axes of the ellipse are the major and minor standard deviations.

--- Taking a break for a moment

I better stop to review some pesky vocabulary terms. A circle has a radius, which is the distance from the center of the circle to any point on the circle. A circle also has a diameter, which is the distance between opposite points on the circle. The diameter is twice the radius.

When we talk about ellipses, we generally refer to the two axes of the ellipse. The major axis is the longest line segment that goes through the center of the ellipse. The minor axis is the shortest line segment that goes through the center of the ellipse. The lengths of the major and minor axes are essentially the extremes of the diameters of the ellipse. They run perpendicular to each other.

An ellipse, showing off the most gorgeous set of axes I've ever seen

There is no convenient word for the two "radii" of an ellipse. All we have are the inconvenient phrases semi-major axis and semi-minor axis. These are half the length of the major and minor axes, respectively.

--- Break over, time to get back to work

The axes of the ellipses won't necessarily be straight up and down and left-to-right on a graph. So, the full description of the two-dimensional standard deviation must include information to identify the orientation of these axes.

The image below shows a set of hypothetical two-dimensional data that has been ellipsified. The red dots are random data that was generated using Mathematica. I asked it to give me 200 normally distributed x data points with a standard deviation of 3, and 200 normally distributed y data points with a standard deviation of 1. These original data points (the x and y values) were uncorrelated.

This collection of data points was then rotated by 15 degrees so that the new x values had a bit of y in them, and the new y values had a bit of x in them. In other words, there was some correlation (r = 0.6) between the new x and y. I then added 6 to the new x values and 3 to the new y values to move the center of the ellipse. So, the red data points are designed to represent some arbitrary data set that could just happen in real life.

I performed an ellipsification, and have plotted the one, two, and three sigma ellipses (in pink). The major and minor axes of the one sigma ellipse are shown in blue.

Gosh darn! that's purdy!

The result of ellipsifying this data is all the parameters pertaining to the innermost of the ellipses in the image above. This is an ellipse that is centered on {6.11, 3.08}, with semi-major axis of 3.19 and semi-minor axis of 1.00. The ellipse is oriented at 15.8 degrees. These are all rather close to the original parameters that I started with, so I musta done sumthin right.

I also counted the number of data points within the three ellipses. I counted 38.5% in the 1 sigma ellipse, 88.5% in the 2 sigma ellipse, and 99% in the 3 sigma ellipse. (Of course when I say I did this, I really mean that Mathematica gave me a little help.) If the data follows a two-dimensional normal distribution, then the ellipses will encompass 39%, 86.5%, and 98.9% of the data. This is one indication that this condition is met.

The following pieces of information are determined in the ellipsification process of two-dimensional data:

     a) The average of the data which is the center of the ellipse (two numbers, for the horizontal and vertical values)
     b) The orientation of the ellipse (which could be a single number, such as the rotation angle)
     c) The lengths of the semi-major and semi-minor axes of the ellipse (two numbers)

The ellipsification can be described in other ways, but these five numbers will tell me everything about the ellipse. The ellipse is the statistical proxy for the whole data set.
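The two-dimensional experiment described above can be sketched in plain Python rather than Mathematica. What follows is not my original code. It assumes that the ellipse comes from the covariance matrix of the data (center = mean, semi-axes = square roots of the eigenvalues, orientation = direction of the leading eigenvector), and the function names and random seed are arbitrary.

```python
import math
import random

def make_data(n=200, seed=0):
    """Uncorrelated x (sigma 3) and y (sigma 1), rotated 15 degrees, moved to (6, 3)."""
    rng = random.Random(seed)
    t = math.radians(15)
    pts = []
    for _ in range(n):
        x, y = rng.gauss(0, 3), rng.gauss(0, 1)
        pts.append((6 + x * math.cos(t) - y * math.sin(t),
                    3 + x * math.sin(t) + y * math.cos(t)))
    return pts

def ellipsify(points):
    """Return (center, angle in degrees, semi-major, semi-minor): five numbers."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of the 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    semi_major = math.sqrt(half_trace + root)
    semi_minor = math.sqrt(half_trace - root)
    angle = math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))
    return (mx, my), angle, semi_major, semi_minor

def fraction_inside(points, ellipse, k):
    """Fraction of the points that fall inside the k sigma ellipse."""
    (mx, my), angle, a, b = ellipse
    c, s = math.cos(math.radians(angle)), math.sin(math.radians(angle))
    count = 0
    for x, y in points:
        dx, dy = x - mx, y - my
        u = (dx * c + dy * s) / a      # coordinate along the major axis, in sigmas
        v = (-dx * s + dy * c) / b     # coordinate along the minor axis, in sigmas
        count += u * u + v * v <= k * k
    return count / len(points)

data = make_data()
ellipse = ellipsify(data)
inside = [fraction_inside(data, ellipse, k) for k in (1, 2, 3)]
```

The numbers will not match mine exactly, since this is a different random draw, but the center, angle, and semi-axes should land near {6, 3}, 15 degrees, 3, and 1, and the three fractions should be in the neighborhood of 39%, 86.5%, and 98.9%.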

     Three-dimensional standard deviation

The extension to three-dimensional standard deviation is "obvious". (Obvious is a mathematician's way of saying "I don't have the patience to explain this to mere mortals.")

The result of ellipsifying three-dimensional data is the following nine pieces of information that are necessary to describe an arbitrary (three-dimensional) ellipsoid:

    a) The average of the data (three numbers, for the average of the x, y, and z values)
    b) The orientation of the ellipsoid (three numbers defining the direction that the axes point)
    c) The lengths of the semi-major, semi-medial, and semi-minor axes of the ellipsoid (three numbers)

The image below is an ellipsification of real color data. The data is the color of a solid patch as produced by 102 different newspaper printing presses. There were two samples of this patch from each press, so the number of dots is 204.

The 204 dots were used to compute the three-dimensional standard deviation, which is represented by the three lines. The longest line, the red line, is the major axis of the ellipsoid, and has a length of 5.6 CIELAB units. The green and blue lines are the medial and minor axes, respectively. They are 2.2 and 2.1 CIELAB units long. All three of the axes meet at the mean of all the data points, and all three are two-sigma long (+/-1 sigma from the mean). Depending on the angle from which you are looking, it may not appear that the axes are all perpendicular to each other, but they are.

Ellipsification of some real data, as shown with the axes of the ellipsoid

Trouble viewing the image above? The image is a .gif image, so you should see it rotating. If it doesn't, then try a different browser, or download it to your computer and view it directly.

What can we do with the ellipsification?

The ellipsification of color data is a three-dimensional version of the standard deviation, so in theory, we can use it for anything that we would use the standard deviation for. The most common use (in the realm of statistical process control) is to decide whether a given data point is an example of typical process variation, or if there is some nefarious agent at work. (Deming would call that a special cause.)

We will see an example of this on real data in the next blog post on the topic: Statistics of multi-dimensional data, example.