k-means: a brief interlude into Data Wrangling

When last we saw our heroes, what they thought was the brink of success turned out to be the precipice of hasty interpretation and now they are dangling for dear life on the branch of normalization! how's that for tortured metaphor!

If you use raw values for your k-means clustering, dimensions with large values or large ranges can swamp smaller dimensions and skew your clusters. The process of normalization tries to bring everything into the same range, usually [0,1], although your choices on how to transform the ranges are also significant. There is not always one best way to do it and, as usual, get familiar with your dataset and use your judgement.

Vital statistics

First, let's get to grips with some basic statistics along the dimensions of the piddle. We have 32 cars with 11 measurements: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb. (no, I don't know what some of these are and I suppose that's the burden of a data scientist upon receiving the data in its wild and untamed state)

pdl> p dims $cars 32 11  You might head for PDL::Stats::Basic but that starts with standard deviations and progresses on to correlation tables. For basic, basic statistics, you want to look in PDL::Ufunc for sum, average, median, min, max, etc. pdl>$avg = average $cars pdl> p$avg
[ 20.090625 6.1875  230.72188   146.6875  3.5965625 3.21725   17.84875 0.4375 0.40625 3.6875 2.8125]
pdl>p dims $avg 11  Ok, that looks sane compared with the first line of the data file and we have 11 averages of 11 dimensions. If we'd got it the other way around and found that we had 32 averages, we could do this instead pdl>$avg = average $cars->xchg(0,1)  Looks like getting the minimum and maximum values are just as easy. pdl>$min = minimum $cars pdl>$max = maximum $cars pdl> p$min
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
pdl> p $max [33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]  There's even a convenience function that lets you get the minimums, maximums and their respective indices with minmaximum giving the same values as above. pdl> ($min, $max,$min_ind, $max_ind) = minmaximum$cars


Be careful! Some of those functions like minmax act on the whole piddle.

pdl> ($mn,$mx) = minmax $cars pdl> p$mn, $mx 0 472  That could be useful for looking a file of unknown origin which you want to get to know or 2D images you want to compare. Maybe you're visualizing the rate of climate change and you have data cubes of daily worldwide temperatures[0] in files separated by year. You'll be glad that PDL has support for bad values Continuing with the median, mode, first and third quartile. pdl>$median = medover $cars pdl> p$median
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]

pdl> $mode = modeover$cars
pdl> p $mode [15 8 275 110 3 3 17 0 0 3 2] pdl>$q1 = pctover $cars, 0.25 pdl>$q3 = pctover $cars, 0.75 pdl> p join "\n",$min, $q1,$median, $q3,$max
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
[15.425 4 120.825 96.5 3.08 2.58125 16.8925 0 0 3 2]
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]
[22.8 8 326 180 3.92 3.61 18.9 1 1 4 4]
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]


The functions ending with the keyword over compute over the first dimension, not the whole piddle. Of course, pctover with the second argument in the range between 0.0 and 1.0 will give you any percentile you desire. See PDL::Reduce for an alternative interface.

For the standard deviation, variance, etc, it is time to saddle up PDL::Stats::Basic You can play with skew and kurtosis in your own time

pdl> $stdv = stdv$cars
pdl> p $stdv [ 5.9320296 1.7577951 121.98678 67.483071 0.52625807 0.9630477 1.7588006 0.49607837 0.4911323 0.72618438 1.5897622] pdl>$variance = var $cars pdl> p$variance
[ 35.188975  3.0898438  14880.775  4553.9648 0.27694756 0.92746088  3.0933797 0.24609375 0.24121094 0.52734375  2.5273438]


That module also has functions for the standard error, sum of squares, Student's T-test and friends, both biased and unbiased.

All of these numbers displayed out to 8 significant figures is kinda getting to me. I should ask about how to "pretty print" ... later.

Now where was I? Oh yes, Normalization

Let me start with a simple example, so that we know when we're going wrong. So use this as our warm up.

pdl> $x = sequence(6,5) pdl> p$x
[
[ 0  1  2  3  4  5]
[ 6  7  8  9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
]


These values are all positive, so we could just divide all values by the maximum for each dimension.

pdl> $max =$x->maximum
pdl> p $x /$max
PDL: PDL::Ops::divide(a,b,c): Parameter 'b':
Mismatched implicit thread dimension 0: size 6 vs. 5


Oops! The dimensions need to match, so we use the dummy function. (see Chapter 5 of the PDL Book for a good explanation)

pdl> p $x /$max->dummy(0)
[
[         0        0.2        0.4        0.6        0.8          1]
[0.54545455 0.63636364 0.72727273 0.81818182 0.90909091          1]
[0.70588235 0.76470588 0.82352941 0.88235294 0.94117647          1]
[ 0.7826087 0.82608696 0.86956522 0.91304348 0.95652174          1]
[0.82758621 0.86206897 0.89655172 0.93103448 0.96551724          1]
]


The values in that top row look nicely spread out, but the bottom row is squished into the top 20% of the range.

For better discrimination, we'll subtract the minimum from all values in that dimension to make the range start from zero and divide by the range of values. This has the added benefit of handling negative values with no extra thinking.

pdl> $min =$x->minimum
pdl> p $min,$max
[0 6 12 18 24] [5 11 17 23 29]

pdl> $range =$max - $min pdl> p$range
[5 5 5 5 5]

pdl> p (( $x -$min->dummy(0) ) / $range->dummy(0) ) [ [ 0 0.2 0.4 0.6 0.8 1] [ 0 0.2 0.4 0.6 0.8 1] [ 0 0.2 0.4 0.6 0.8 1] [ 0 0.2 0.4 0.6 0.8 1] [ 0 0.2 0.4 0.6 0.8 1] ]  z shoots! z-scores! You high-flyers out there might prefer the Standard score or z-score is a measure of how many standard deviations a value is away from the mean. Care needs to be taken ... yada yada. Care is for other people. We want an equation! $\Large&space;z=\frac{x-\mu}{\sigma}$ where μ is the mean and σ is the standard deviation which we have from above. pdl>$avg = average $x pdl>$sigma = stdv $x pdl> p$avg
[2.5 8.5 14.5 20.5 26.5]
pdl> p $sigma [ 1.7078251 1.7078251 1.7078251 1.7078251 1.7078251] pdl>$z = (($x -$avg->dummy(0))/$sigma->dummy(0)) pdl> p$z
[
[ -1.4638501 -0.87831007 -0.29277002  0.29277002  0.87831007   1.4638501]
[ -1.4638501 -0.87831007 -0.29277002  0.29277002  0.87831007   1.4638501]
[ -1.4638501 -0.87831007 -0.29277002  0.29277002  0.87831007   1.4638501]
[ -1.4638501 -0.87831007 -0.29277002  0.29277002  0.87831007   1.4638501]
[ -1.4638501 -0.87831007 -0.29277002  0.29277002  0.87831007   1.4638501]
]


Easy peasy, lemon squeezy. So it's not in the range [0,1] or even [-1,1], but it's regular enough to cluster with. Based on the average and variance of the population, it doesn't suffer as much from outliers affecting most of the data points, like the min/max version. For a skewed distribution, you could try a version using the median and the interquartile range (the difference between the first and third quartiles). Understanding your dataset is the clue to choosing the best strategy.

Waffling on a bit, other use cases have other transforms. This implementation of k-means uses the Euclidean distance to form clusters, but if you were clustering documents using word vectors you'd likely be using the cosine similarity and scaling the vectors by the inverse document-frequency of the terms. This technique weights rare terms higher than common ones in order to eliminate some of the noise associated with clustering in very high dimensions.

Hey, wait! There was a norm function hiding in the documentation all along.

pdl> p norm \$x

[
[         0 0.13483997 0.26967994 0.40451992 0.53935989 0.67419986]
[0.28252897 0.32961713 0.37670529 0.42379345 0.47088161 0.51796977]
[0.33554129 0.36350307 0.39146484 0.41942662 0.44738839 0.47535017]
[0.35722443 0.37707023 0.39691604 0.41676184 0.43660764 0.45645344]
[0.36896887 0.38434258 0.39971628 0.41508998 0.43046368 0.44583739]
]


What was that?!? Not what I was expecting. backs away slowly, closes door

Now that I know how to normalize a piddle amongst other things, I really should get back to clustering cars. That's what you came here for, right?

• [0] The files from the Climatic Research Unit (UoEA) are in NetCDF format, so you'll want to check out PDL::NetCDF to read them.