k-means: a brief interlude into Data Wrangling
When last we saw our heroes, what they thought was the brink of success turned out to be the precipice of hasty interpretation, and now they are dangling for dear life from the branch of normalization! How's that for a tortured metaphor?
If you use raw values for your k-means clustering, dimensions with large values or large ranges can swamp smaller dimensions and skew your clusters. Normalization tries to bring everything into the same range, usually [0,1], although how you choose to transform the ranges also matters. There is not always one best way to do it, so, as usual, get familiar with your dataset and use your judgement.
Vital statistics
First, let's get to grips with some basic statistics along the dimensions of the piddle. We have 32 cars with 11 measurements: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb. (No, I don't know what some of these are. I suppose that's the burden of a data scientist receiving data in its wild and untamed state.)
pdl> p dims $cars
32 11
You might head for PDL::Stats::Basic but that starts with standard deviations and progresses on to correlation tables. For basic, basic statistics, you want to look in PDL::Ufunc for sum, average, median, min, max, etc.
pdl> $avg = average $cars
pdl> p $avg
[ 20.090625 6.1875 230.72188 146.6875 3.5965625 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125]
pdl> p dims $avg
11
Ok, that looks sane compared with the first line of the data file, and we have 11 averages for our 11 dimensions. If we'd got it the other way around and found that we had 32 averages, we could do this instead:
pdl> $avg = average $cars->xchg(0,1)
Looks like getting the minimum and maximum values is just as easy.
pdl> $min = minimum $cars
pdl> $max = maximum $cars
pdl> p $min
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
pdl> p $max
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]
There's even a convenience function, minmaximum, that gives you the minima, maxima and their respective indices in one call, returning the same values as above.
pdl> ($min, $max, $min_ind, $max_ind) = minmaximum $cars
Be careful! Some of those functions like minmax act on the whole piddle.
pdl> ($mn, $mx) = minmax $cars
pdl> p $mn, $mx
0 472
That could be useful for looking at a file of unknown origin that you want to get to know, or 2D images you want to compare. Maybe you're visualizing the rate of climate change and you have data cubes of daily worldwide temperatures[0] in files separated by year. You'll be glad that PDL has support for bad values.
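Here's a quick sketch of how that might look; the -9999 sentinel is made up for illustration, so check what your files actually use. Once values are marked bad, the stats functions skip them:
pdl> $temps = pdl [12.5, -9999, 14.1, 13.0]
pdl> $temps = $temps->setbadif($temps == -9999)
pdl> p $temps
[12.5 BAD 14.1 13]
pdl> p avg $temps
13.2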
Continuing with the median, mode, and the first and third quartiles:
pdl> $median = medover $cars
pdl> p $median
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]
pdl> $mode = modeover $cars
pdl> p $mode
[15 8 275 110 3 3 17 0 0 3 2]
pdl> $q1 = pctover $cars, 0.25
pdl> $q3 = pctover $cars, 0.75
pdl> p join "\n", $min, $q1, $median, $q3, $max
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
[15.425 4 120.825 96.5 3.08 2.58125 16.8925 0 0 3 2]
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]
[22.8 8 326 180 3.92 3.61 18.9 1 1 4 4]
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]
The functions ending in over compute over the first dimension, not the whole piddle. See PDL::Reduce for an alternative interface. Of course, pctover with a second argument anywhere between 0.0 and 1.0 will give you any percentile you desire.
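The 90th percentile along each dimension, for instance, is just:
pdl> $p90 = pctover $cars, 0.9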
For the standard deviation, variance, etc., it is time to saddle up PDL::Stats::Basic. You can play with skew and kurtosis in your own time.
pdl> $stdv = stdv $cars
pdl> p $stdv
[ 5.9320296 1.7577951 121.98678 67.483071 0.52625807 0.9630477 1.7588006 0.49607837 0.4911323 0.72618438 1.5897622]
pdl> $variance = var $cars
pdl> p $variance
[ 35.188975 3.0898438 14880.775 4553.9648 0.27694756 0.92746088 3.0933797 0.24609375 0.24121094 0.52734375 2.5273438]
That module also has functions for the standard error, sum of squares, Student's T-test and friends, both biased and unbiased.
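If you want to poke at a couple of those, se and ss give the standard error of the mean and the sum of squared deviations, again one value per dimension:
pdl> $se = se $cars
pdl> $ss = ss $cars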
All of these numbers being displayed to 8 significant figures is kinda getting to me. I should ask about how to "pretty print" ... later.
Now where was I? Oh yes, Normalization
Let me start with a simple example, so that we know when we've gone wrong. Use this as our warm-up:
pdl> $x = sequence(6,5)
pdl> p $x
[
[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
]
These values are all positive, so we could just divide all values by the maximum for each dimension.
pdl> $max = $x->maximum
pdl> p $x / $max
PDL: PDL::Ops::divide(a,b,c): Parameter 'b':
Mismatched implicit thread dimension 0: size 6 vs. 5
Oops! The dimensions need to match, so we use the dummy function. (see Chapter 5 of the PDL Book for a good explanation)
pdl> p $x / $max->dummy(0)
[
[ 0 0.2 0.4 0.6 0.8 1]
[0.54545455 0.63636364 0.72727273 0.81818182 0.90909091 1]
[0.70588235 0.76470588 0.82352941 0.88235294 0.94117647 1]
[ 0.7826087 0.82608696 0.86956522 0.91304348 0.95652174 1]
[0.82758621 0.86206897 0.89655172 0.93103448 0.96551724 1]
]
The values in that top row look nicely spread out, but the bottom row is squished into the top 20% of the range.
For better discrimination, we'll subtract the minimum from all values in that dimension to make the range start from zero and divide by the range of values. This has the added benefit of handling negative values with no extra thinking.
pdl> $min = $x->minimum
pdl> p $min, $max
[0 6 12 18 24] [5 11 17 23 29]
pdl> $range = $max - $min
pdl> p $range
[5 5 5 5 5]
pdl> p (( $x - $min->dummy(0) ) / $range->dummy(0) )
[
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
]
z shoots! z-scores!
You high-flyers out there might prefer the standard score, or z-score, which is a measure of how many standard deviations a value is away from the mean. Care needs to be taken ... yada yada. Care is for other people. We want an equation!
z = (x - μ) / σ
where μ is the mean and σ is the standard deviation, which we have from above.
pdl> $avg = average $x
pdl> $sigma = stdv $x
pdl> p $avg
[2.5 8.5 14.5 20.5 26.5]
pdl> p $sigma
[ 1.7078251 1.7078251 1.7078251 1.7078251 1.7078251]
pdl> $z = (($x - $avg->dummy(0))/$sigma->dummy(0))
pdl> p $z
[
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
]
Easy peasy, lemon squeezy. So it's not in the range [0,1], or even [-1,1], but it's regular enough to cluster with. Because it's based on the average and variance of the population, it doesn't suffer as much as the min/max version from a few outliers squashing most of the data points into a small part of the range. For a skewed distribution, you could try a version using the median and the interquartile range (the difference between the first and third quartiles), as sketched below. Understanding your dataset is the clue to choosing the best strategy.
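Here's a minimal sketch of that robust version, reusing medover and pctover from above on our warm-up piddle:
pdl> $med = medover $x
pdl> $iqr = pctover($x, 0.75) - pctover($x, 0.25)
pdl> $robust = ($x - $med->dummy(0)) / $iqr->dummy(0)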
Waffling on a bit: other use cases call for other transforms. This implementation of k-means uses the Euclidean distance to form clusters, but if you were clustering documents using word vectors you'd likely be using the cosine similarity and scaling the vectors by the inverse document frequency of the terms. That technique weights rare terms higher than common ones, to cut through some of the noise associated with clustering in very high dimensions.
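To make the cosine similarity concrete, here's a back-of-the-envelope sketch with two made-up "document" vectors, using inner from PDL::Primitive. Parallel vectors score 1, orthogonal ones 0:
pdl> $a = pdl [1, 2, 3]
pdl> $b = pdl [2, 4, 6]
pdl> p( inner($a, $b) / sqrt(inner($a, $a) * inner($b, $b)) )
1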
... reads documentation
Hey, wait! There was a norm function hiding in the documentation all along.
pdl> p norm $x
[
[ 0 0.13483997 0.26967994 0.40451992 0.53935989 0.67419986]
[0.28252897 0.32961713 0.37670529 0.42379345 0.47088161 0.51796977]
[0.33554129 0.36350307 0.39146484 0.41942662 0.44738839 0.47535017]
[0.35722443 0.37707023 0.39691604 0.41676184 0.43660764 0.45645344]
[0.36896887 0.38434258 0.39971628 0.41508998 0.43046368 0.44583739]
]
What was that?!? Not what I was expecting. It turns out norm scales each row to unit Euclidean length, the kind of thing you'd want for cosine similarity, not for squeezing values into [0,1]. backs away slowly, closes door
Now that I know how to normalize a piddle amongst other things, I really should get back to clustering cars. That's what you came here for, right?
- [0] The files from the Climatic Research Unit (UoEA) are in NetCDF format, so you'll want to check out PDL::NetCDF to read them.
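Something like this ought to get you started; the filename and variable name here are hypothetical, so run ncdump -h on your file to see what's really inside:
pdl> use PDL::NetCDF
pdl> $nc = PDL::NetCDF->new('cru_tmp_1991.nc')
pdl> $tmp = $nc->get('tmp')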
My replies to comments on the previous post haven't appeared (found them in the spam bucket), so I'll take this space to continue that discussion as I get time.
Thanks, Saif!
Personally, I warm to the serialized posts and conversational tone of authors like byterock and others. The blogging technique I use is to have a vague idea of what I'd like to accomplish, open up an editor and write as I try and make it work, so you get to see a lot of dead ends.
As for a new PDL Book, that takes a lot of work if you just sit down, research and write it. On the other hand, lots of blog posts make a chapter and enough chapters make a book. Dave Cross has tried to make publishing Perl books easier through his Perl School site. There are no delusions that there is any money in this, but that's not really the reason why we write software.
It could be a collaborative effort, if you're interested?
Hi, I'd have reached out to you by email or GitHub but I can't figure out your ID. If you'd like to help with PDL communication, maybe making a PDL Advent calendar posting, then please comment on https://github.com/PDLPorters/pdl/issues/34 or email to the PDL-general mailing list or me (my CPAN ID is ETJ).
Obviously if you'd like to update or even make a sequel to "the PDL book" (now available in its entirety on the pdl.perl.org website), that would be even better!
We now have the ominously-numbered https://github.com/PDLPorters/pdl/issues/503 to track Advent calendar stuff.