Possibly the best k-means clustering ... in the world!
Short post this time because I got nerd-sniped looking at the data. The fun part is that you quickly move from thinking about how to get your results to trying to work out what they mean.
Forget why I started down this road. Right now, we are seeking the answer to Lewis Carroll's famous question: how is a Porsche 914-2 like a Volvo 142E? (Well, that's what it was in the first draft.) A quick summary for those who have just joined us.
pdl> use PDL::IO::CSV ':all'
pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], {text2bad => 1, header => 1, debug => 1});
pdl> p $cars->dims
32 11
You got 32 11, right?
If you get 31 11 instead, install the latest version of PDL::IO::CSV (at least v0.011) or use the workaround in Working with CSV.
Next, get the names into an arrayref so we can identify the members in each cluster.
pdl> ($names) = rcols 'mtcars_001.csv', { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };
In Data Wrangling I thought that Z-scores might be the best normalisation scheme to even up the differences between ranges. Oh, how naïve!
pdl> use PDL::Stats; # pulls in stdv and kmeans
pdl> $avg = average $cars
pdl> $sigma = stdv $cars
pdl> $zcars = (($cars - $avg->dummy(0))/$sigma->dummy(0))
(that last variable name is for fans of '60s British cop shows)
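If you're the belt-and-braces type, check the normalisation actually worked: each column should now average to (near enough) zero with a standard deviation of one.
pdl> # sanity check: expect a row of near-zeros and a row of ones
pdl> p average $zcars
pdl> p stdv $zcars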
Pretty Printing
A recipe for pretty printing the output, from Derek Lamb, using PDL::Core's $PDL::doubleformat:
pdl> p $PDL::doubleformat
%10.8g
We don't need 8 significant digits for eyeballing the data, so shorten the format:
pdl> $PDL::doubleformat = '%4.2g'
pdl> p $zcars
[
[ 0.15 0.15 0.46 0.22 -0.23 -0.34 -0.98 0.73 0.46 -0.15 -0.39 -0.62 -0.47 -0.82 -1.6 -1.6 -0.91 2.1 1.7 2.3 0.24 -0.77 -0.82 -1.1 -0.15 1.2 1 1.7 -0.72 -0.066 -0.86 0.22]
[ -0.11 -0.11 -1.2 -0.11 1 -0.11 1 -1.2 -1.2 -0.11 -0.11 1 1 1 1 1 1 -1.2 -1.2 -1.2 -1.2 1 1 1 1 -1.2 -1.2 -1.2 1 -0.11 1 -1.2]
[ -0.58 -0.58 -1 0.22 1.1 -0.047 1.1 -0.69 -0.74 -0.52 -0.52 0.37 0.37 0.37 2 1.9 1.7 -1.2 -1.3 -1.3 -0.91 0.72 0.6 0.98 1.4 -1.2 -0.91 -1.1 0.99 -0.7 0.58 -0.9]
[ -0.54 -0.54 -0.8 -0.54 0.42 -0.62 1.5 -1.3 -0.77 -0.35 -0.35 0.49 0.49 0.49 0.86 1 1.2 -1.2 -1.4 -1.2 -0.74 0.049 0.049 1.5 0.42 -1.2 -0.83 -0.5 1.7 0.42 2.8 -0.56]
[ 0.58 0.58 0.48 -0.98 -0.85 -1.6 -0.73 0.18 0.61 0.61 0.61 -1 -1 -1 -1.3 -1.1 -0.7 0.92 2.5 1.2 0.2 -1.6 -0.85 0.25 -0.98 0.92 1.6 0.33 1.2 0.045 -0.11 0.98]
[ -0.62 -0.36 -0.93 -0.0023 0.23 0.25 0.37 -0.028 -0.07 0.23 0.23 0.89 0.53 0.58 2.1 2.3 2.2 -1.1 -1.7 -1.4 -0.78 0.31 0.23 0.65 0.65 -1.3 -1.1 -1.8 -0.049 -0.46 0.37 -0.45]
[ -0.79 -0.47 0.43 0.9 -0.47 1.3 -1.1 1.2 2.9 0.26 0.6 -0.26 -0.14 0.086 0.075 -0.016 -0.24 0.92 0.38 1.2 1.2 -0.56 -0.31 -1.4 -0.45 0.6 -0.65 -0.54 -1.9 -1.3 -1.8 0.43]
[ -0.88 -0.88 1.1 1.1 -0.88 1.1 -0.88 1.1 1.1 1.1 1.1 -0.88 -0.88 -0.88 -0.88 -0.88 -0.88 1.1 1.1 1.1 1.1 -0.88 -0.88 -0.88 -0.88 1.1 -0.88 1.1 -0.88 -0.88 -0.88 1.1]
[ 1.2 1.2 1.2 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 -0.83 1.2 1.2 1.2 -0.83 -0.83 -0.83 -0.83 -0.83 1.2 1.2 1.2 1.2 1.2 1.2 1.2]
[ 0.43 0.43 0.43 -0.95 -0.95 -0.95 -0.95 0.43 0.43 0.43 0.43 -0.95 -0.95 -0.95 -0.95 -0.95 -0.95 0.43 0.43 0.43 -0.95 -0.95 -0.95 -0.95 -0.95 0.43 1.8 1.8 1.8 1.8 1.8 0.43]
[ 0.75 0.75 -1.1 -1.1 -0.51 -1.1 0.75 -0.51 -0.51 0.75 0.75 0.12 0.12 0.12 0.75 0.75 0.75 -1.1 -0.51 -1.1 -1.1 -0.51 -0.51 0.75 -0.51 -1.1 -0.51 -0.51 0.75 2 3.3 -0.51]
]
Now let's cluster again
pdl> %k = $zcars->kmeans # not $cars or you'll wonder why
CNTRD => Null
FULL => 0
NCLUS => 3
NSEED => 32
NTRY => 5
V => 1
overall ss: 352.000000000001
iter 0 R2 [0.59 0.51 0.54 0.39 0.53]
iter 1 R2 [0.61 0.62 0.63 0.55 0.57]
iter 2 R2 [0.62 0.62 0.63 0.58 0.57]
iter 3 R2 [0.62 0.62 0.63 0.6 0.57]
I'm thinking it has gone through 5 tries (NTRY) of 4 iterations each before settling on R2 values between 0.57 and 0.63. Is that good?
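If those R2 values worry you, kmeans takes an options hashref (that's where the defaults printed above come from), so you can throw more random starts at it and compare what comes back. A sketch, assuming I'm reading the docs right that the returned R2 key holds the winning solution's R2:
pdl> %k2 = $zcars->kmeans({NTRY => 20, V => 0})  # 20 tries, quietly
pdl> p $k2{R2}                                   # R2 of the best of the 20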
Look at cluster number one's members
pdl> $cluster = $k{cluster};
pdl> p $cluster(:,0);
[
[0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1]
]
pdl> $c = $k{cluster}
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,0) > 0;}
"Datsun 710""Hornet 4 Drive""Valiant""Merc 240D""Merc 230""Merc 280""Merc 280C""Fiat 128""Honda Civic""Toyota Corolla""Toyota Corona""Fiat X1-9""Porsche 914-2""Lotus Europa""Volvo 142E"
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,1) > 0;}
"Hornet Sportabout""Duster 360""Merc 450SE""Merc 450SL""Merc 450SLC""Cadillac Fleetwood""Lincoln Continental""Chrysler Imperial""Dodge Challenger""AMC Javelin""Camaro Z28""Pontiac Firebird"
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,2) > 0;}
"Mazda RX4""Mazda RX4 Wag""Ford Pantera L""Ferrari Dino""Maserati Bora"
This last group is the smallest. What makes them similar?
At this point I got confused, so I put the data in a spreadsheet, added the cluster number, and sorted by cluster (and a bunch of other columns). After puzzling over the real-valued attributes, like a genuine difference in qsec, what stuck out plain as day was that groups 2 and 3 were each internally identical in vs and am, the two binary attributes. Group 1 (with the Porsche) is a mixed bag. It seems clear to me that those two attributes are dominating the clustering in a way they didn't in the un-normalised case. Why this happens can be seen if you take the ratio of sigma to average:
pdl> p $sigma/$avg
[ 0.3 0.28 0.53 0.46 0.15 0.3 0.099 1.1 1.2 0.2 0.57]
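Two columns leap out, and we can make PDL point at them for us; which returns the (0-based) indices where the spread exceeds the mean:
pdl> p which $sigma/$avg > 1
[7 8]
Indices 7 and 8 are data columns 8 and 9, i.e. vs and am.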
Compare this with the sum-of-squares results:
pdl> p $k{ss}
[
[ 11 2.4 0.95]
[ 3.8 -3.6e-15 1.6]
[ 2.8 3.8 2.5]
[ 1.7 2.7 8.6]
[ 13 2.1 1]
[ 7.4 7 0.61]
[ 9.8 2.3 1.6]
[ 3.8 -3.6e-15 0]
[ 15 0 0]
[ 9.4 -3.6e-15 2.3]
[ 5.7 3.5 5.1]
]
And this is where I got nerd-sniped. You can sympathise.
The zeroes, and the incredibly small values (e-15 means 10^-15 in computer-ese), belong to the clusters that have absolutely no variation in that dimension. Looking back at the csv file and its headings, columns 8 and 9 (vs and am) are binary, and column 10 is the number of gears; one cluster is uniform in the number of gears its members have.
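Rather than squinting at exponents, you can get PDL to flag the degenerate cells. A minimal sketch using approx (treating anything within 1e-9 of zero as zero, a tolerance I picked arbitrarily):
pdl> p approx $k{ss}, 0, 1e-9  # 1 wherever a cluster has no variation in that column
The 1s should land exactly on the binary columns and that uniform-gear cluster.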
A Question of Choice
Now comes the science bit: which attributes are important to us? And that boils down to what question we are trying to answer. (Since this is a toy example, I'm going to have to make one up just now, aren't I?)
We have the option to exclude the offending columns by not selecting them in the call to rcsv2D, or we could weight the dimensions by how much we feel they should affect the outcome. Because the attributes are different categories of measurement, some binary, some integer, some real, I could use a different normalisation for each. That would give me the opportunity to show off slicing, which is part of how PDL works so fast.
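Neither option is hard to sketch. Dropping vs and am just means leaving their column numbers out of the rcsv2D call; weighting is a one-line broadcast (the 0.1 here is plucked out of the air to illustrate, not a recommendation):
pdl> # Option 1: re-read without the binary columns (8 = vs, 9 = am)
pdl> $leancars = rcsv2D('mtcars_001.csv', [1 .. 7, 10, 11], {text2bad => 1, header => 1})
pdl> # Option 2: keep everything but shrink vs and am to a tenth of their weight
pdl> $w = pdl(1,1,1,1,1,1,1,0.1,0.1,1,1)
pdl> $wcars = $zcars * $w->dummy(0)  # dummy(0) broadcasts the 11 weights across all 32 cars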
I think I'll start by going back to some sensible normalisation (everything in the range [0,1]) with the realisation that I need to keep my eyes open. Then I'll be able to make a better decision on how I want to proceed.
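For the record, scaling to [0,1] is the same dummy trick again, this time with column minima and maxima (a sketch; minimum and maximum reduce over the first dimension just as average did):
pdl> $min = minimum $cars
pdl> $max = maximum $cars
pdl> $ncars = ($cars - $min->dummy(0)) / ($max - $min)->dummy(0)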
remembers to look carefully before crossing the street
Tune in Next Week