k-Means
k-Means-er
As we take another lap around the k-Means race track, the Porsche 914-2 and Volvo 142E are still neck and neck. This time we'll try a straightforward normalisation that linearly scales each column to the range [0,1] (i.e. (x - min) / (max - min)) and see if they still end up in the same cluster.
Curiosity finally got the better of me, so I looked up both of those models and they are actually quite similar cars from the early 1970s. Would I have dug so deep if I hadn't had that misconception about what I thought the clustering should have produced? Probably not.
Gentlefolk, start your engines!
pdl> use PDL::IO::CSV ':all'
pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], {text2bad => 1, header => 1, debug => 1});
pdl> p $cars->dims
32 11
pdl> ($names) = rcols 'mtcars_001.csv', { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };
pdl> $PDL::doubleformat = '%4.2g'
pdl> ($min, $max, $min_ind, $max_ind) = minmaximum $cars
pdl> p $min, $max
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]
pdl> $range = $max - $min
pdl> p $range
pdl> $ncars = ( $cars - $min->dummy(0) ) / $range->dummy(0)
[
[ 0.45 0.45 0.53 0.47 0.35 0.33 0.17 ...
[ 0.5 0.5 0 0.5 1 0.5 1 ...
[ 0.22 0.22 0.092 0.47 0.72 0.38 0.72 ...
[ 0.2 0.2 0.14 0.2 0.43 0.19 0.68 ...
[ 0.53 0.53 0.5 0.15 0.18 0 0.21 ...
[ 0.28 0.35 0.21 0.44 0.49 0.5 0.53 ...
[ 0.23 0.3 0.49 0.59 0.3 0.68 0.16 ...
[ 0 0 1 1 0 1 0 ...
[ 1 1 1 0 0 0 0 ...
[ 0.5 0.5 0.5 0 0 0 0 ...
[ 0.43 0.43 0 0 0.14 0 0.43 ...
]
There we go, nice neat data ranges all ready for clustering (or presenting to a neural network, whatever floats your boat). You can see the binary attributes sticking out in rows 8 and 9. Those and the rows below them are the ones we'll pay attention to this time.
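If you want to reassure yourself that the scaling behaved, a quick sanity check (just a sketch, using the piddles from the session above) is to print the per-attribute extremes of $ncars:
pdl> p $ncars->minimum    # one value per attribute, all should be 0
pdl> p $ncars->maximum    # and these should all be 1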
Are you ready to cluster?
pdl> use PDL::Stats
pdl> %k = $ncars->kmeans # not $cars or you'll wonder why
CNTRD => Null
FULL => 0
NCLUS => 3
NSEED => 32
NTRY => 5
V => 1
overall ss: 39.6893183665554
iter 0 R2 [ 0.6 0.65 0.66 0.44 0.52]
iter 1 R2 [0.69 0.67 0.66 0.46 0.62]
iter 2 R2 [0.69 0.68 0.66 0.46 0.69]
iter 3 R2 [0.69 0.69 0.66 0.46 0.69]
You can see how, for each iteration, the value of R2 never decreases, although not every try hits the high values. k-Means is dependent on the starting positions of the centroids, so using a number of tries with randomly chosen initial positions helps to eliminate bad clusterings. The best value for R2 with the [0,1] normalisation is 0.69, whereas the best found for the z-scores last time was 0.63. Not too different, but maybe a little better?
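If you want to give the random restarts a better chance of finding the best clustering, the options echoed above (NCLUS, NTRY, NSEED, V and friends) can be passed in explicitly. Something like this sketch, where the numbers are entirely arbitrary:
pdl> %k = $ncars->kmeans({ NCLUS => 3, NTRY => 20, V => 0 })    # more tries, quieter output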
Look at cluster number one's members
pdl> $c = $k{cluster}
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,0) > 0;}
"Datsun 710""Hornet 4 Drive""Valiant""Merc 240D""Merc 230""Merc 280""Merc 280C""Fiat 128""Honda Civic""Toyota Corolla""Toyota Corona""Fiat X1-9""Lotus Europa""Volvo 142E"
WAIT!?! Where did the Porsche go?
This cluster is identical to the first cluster found using the z-scores normalisation, except for the Porsche. After I've gone to all that trouble to prove to myself that the Porsche and the Volvo are really quite similar, and that I should rid myself of preconceptions, now they aren't in the same cluster? Someone's messing with me.
Alright, where did it go?
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,1) > 0;}
"Mazda RX4""Mazda RX4 Wag""Porsche 914-2""Ford Pantera L""Ferrari Dino""Maserati Bora"
Ahh, in with the high-powered cars
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,2) > 0;}
"Hornet Sportabout""Duster 360""Merc 450SE""Merc 450SL""Merc 450SLC""Cadillac Fleetwood""Lincoln Continental""Chrysler Imperial""Dodge Challenger""AMC Javelin""Camaro Z28""Pontiac Firebird"
And this cluster is the same in both normalisations.
We are still clustering by binary attributes. I could reduce their effect by scaling the columns by arbitrary weights, but I still don't know the data well enough to choose the weights. The Porsche is flitting back and forth between the two groups. I would expect it to be near the boundary of two groups and sensitive to the normalisation chosen.
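Just to make the weighting idea concrete, here's roughly what it might look like. This is only a sketch: the 0.25 is plucked out of thin air (which is exactly the problem), and columns 7 and 8 are the binary vs and am attributes:
pdl> $w = ones(11)                   # one weight per attribute
pdl> $w(7:8) .= 0.25                 # arbitrarily shrink the binary vs and am columns
pdl> $wcars = $ncars * $w->dummy(0)  # same broadcasting trick as before
pdl> %kw = $wcars->kmeans            # cluster on the weighted data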
Slicing through the confusion
Are you getting curious as to the meaning of the columns yet? Here's an explanation of R's mtcars dataset. It was used as a classroom exercise to explore the relationships between a set of variables and fuel efficiency, but today I think I'll just see which cars group together in terms of power, speed and economy. Without thinking too hard about it, I'll choose mpg, disp, hp, drat, wt and qsec. So, how would I go about selecting only these columns? NiceSlice syntax is easy for simple ranges, but I want column 1 as well as columns 3 through 7. Well, NiceSlice also accepts piddles, like this:
pdl> $selected = pdl(0,2..6)
pdl> %k2 = $ncars(:,$selected)->kmeans
CNTRD => Null
FULL => 0
NCLUS => 3
NSEED => 32
NTRY => 5
V => 1
overall ss: 12.0466206624738
iter 0 R2 [0.65 0.63 0.63 0.63 0.54]
iter 1 R2 [0.65 0.63 0.65 0.64 0.65]
iter 2 R2 [0.65 0.63 0.65 0.66 0.65]
iter 3 R2 [0.65 0.63 0.65 0.67 0.65]
iter 4 R2 [0.65 0.63 0.65 0.68 0.65]
The output looks pretty much the same. We may have lost a smidge of R2, but the clusters have changed, and p $k2{ss} will confirm that you've only clustered on 6 attributes.
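To see exactly how much R2 we gave up, and to confirm the narrower clustering, a quick sketch (assuming the R2 and ss keys returned by kmeans):
pdl> p $k{R2}, $k2{R2}    # best R2 from the full run and the 6-attribute run
pdl> p $k2{ss}->dims      # the attribute dimension should now be 6, not 11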
How does this make the clusters look?
pdl> $c2 = $k2{cluster}
pdl> for (0 .. $#{$names}){p $names->[$_] if $c2->index2d($_,0) > 0;}
"Fiat 128""Honda Civic""Toyota Corolla""Fiat X1-9""Porsche 914-2""Lotus Europa"
"Mazda RX4""Mazda RX4 Wag""Datsun 710""Hornet 4 Drive""Valiant""Merc 240D""Merc 230""Merc 280""Merc 280C""Toyota Corona""Ferrari Dino""Volvo 142E"
"Hornet Sportabout""Duster 360""Merc 450SE""Merc 450SL""Merc 450SLC""Cadillac Fleetwood""Lincoln Continental""Chrysler Imperial""Dodge Challenger""AMC Javelin""Camaro Z28""Pontiac Firebird""Ford Pantera L""Maserati Bora"
The Porsche, Volvo and Maserati are all in separate groups now.
Satisfied?
Seeing the wood for the trees
We never checked if the data was suitable for this clustering method. Now that you've become familiar with the mechanics of using this algorithm, have a look at the drawbacks of using k-means. It will make more sense, and you'll appreciate when to use it and when to look for something else.
Looks like I should examine the variance of the attributes (that might help make the scaling choices) and finally get around to visualising the clusters. It's the only way to get my head around who's together. Remember also that using three clusters was an arbitrary choice.
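Here's a head start on that variance check, since PDL::Stats is already loaded. A sketch only: var comes from PDL::Stats::Basic and works along the first dimension, i.e. across the 32 cars:
pdl> p $ncars->var    # per-attribute variance of the normalised data
pdl> p $cars->var     # and of the raw data, for comparison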
and the scenery changes from black and white to colour
Oh, Toto, I don't think we're in 2 dimensions any more. The Wizard of Visualisation will give us what we need.
It seems that you are getting some meaningful clusters with kmeans! From your post it looks like you have the knowledge and drive to decipher PDL::Stats::Kmeans. Reading this, I thought a little about your previous post, where you wrote that you love the idea of a new PDL Book. I guess it does not need to be something big or ambitious. Just gluing your posts together would make a pretty interesting ride to kmeans clustering.
I've tried to use k-means before and came off the worse, so you could see this as Rocky II. This time I'm going the distance!
I think Dave Cross suggested to me that an e-book could be like a short story, only as long as it needs to be to complete an idea. Or you could think of it as an Agile approach to authoring. I'm just trying to keep to a schedule of one post every 2 weeks and an investment of 3 hours' effort (which explains the glib and haphazard style :)