No more rhyming and I k-means it!

"... anybody wanna peanut?" - Fezzik, TPB

When last we saw our heroes, they had just applied PDL::Stats::Kmeans to a CSV file of car data with no thought regarding their own well-being.

In today's episode, we see them slice through data to identify clusters of cars, only to find they know less than they did before!

Read on, true believers!

So, how do we get the names?

We lost the names when reading in the data file using PDL::IO::CSV. I poked around the documentation in PDL::IO::Misc until I worked out that I could use rcols:

($names) = rcols 'mtcars_001.csv', { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };

This function reads column data and usually produces piddles, but if you specify PERLCOLS, it will give you back Perl array references for those columns. PERLCOLS => [ 0 ] tells it that I want the first column as a Perl arrayref, but you can specify any number of columns as a list. This is a CSV file, so we give it COLSEP => ',' to separate the columns and LINES => '1:' to ignore the single header line (by saying, "I want the lines from the second until the end").

rcols populates the variables in the list with the columns, so I could have pulled in the columns separately with

($names, $mpg, $cyl) = rcols 'mtcars_001.csv',
   { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };

You can get fancier input handling using rgrep. rcols is not a real CSV reader, so I have double quotes around my names that I'll have to deal with (or not).
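If you do want to deal with the double quotes, here is a minimal sketch in plain Perl. The three names are a made-up sample standing in for the arrayref that rcols hands back; nothing here is specific to PDL.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the arrayref returned via PERLCOLS.
my $names = [ '"Mazda RX4"', '"Datsun 710"', '"Volvo 142E"' ];

# Strip a leading and a trailing double quote from each name, in place.
# The loop variable aliases each array element, so the array itself changes.
s/^"|"$//g for @$names;

print join(', ', @$names), "\n";
```

This only handles quotes at the very start and end of a field, which is all this file seems to need. Anything with embedded quotes or commas would call for a proper CSV parser.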

Now who belongs together?

We have the cluster membership in a piddle. $k{cluster} is a 32x3 (or 3x32, I'll work it out sometime) piddle full of ones and zeroes. We have the names in the $names arrayref. Let's put them together to tell us which cars belong together.

Take a slice

pdl>use PDL::NiceSlice;
pdl>$cluster = $k{cluster};
pdl>p $cluster(:,0);

[
 [0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]
]

(:,0) is NiceSlice syntax for "give me every element along the first dimension, at index 0 of the second". (I am now starting to realize that "dimension" is somewhat ambiguous here, and I haven't the clarity to explain it other than to say that this slice is now 32x1 and represents membership of one cluster.) The idea behind this syntax is to make some convoluted specifications less confusing. It has way more power than you want in the beginning, but it makes the hard things possible. In this simple example, it's almost the same as the slicing syntax in core PDL. The equivalent command, as described in PDL::Indexing, is

pdl>p $k{cluster}->slice(':,0')

The NiceSlice syntax wasn't playing nice with the %k hash, which is why I assigned $k{cluster} to $cluster first. The explicit slice method has no such problems.

Kickin' doors and takin' $names

Let's try some brute force and ignorance. Go through all the indices of one of the clusters and print the name if the membership value is > 0.

pdl>$two = $k{cluster}->slice(':,2')
pdl>for (0 .. @$names -1) {
..{    > print $names->[$_] if $two->index($_) > 0;
..{    > }

"Mazda RX4""Mazda RX4 Wag""Datsun 710""Merc 240D""Merc 230""Merc 280"
"Merc 280C""Fiat 128""Honda Civic""Toyota Corolla""Toyota Corona"
"Fiat X1-9""Porsche 914-2""Lotus Europa""Ferrari Dino""Volvo 142E"

Hey, there's a 2D version of the index command, index2d, which means that we can skip the slicing. This is a compact form.

pdl>$c = $k{cluster}
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,0) > 0;}

"Hornet 4 Drive""Valiant""Merc 450SE""Merc 450SL""Merc 450SLC"
"Dodge Challenger""AMC Javelin"

pdl>for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,1) > 0;}

"Hornet Sportabout""Duster 360""Cadillac Fleetwood"
"Lincoln Continental""Chrysler Imperial""Camaro Z28"
"Pontiac Firebird""Ford Pantera L""Maserati Bora"

pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,2) > 0;}

"Mazda RX4""Mazda RX4 Wag""Datsun 710""Merc 240D""Merc 230"
"Merc 280""Merc 280C""Fiat 128""Honda Civic""Toyota Corolla"
"Toyota Corona""Fiat X1-9""Porsche 914-2""Lotus Europa"
"Ferrari Dino""Volvo 142E"
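As an aside, the loops above test one element at a time. Here is a sketch of letting PDL find the member indices instead, using which and list from core PDL. It assumes the same layout as above (cars along the first dimension, clusters along the second), and the names and the toy 4x3 membership piddle are invented for illustration; they are not the real $k{cluster}.

```perl
use strict;
use warnings;
use PDL;

# Toy stand-ins (assumptions): 4 cars, 3 clusters, laid out like $k{cluster}
# with cars along the first dimension and clusters along the second.
my $names   = [ 'Car A', 'Car B', 'Car C', 'Car D' ];
my $cluster = pdl([ [1, 0, 0, 1],
                    [0, 1, 0, 0],
                    [0, 0, 1, 0] ]);

my @members;
for my $c (0 .. $cluster->dim(1) - 1) {
    # "($c)" in the slice string drops the second dimension, leaving a 1-D row.
    my $idx = which($cluster->slice(":,($c)"));
    # list() turns the index piddle into a plain Perl list for an array slice.
    $members[$c] = [ @{$names}[ $idx->list ] ];
    print "cluster $c: ", join(', ', @{ $members[$c] }), "\n";
}
```

With the toy data this puts Car A and Car D in cluster 0, Car B in cluster 1 and Car C in cluster 2. Keeping the members in @members rather than only printing them makes it easier to do something else with each group later.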

Now start wondering what criminal mastermind would put the Porsche in the same cluster with a Volvo, a Fiat and a Honda Civic. Hmmm, it could be the work of the Puzzler!

Or maybe I really do need to normalize.

Tune in next week, True Believers!

5 Comments

I really appreciate these PDL articles of yours. Entertaining and educational and bringing a practical insight into a module that is powerful but with few such real-life examples that one can see. If you wrote a book on it, I would definitely buy it.

I agree with Saif that it is refreshing to see new examples on how to use PDL and a book would be something really cool. Nevertheless, these articles are valuable for me as they do work on Windows and some of the examples from "PDL Book" do not.

I have encountered problems trying to visualize results with PDL::Graphics::Simple as written in perlmonks: https://perlmonks.org/?node_id=11111203

About Enkidu

I am a Freelance Scientist** and Perl is my Igor.