No more rhyming and I k-means it!

By Enkidu on January 9, 2020 2:29 PM under Numerical Perl, Science!

"... anybody wanna peanut?" - Fezzik, TPB

When last we saw our heroes, they had just applied PDL::Stats::Kmeans to a CSV file of car data with no thought regarding their own well-being.

In today's episode, we see them slice through data to identify clusters of cars, only to find they know less than they did before!

Read on, true believers!

So, how do we get the names?

We lost the names when reading in the data file using PDL::IO::CSV. I poked around the documentation in PDL::IO::Misc until I worked out that I could use rcols:

($names) = rcols 'mtcars_001.csv', { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };

This function reads column data and usually produces piddles, but if you specify PERLCOLS, it gives you back Perl array references for those columns. PERLCOLS => [ 0 ] tells it that I want the first column as a Perl arrayref; you can list any number of columns there. This is a CSV file, so we give it COLSEP => ',' to separate the columns and LINES => '1:' to skip the single header line (by saying, "I want the lines from the second until the end").

rcols populates the variables in the list with the columns, so I could have pulled in the columns separately with

($names, $mpg, $cyl) = rcols 'mtcars_001.csv',
   { COLSEP => ',', LINES => '1:', PERLCOLS => [ 0 ] };

You can get fancier input handling using rgrep. rcols is not a CSV reader, though, so I have double quotes around my names that I'll have to deal with (or not).
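If the quotes do get in the way, they are easy to strip once the names are in a Perl arrayref. A minimal sketch in plain Perl, assuming each name is wrapped in exactly one pair of double quotes (as in the printed output further down); the sample list and the @clean variable are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for the arrayref that rcols returns via PERLCOLS (sample values only)
my $names = [ '"Mazda RX4"', '"Datsun 710"', '"Porsche 914-2"' ];

# Work on a copy of each name and remove one leading and one trailing double quote
my @clean = map { my $n = $_; $n =~ s/^"|"$//g; $n } @$names;

print join(', ', @clean), "\n";
```

This leaves $names untouched, so the quoted originals are still around if you need them.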

Now who belongs together?

We have the cluster membership in a piddle. $k{cluster} is a 32x3 (or 3x32, I'll work it out sometime) piddle full of ones and zeroes. We have the names in the $names arrayref. Let's put them together to tell us which cars belong together.

Take a slice

pdl>use PDL::NiceSlice;
pdl>$cluster = $k{cluster};
pdl>p $cluster(:,0);

[
 [0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]
]

(:,0) is NiceSlice syntax for "give me every element along the first dimension, at index 0 of the second". (Now starting to realize that "dimension" is somewhat ambiguous here, and I haven't the clarity to explain it other than to say that this slice is now 32x1, which represents the membership of one cluster.) The idea behind this syntax is to make convoluted slice specifications less confusing. It has way more power than you want in the beginning, but it makes the hard things possible. In this simple example, it's almost the same as the slicing syntax in core PDL. The equivalent command from PDL::Indexing is:

pdl>p $k{cluster}->slice(':,0')

The NiceSlice syntax wasn't playing nice with the %k hash, which is why I assigned $k{cluster} to a plain scalar first. The explicit slice method has no such problem.
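To pin down what "dimension" means for these slices, it helps to print the dims. A small sketch using the explicit slice method, with a made-up 4x3 piddle standing in for the 32x3 membership matrix (the values are invented, not the real cluster output):

```perl
use strict;
use warnings;
use PDL;

# Made-up stand-in: 4 cars along dim 0, 3 clusters along dim 1
my $c = pdl([ [1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0] ]);

my @full = $c->dims;                   # size of the whole piddle
my @kept = $c->slice(':,0')->dims;     # second dimension kept, with size 1
my @drop = $c->slice(':,(0)')->dims;   # parentheses drop that dimension

print "full: @full\nkept: @kept\ndrop: @drop\n";
```

So ':,0' on the real data is 32x1, and writing the index as '(0)' would give a plain 32-element vector instead.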

Kickin' doors and takin' $names

Let's try some brute force and ignorance. Go through all the indices of one of the clusters and print the entry from $names if the cluster value is > 0.

pdl>$two = $k{cluster}->slice(':,2')
pdl>for (0 .. @$names -1) {
..{    > print $names->[$_] if $two->index($_) > 0;
..{    > }

"Mazda RX4""Mazda RX4 Wag""Datsun 710""Merc 240D""Merc 230""Merc 280"
"Merc 280C""Fiat 128""Honda Civic""Toyota Corolla""Toyota Corona"
"Fiat X1-9""Porsche 914-2""Lotus Europa""Ferrari Dino""Volvo 142E"

Hey, there's a 2D version of the index command which means that we can skip the slicing. This is a compact form.

pdl>$c = $k{cluster}
pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,0) > 0;}

"Hornet 4 Drive""Valiant""Merc 450SE""Merc 450SL""Merc 450SLC"
"Dodge Challenger""AMC Javelin"

pdl>for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,1) > 0;}

"Hornet Sportabout""Duster 360""Cadillac Fleetwood"
"Lincoln Continental""Chrysler Imperial""Camaro Z28"
"Pontiac Firebird""Ford Pantera L""Maserati Bora"

pdl> for (0 .. $#{$names}){p $names->[$_] if $c->index2d($_,2) > 0;}

"Mazda RX4""Mazda RX4 Wag""Datsun 710""Merc 240D""Merc 230"
"Merc 280""Merc 280C""Fiat 128""Honda Civic""Toyota Corolla"
"Toyota Corona""Fiat X1-9""Porsche 914-2""Lotus Europa"
"Ferrari Dino""Volvo 142E"
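The three loops above can also be folded into one by letting PDL find the non-zero positions with which. This is only a sketch under assumptions: the 5x2 membership matrix and the car names are placeholders I made up, and %members is a hypothetical hash for collecting the results.

```perl
use strict;
use warnings;
use PDL;

# Made-up membership matrix: dim 0 runs over cars, dim 1 over clusters,
# the same layout that index2d($_, $cluster) assumes in the loops above
my $cluster = pdl([ [1, 0, 0, 1, 0],
                    [0, 1, 1, 0, 1] ]);
my $names = [ 'car A', 'car B', 'car C', 'car D', 'car E' ];  # placeholder names

my %members;  # hypothetical: cluster number => arrayref of names
for my $i (0 .. $cluster->dim(1) - 1) {
    # ($i) in parentheses drops the cluster dimension, leaving a 1-D row;
    # which() returns the positions of its non-zero elements
    my @idx = $cluster->slice(":,($i)")->which->list;
    $members{$i} = [ @{$names}[@idx] ];
    print "cluster $i: ", join(', ', @{ $members{$i} }), "\n";
}
```

The arrayref slice @{$names}[@idx] does the lookup that the if inside the loop was doing one element at a time.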

Now start wondering what criminal mastermind would put the Porsche in the same cluster with a Volvo, a Fiat and a Honda Civic. Hmmm, it could be the work of the Puzzler!

Or maybe I really do need to normalize.

Tune in next week, True Believers!

Tagged as: Data Science, Perl Data Language

5 Comments

Saif | January 10, 2020 6:06 AM

I really appreciate these PDL articles of yours. Entertaining and educational and bringing a practical insight into a module that is powerful but with few such real-life examples that one can see. If you wrote a book on it, I would definitely buy it.

zubenel | January 10, 2020 8:50 PM

I agree with Saif that it is refreshing to see new examples on how to use PDL and a book would be something really cool. Nevertheless, these articles are valuable for me as they do work on Windows and some of the examples from "PDL Book" do not.

Enkidu replied to comment from Saif | January 14, 2020 8:20 AM

I find the easiest way to blog is to have an editor window open and write what I think and see as it happens which gives it that personal touch. I also like the serialized style that byterock and many others use here. As I said in the first post, I'm just trying to figure out how to do this for myself, but I also realize how much I've learned from other people's posts, so it seems right to share.

As for a book, many posts make a chapter and enough chapters make a book. Dave Cross has been promoting a space for self-publishing with https://perlschool.com/ so this is within the realm of possibility. Would you like to join in?

Enkidu replied to comment from zubenel | January 14, 2020 2:03 PM

I'd say that if you've found some examples from the PDL Book that don't work on Windows, you should raise an issue on https://github.com/PDLPorters/pdl-book. I was very surprised at how fast some people respond to bug reports.

That being said, a chapter on how to make PDL work with Windows would make any book that much nicer. Do you fancy sharing some of your pain?

zubenel replied to comment from Enkidu | January 27, 2020 4:52 AM

I have encountered problems trying to visualize results with PDL::Graphics::Simple as written in perlmonks: https://perlmonks.org/?node_id=11111203

About Enkidu

I am a Freelance Scientist** and Perl is my Igor.
