As we take another lap around the k-means race track,
the Porsche 914-2 and Volvo 142E
are still neck and neck.
This time we'll try a straightforward normalisation
that linearly scales all values to the range [0,1]
and see if the two still end up in the same cluster.
Short post this time because I got
nerd-sniped
looking at the data. The fun part is that you quickly
move from thinking about how to get your results to trying
to work out what they mean.
Forget why I started down this road. Right now, we
are seeking the answer to Lewis Carroll's famous question:
How is a Porsche 914-2 like a Volvo 142E?
(well, that's what it was in the first draft)
A quick summary for those who have just joined us.
pdl> use PDL::IO::CSV ':all'
pdl> # columns 1..11 skip the car names in column 0; text2bad turns
pdl> # non-numeric cells into BAD values; header skips the first line
pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], {text2bad => 1, header => 1, debug => 1});
pdl> p $cars->dims
32 11
You got 32 11, right?
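With $cars in hand, the [0,1] scaling promised at the top is only a couple of lines of broadcasting. Here's a minimal sketch (the $min, $max and $scaled names are mine): minimum and maximum reduce over the first dimension, so they hand back per-column values, and dummy(0) reshapes those so the arithmetic broadcasts back across all 32 rows.
pdl> $min = $cars->minimum    # per-column minima, an 11-element pdl
pdl> $max = $cars->maximum    # per-column maxima
pdl> $scaled = ($cars - $min->dummy(0)) / (($max - $min)->dummy(0))
pdl> p $scaled->minimum, $scaled->maximum    # each column now runs from 0 to 1
One caveat: a constant column has $max equal to $min and this divides by zero, but none of the mtcars columns are constant, so we get away with it.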
The title is clickbait. I ran short of time this week and am ~~recycling~~^W consolidating comments, replies and thoughts. Let's talk about Books!
I would love a new PDL Book, one that's completely different from the original to maximize the surface of engagement with a new audience. As a "sequel", it would have the advantage of being able to refer the reader to the first book for longer explanations and could jump right into how to solve significant problems. brian d foy has just finished his Mojolicious book, so I bet he's got loads of free time on his hands. (although I remember him in the middle of writing it in 2018, so you may have to wait a bit)
When
last we saw our heroes,
what they thought was
the brink of success turned out to be the precipice
of hasty interpretation, and now they are dangling
for dear life from the branch of normalization!
How's that for a tortured metaphor?
If you use raw values for your
k-means clustering,
dimensions with large values or large ranges can
swamp smaller dimensions and skew your clusters.
In mtcars, for example, displacement runs from about 71
to 472 cubic inches while the automatic/manual flag is
only ever 0 or 1, so raw Euclidean distances are
essentially a displacement contest.
Normalization tries to bring everything
into the same range, usually [0,1], although how you
choose to transform the ranges also matters.
There is not always one best way to do it, so,
as usual, get familiar with your dataset and use
your judgement.
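To make "your choices matter" concrete: min-max scaling to [0,1] is only one option. The other usual suspect is standardising each column to zero mean and unit variance (z-scores), which treats outliers differently: min-max squeezes everything between the two most extreme cars, while z-scores let outliers stick out past 1. A sketch (the $z name is mine), assuming the same $cars pdl and using stdv from PDL::Stats::Basic, which, like average, works over the first dimension:
pdl> use PDL::Stats::Basic
pdl> $z = ($cars - $cars->average->dummy(0)) / $cars->stdv->dummy(0)
Either way, what you feed k-means changes the distances it sees, and therefore the clusters you get back.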
"... anybody wanna peanut?" - Fezzik, TPB
When
last we saw our heroes,
they had just applied
PDL::Stats::Kmeans
to a CSV file of car data with no regard for their own well-being.
In today's episode, we see them slice
through the data to identify clusters of cars, only to find they know less than they did before!
Read on, true believers!
... everywhere you go! ☃ ☃ ☃
Continuing from the
intention
of clustering data in Perl (a form of unsupervised learning), I'm going to start with
PDL::Stats::Kmeans
and see how far I can get.
Let's just plunge in, use the module,
and read the documentation afterwards to figure out what it all k-means. (sorry)
Ahh, the venerable comma-separated values format, beloved of data scientists.
I grabbed a couple of CSV files from Matt Pettis'
csvkit talk
to prepare for the datafile that I should be getting my mitts on, and
tripped and bumped my way through the documentation for
PDL::IO::CSV,
metaphorically skinning my knees,
as you do when you don't read too carefully.
You want to get to know your data: can it
be broken down into a simple set of classes?
You don't know what these classes might be, so your
task is clustering, and you reach for one of the
oldest clustering algorithms around: k-means.
k-means is popular because it's simple to understand,
converges fast, works in higher dimensions
and gives you an answer.
It's also usually the wrong choice unless you've
already got nicely clustered data just waiting for you
to guess k, the most appropriate number of clusters
to answer your question. But it is a decent warm-up
exercise in becoming friends with your data set.
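And since the warm-up is the point, here is roughly what it looks like with PDL::Stats::Kmeans on the scaled data from earlier. The module expects observations along the first dimension, which is exactly what rcsv2D gave us. kmeans returns a hash: cluster holds a 0/1 membership column per cluster and R2 is the proportion of variance the clustering accounts for. Treat this as a sketch; NCLUS => 3 is just my guess at k for illustration.
pdl> use PDL::Stats::Kmeans
pdl> %k = $scaled->kmeans( { NCLUS => 3 } )    # 3 clusters is a guess, not gospel
pdl> p $k{cluster}    # 32 rows, one 0/1 column per cluster
pdl> p $k{R2}         # how much variance the clustering accounts for
If the Porsche and the Volvo share a column of that cluster mask, they are still in the same cluster after normalisation.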