January 2020 Archives

k-means: a brief interlude into Data Wrangling

When last we saw our heroes, what they thought was the brink of success turned out to be the precipice of hasty interpretation and now they are dangling for dear life on the branch of normalization! how's that for tortured metaphor!

If you use raw values for your k-means clustering, dimensions with large values or large ranges can swamp smaller dimensions and skew your clusters. The process of normalization tries to bring everything into the same range, usually [0,1], although your choices on how to transform the ranges are also significant. There is not always one best way to do it and, as usual, get familiar with your dataset and use your judgement.

No more rhyming and I k-means it!

"... anybody wanna peanut?" - Fezzik, TPB

When last we saw our heroes, they had just applied PDL::Stats::Kmeans to a CSV file of car data with no thought regarding their own well-being.

In today's episode, we see them slice through data to identify clusters of cars, only to find they know less than they did before!

Read on, true believers!

About Enkidu

user-pic I am a Freelance Scientist** and Perl is my Igor.