By Any k-Means Necessary

You want to get to know your data, questions like, can they be broken down into a simple set of classes. You don't know what these classes might be, so your task is clustering and you reach for one of the oldest clustering algorithms around k-means.

k-means is popular because it's simple to understand, converges fast, works in higher dimensions and gives you an answer. It's also usually the wrong choice unless you've already got nicely clustered data just waiting for you to guess k, the most appropriate number of clusters to answer your question. But it is a decent warm up exercise in becoming friends with your data set.

Where to start?

I think I'll start with k-means and use the results to select a subset of the data. I'll grab the cluster that has the tightest clustering or just grab half of the cluster members that are most similar and see if I can find a partition in that before moving on to look at the whole data set.

Then my Team Lead can pat me on the head, "Oooh, an algorithm! Wait 'til I show the Justice League!"

k-means is a type of unsupervised learning and has been around for a long time It is an example of iterative, distance-based clustering. To use it, we need a metric, a way to measure distance. I'm looking for a daily pattern, so I'll choose to sum the number of events in hourly bins and make each hour a dimension, ignoring any correlations between hours.

I'll normalize the inputs by the maximum value, so that one hour doesn't swamp all the others and use the Euclidean distance as a metric

The original algorithm is sensitive to the initial choice of centroids, so either run it many times until the cluster membership stabilizes or be clever about choosing the initial cluster centers.

Perl modules

I've been looking for an excuse to get PDL out of the box, so let's try out PDL::Stats::Kmeans and see how easy it is. There's a perception that PDL data structures are a bit of a learning curve. You could say the same about Pandas DataFrames, but for those who don't want to invest in the mental model, I will try out Algorithm::KMeans for comparison. It's nice to have options.

Other related modules are:

  • kd-Trees another nearest neighbour algorithm related to k-means

  • k-Means++ solves the problem of choosing the initial cluster centers

  • Expectation Maximization a soft clustering algorithm which gives you a measure of the strength of the membership and allows for membership of multiple clusters.

Missing values

What should happen when we have Bad Data? According to Witten et al. the distance between missing data dimensions should be as large as possible (but no larger, i.e. the worst case scenario). If both objects are missing that data then the distance should scale to the maximum (normalized to 1). If only one is missing, then the distance is either x or (1-x) whichever is larger.

Do these algorithms Do The Right Thing? PDL has support for bad values, but was it written into the k-means clustering code? As a side project, I should construct a couple of test cases and find out which would be easier than code-diving every algorithm I'm interested in.

Spoiler alert

Had a look at PDL::Stats::Kmeans last night and it took me longer to read the documentation than it did to write the clustering code. I was shocked at how easy and fast it was. It definitely needs blogging from the newcomer's POV.

Leave a comment

About Enkidu

user-pic I am a Freelance Scientist** and Perl is my Igor.