Science! Archives

Working with CSV files using PDL

Ahh, the venerable comma-separated values format, beloved of data scientists.

I grabbed a couple of CSV files from Matt Pettis’ csvkit talk to prepare for the data file that I should be getting my mitts on. Then I tripped and bumped my way through the documentation for PDL::IO::CSV and metaphorically skinned my knees, as you do when you don’t read too carefully.

By Any k-Means Necessary

You want to get to know your data and ask questions like: can it be broken down into a simple set of classes? You don't know what these classes might be, so your task is clustering, and you reach for one of the oldest clustering algorithms around: k-means.

k-means is popular because it's simple to understand, converges fast, works in higher dimensions and gives you an answer. It's also usually the wrong choice unless you already have nicely clustered data just waiting for you to guess k, the number of clusters most appropriate to your question. But it is a decent warm-up exercise in becoming friends with your data set.
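
To make "simple to understand" concrete, here is a minimal sketch of the usual k-means loop (Lloyd's algorithm), written in plain Python as a stand-in rather than in PDL. The function name `kmeans`, its `iterations` and `seed` parameters, and the toy points are all made up for illustration; this is not the code from the post.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Made-up toy data: two loose blobs in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Note that you still have to supply k yourself, and a different `seed` can land on a different answer, which is part of why the result deserves suspicion.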

About Enkidu

I am a Freelance Scientist** and Perl is my Igor.