## It's beginning to look a lot like k-Means 🎵

... everywhere you go! ☃ ☃ ☃

Continuing with the plan of clustering data in Perl (a form of unsupervised learning), I'm going to start with PDL::Stats::Kmeans and see how far I can get.

Let's just plunge in and use the module, and read the documentation afterwards to figure out what it all **k-means**. (sorry)

If you read the last post, you'll see how to import a bunch of cars into a piddle, and then we can get started. If for some reason you can't get the CSV file, just create some random data in the REPL/shell with

```
pdl> $c = random(32,11)
```

which gives you the same shape of data with values in the range [0,1]. If you wanted something a little more like the cars CSV, you could do this

```
pdl> $integer = ceil( $c * 100 )
pdl> p $integer
```

which looks nicer

```
[
[65 15 37 51 ...
[53 24 69 87 ...
[ 3 74 31 71 ...
[78 97 67 14 ...
...
]
```

but you'll find that one of the possible steps in the clustering process is normalizing the data to the range [0,1].
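
If you want to do that scaling by hand, here's a minimal sketch, assuming (as with our data) that observations run along the first dimension, so `minimum` and `maximum` collapse it to per-column values:

```
pdl> ($min, $max) = ($integer->minimum, $integer->maximum)   # per-column min and max, 11 values each
pdl> $scaled = ($integer - $min->dummy(0)) / ($max - $min)->dummy(0)
pdl> p $scaled->minimum, " ", $scaled->maximum               # every column now spans [0,1]
```

(Watch out for a constant column, which would give a zero denominator.)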

So, assuming that you've got the file **mtcars_001.csv**, we're going to remove the single header line, pull in the columns with numeric data (1 .. 11), convert any other text values to **BAD**, and turn up the verbosity on the output of reading the file. Then we load the **Stats** module that has the k-means code and, as promised, *just do it*.

```
pdl> use PDL::IO::CSV ':all'
pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], {text2bad => 1, header => 1, debug => 1});
pdl> use PDL::Stats;
pdl> %k = $cars->kmeans
CNTRD => Null
FULL => 0
NCLUS => 3
NSEED => 32
NTRY => 5
V => 1
overall ss: 623387.464844874
iter 0 R2 [0.79620262 0.76555083 0.78100136 0.78100136 0.75526545]
iter 1 R2 [0.83991216 0.76555083 0.83991216 0.83991216 0.75526545]
iter 2 R2 [ 0.8534725 0.76555083 0.8534725 0.8534725 0.75526545]
```

## OK, what just happened?

The first bit of the output just gives you back the options that were set to their default values:

- **NCLUS**, which is **k**, the number of clusters to find
- **NTRY**, the number of random seeding trials
- **V** for verbose, which prints simple status messages
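
Any of these can be overridden by passing a hashref to `kmeans`; a quick sketch, with values picked purely for illustration:

```
pdl> %k = $cars->kmeans( { NCLUS => 4, NTRY => 10, V => 0 } )
```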

You can read the documentation for the rest. For the output you get:

- the overall Sum of Squares (ss), a measure of the "goodness" of the classification
- each iteration with its R² values

Examining the R² values for different values of **k** can help us choose the number of clusters, but I'll leave that for later.

So we now have 3 clusters. What do we know about them?

```
pdl> p %k
n [7 16 9] R2 0.853472495760015 ss
[
[ 29.748571 375.4 76.622222]
[ 5.7142857 13.75 0]
[ 5473.0971 17923.369 27037.556]
[ 6321.4286 14425.75 19490.889]
[0.15857143 1.5681 1.3828]
[0.47163571 5.73901 6.5863242]
[ 9.1904857 48.295744 14.503622]
[ 1.4285714 3 0]
[ 0 3.4375 1.5555556]
[ 0 3.75 6.2222222]
[ 4.8571429 33.9375 24]
]
cluster
[
[0 0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]
[1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1]
[0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0]
]
centroid
[
[ 17.014286 24.5 14.644444]
[ 7.4285714 4.625 8]
[ 276.05714 122.29375 388.22222]
[ 150.71429 96.875 232.11111]
[ 2.9942857 4.0025 3.3433333]
[ 3.6014286 2.518 4.1615556]
[ 18.118571 18.543125 16.404444]
[0.28571429 0.75 0]
[ 0 0.6875 0.22222222]
[ 3 4.125 3.4444444]
[ 2.1428571 2.4375 4]
]
```

The keys to this hash are **n**, **R2**, **ss**, **cluster** and **centroid**.

**n** is the number of members in each cluster, and the **R²** value was mentioned above. **cluster** holds the membership masks, one row of 32 flags per cluster (3x32), and **centroid** gives the 3 centroids, one for each 11-dimensional cluster.
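
As a rough sketch of how you might use that mask (my own example, not from the module's docs), here's pulling the members of the first cluster back out of `$cars`:

```
pdl> $idx = which( $k{cluster}->slice(':,(0)') )   # row indices of the cars in cluster 0
pdl> p $cars->dice( $idx )                         # their 7 rows of the original data
```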

## How does **R²** change with **k**?


Simple and straightforward:

```
pdl> for my $i (2 .. 10) {
pdl>     %k = $cars->kmeans( { NCLUS => $i, V => 0 } );
pdl>     print $i, "\t", $k{R2}, "\n";
pdl> }
```

and yet it doesn't have that clarity you crave. I've never been truly satisfied with this method for choosing **k**, but I haven't gotten around to finding a better one. Yes, other algorithms address this issue, but that doesn't help us now.
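
One rough heuristic, my own habit rather than anything the module prescribes, is to look at the *gain* in R² from one **k** to the next and pick the point where it flattens out:

```
pdl> @r2 = map { my %k = $cars->kmeans( { NCLUS => $_, V => 0 } ); $k{R2} } 2 .. 10;
pdl> $r2 = pdl(@r2);                              # R2 for k = 2 .. 10
pdl> p $r2->slice('1:-1') - $r2->slice('0:-2')    # successive gains; look for the elbow
```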

Here's a rough and ready plot of my R² values:

Maybe I should have normalized?

## For next time

I'll have to finish this after the holiday, but what I'd like to get done is:

- showing cluster membership
- normalization
- displaying a scatter plot of the clusters over 2 dimensions with their centroids
- getting some plain old stats on each of the dimensions (min, max, variance, etc.)
