## k-means: a brief interlude into Data Wrangling

When
last we saw our heroes,
what they thought was
the brink of success turned out to be the precipice
of hasty interpretation and now they are dangling
for dear life on the branch of normalization!
*how's that for tortured metaphor!*

If you use raw values for your
k-means clustering,
dimensions with large values or large ranges can
swamp smaller dimensions and skew your clusters.
The process of normalization tries to bring everything
into the same range, usually [0,1], although your choices
on how to transform the ranges are also significant.
There is not always one *best* way to do it and,
as usual, get familiar with your dataset and use
your judgement.

## Vital statistics

First, let's get to grips with some basic statistics along the dimensions of the piddle. We have 32 cars with 11 measurements: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb. (no, I don't know what some of these are and I suppose that's the burden of a data scientist upon receiving the data in its wild and untamed state)

```
pdl> p dims $cars
32 11
```

You might head for
PDL::Stats::Basic
but that starts with standard deviations and progresses on to correlation tables.
For basic, *basic* statistics, you want to look in
PDL::Ufunc
for sum, average, median, min, max, etc.

```
pdl> $avg = average $cars
pdl> p $avg
[ 20.090625 6.1875 230.72188 146.6875 3.5965625 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125]
pdl>p dims $avg
11
```

Ok, that looks sane compared with the first line of the data file and we have 11 averages of 11 dimensions. If we'd got it the other way around and found that we had 32 averages, we could do this instead

```
pdl> $avg = average $cars->xchg(0,1)
```

Looks like getting the minimum and maximum values are just as easy.

```
pdl> $min = minimum $cars
pdl> $max = maximum $cars
pdl> p $min
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
pdl> p $max
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]
```

There's even a convenience function that lets you get the minimums, maximums
and their respective indices with **minmaximum** giving the same values
as above.

```
pdl> ($min, $max, $min_ind, $max_ind) = minmaximum $cars
```

Be careful! Some of those functions like **minmax** act on the whole piddle.

```
pdl> ($mn, $mx) = minmax $cars
pdl> p $mn, $mx
0 472
```

That could be useful for looking a file of unknown origin which you want to get to know or 2D images you want to compare. Maybe you're visualizing the rate of climate change and you have data cubes of daily worldwide temperatures[0] in files separated by year. You'll be glad that PDL has support for bad values

Continuing with the median, mode, first and third quartile.

```
pdl> $median = medover $cars
pdl> p $median
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]
pdl> $mode = modeover $cars
pdl> p $mode
[15 8 275 110 3 3 17 0 0 3 2]
pdl> $q1 = pctover $cars, 0.25
pdl> $q3 = pctover $cars, 0.75
pdl> p join "\n", $min, $q1, $median, $q3, $max
[10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1]
[15.425 4 120.825 96.5 3.08 2.58125 16.8925 0 0 3 2]
[19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2]
[22.8 8 326 180 3.92 3.61 18.9 1 1 4 4]
[33.9 8 472 335 4.93 5.424 22.9 1 1 5 8]
```

The functions ending with the keyword **over** compute *over* the first dimension,
not the whole piddle.
Of course, **pctover** with the second argument
in the range between 0.0 and 1.0
will give you any percentile you desire.
See
PDL::Reduce
for an alternative interface.

For the standard deviation, variance, etc,
it is time to saddle up
PDL::Stats::Basic
You can play with **skew** and **kurt**osis in
your
own
time

```
pdl> $stdv = stdv $cars
pdl> p $stdv
[ 5.9320296 1.7577951 121.98678 67.483071 0.52625807 0.9630477 1.7588006 0.49607837 0.4911323 0.72618438 1.5897622]
pdl> $variance = var $cars
pdl> p $variance
[ 35.188975 3.0898438 14880.775 4553.9648 0.27694756 0.92746088 3.0933797 0.24609375 0.24121094 0.52734375 2.5273438]
```

That module also has functions for the standard error, sum of squares, Student's T-test and friends, both biased and unbiased.

All of these numbers displayed out to 8 significant figures is kinda getting to me. I should ask about how to "pretty print" ... later.

## Now where was I? Oh yes, Normalization

Let me start with a simple example, so that we know when we're going wrong. So use this as our warm up.

```
pdl> $x = sequence(6,5)
pdl> p $x
[
[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
]
```

These values are all positive, so we could just divide all values by the maximum for each dimension.

```
pdl> $max = $x->maximum
pdl> p $x / $max
PDL: PDL::Ops::divide(a,b,c): Parameter 'b':
Mismatched implicit thread dimension 0: size 6 vs. 5
```

Oops! The dimensions need to match, so we use the
**dummy**
function. (see Chapter 5 of the
PDL Book
for a good explanation)

```
pdl> p $x / $max->dummy(0)
[
[ 0 0.2 0.4 0.6 0.8 1]
[0.54545455 0.63636364 0.72727273 0.81818182 0.90909091 1]
[0.70588235 0.76470588 0.82352941 0.88235294 0.94117647 1]
[ 0.7826087 0.82608696 0.86956522 0.91304348 0.95652174 1]
[0.82758621 0.86206897 0.89655172 0.93103448 0.96551724 1]
]
```

The values in that top row look nicely spread out, but the bottom row is squished into the top 20% of the range.

For better discrimination, we'll subtract the minimum from all values in that dimension to make the range start from zero and divide by the range of values. This has the added benefit of handling negative values with no extra thinking.

```
pdl> $min = $x->minimum
pdl> p $min, $max
[0 6 12 18 24] [5 11 17 23 29]
pdl> $range = $max - $min
pdl> p $range
[5 5 5 5 5]
pdl> p (( $x - $min->dummy(0) ) / $range->dummy(0) )
[
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
[ 0 0.2 0.4 0.6 0.8 1]
]
```

### z shoots! z-scores!

You high-flyers out there might prefer the
Standard score
or **z-score** is a measure of how many standard deviations
a value is away from the mean. Care needs to be taken ... yada yada.
Care is for other people. We want an equation!

where μ is the mean and σ is the standard deviation which we have from above.

```
pdl> $avg = average $x
pdl> $sigma = stdv $x
pdl> p $avg
[2.5 8.5 14.5 20.5 26.5]
pdl> p $sigma
[ 1.7078251 1.7078251 1.7078251 1.7078251 1.7078251]
pdl> $z = (($x - $avg->dummy(0))/$sigma->dummy(0))
pdl> p $z
[
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
[ -1.4638501 -0.87831007 -0.29277002 0.29277002 0.87831007 1.4638501]
]
```

Easy peasy, lemon squeezy. So it's not in the range [0,1] or even [-1,1], but it's regular enough to cluster with. Based on the average and variance of the population, it doesn't suffer as much from outliers affecting most of the data points, like the min/max version. For a skewed distribution, you could try a version using the median and the interquartile range (the difference between the first and third quartiles). Understanding your dataset is the clue to choosing the best strategy.

Waffling on a bit, other use cases have other transforms. This implementation of k-means uses the Euclidean distance to form clusters, but if you were clustering documents using word vectors you'd likely be using the cosine similarity and scaling the vectors by the inverse document-frequency of the terms. This technique weights rare terms higher than common ones in order to eliminate some of the noise associated with clustering in very high dimensions.

*... reads documentation*

Hey, wait! There was a **norm** function hiding in the documentation all along.

```
pdl> p norm $x
[
[ 0 0.13483997 0.26967994 0.40451992 0.53935989 0.67419986]
[0.28252897 0.32961713 0.37670529 0.42379345 0.47088161 0.51796977]
[0.33554129 0.36350307 0.39146484 0.41942662 0.44738839 0.47535017]
[0.35722443 0.37707023 0.39691604 0.41676184 0.43660764 0.45645344]
[0.36896887 0.38434258 0.39971628 0.41508998 0.43046368 0.44583739]
]
```

What was **that?!?** Not what I was expecting. *backs away slowly, closes door*

Now that I know how to normalize a piddle amongst other things, I really should get back to clustering cars. That's what you came here for, right?

- [0] The files from the Climatic Research Unit (UoEA) are in NetCDF format, so you'll want to check out PDL::NetCDF to read them.

My replies to comments on the previous post haven't appeared (found them in the spam bucket), so I'll take this space to continue that discussion as I get time.

Thanks, Saif!

Personally, I warm to the serialized posts and conversational tone of authors like byterock and others. The blogging technique I use is to have a vague idea of what I'd like to accomplish, open up an editor and write as I try and make it work, so you get to see a lot of dead ends.

As for a new PDL Book, that takes a lot of work if you just sit down, research and write it. On the other hand, lots of blogs make a chapter and enough chapters make a book. Dave Cross has tried to make publishing Perl books easier through his Perl School) site. There are no delusions that there is any money in this, but that's not really the reason why we write software.

It could be a collaborative effort, if you're interested?