## Your benchmarks suck!

Virtually every programmer is obsessed with writing *FAST* code. Curiously, this extends even to those who prefer dynamic languages such as Perl over naturally faster, more low-level languages such as C. Few among us can resist the urge to micro-optimize and a cynical version of myself would claim that the best we can expect is that programmers prove effectiveness of their optimizations with benchmarks.

Wait!

Proof? What I should have written is that they attempt to demonstrate the effect of their optimization by timing it versus another variant. Arriving at a resilient conclusion from a benchmark is hard. It doesn't only take plenty of CPU time, it also takes plenty of brain cycles and experience. People will often publish the result of a simple

```
$ time ./somecommand --options
real 0m2.005s
user 0m2.000s
sys 0m0.005s
$ time ./somecommand --otheroptions
real 0m3.005s
user 0m3.000s
sys 0m0.005s
```

and even if they don't draw a conclusion themselves, they are potentially misleading others. Unfortunately, this situation isn't easily fixable. People (usually) have neither the persistence nor the expertise to do much better. Since this is a pet-peeve of mine, I tried to create an almost-drop-in replacement for "time" in the above incantation that should, on average, produce more conclusive results. It's called *dumbbench* and is available on github only. I claim neither completeness nor correctness.

With *dumbbench*, you trade extra CPU cycles for a statement of the uncertainty on the result and some robustness of the result itself. It doesn't fundamentally solve the problem that in all likeliness *your benchmark doesn't matter*. You now do:

```
$ dumbbench -- ./some-command --options
Ran 23 iterations of the command.
Rejected 3 samples as outliers.
Rounded run time per iteration: 9.519e-01 +/- 3.7e-03 (0.4%)
```

Okay, I admit this is harder to read than the original, but not much. It ran the benchmark 23 times, did some statistics with the results, decided that three of the runs were bad, and then arrived at the conclusion that your code took 0.95 seconds to run. The uncertainty on that measurement is only 0.4%.

Even if you don't care about the details, rest assured that this measurement is likely more reliable *and* it will give others more clues how to interpret your results.

The following essay is taken from the *dumbbench* documentation and goes into more detail why benchmarks suck and how it tries to work around the inevitable. *How it works and why it doesn't...*

# Why it doesn't work and why we try regardless

Recall that the goal is to obtain a reliable estimate of the run-time of a certain operation or command. Now, please realize that this is impossible since the run-time of an operation may depend on many things that can change rapidly: Modern CPUs change their frequency dynamically depending on load. CPU caches may be invalidated at odd moments and page faults provide less fine-grained distration. Naturally, OS kernels will do weird things just to spite you. It's almost hopeless.

Since people (you, I, everybody!) insist on benchmarking anyway, this is a best-effort at estimating the run-time. Naturally, it includes estimating the uncertainty of the run time. This is extremely important for comparing multiple benchmarks and that is usually the ultimate goal. In order to get an estimate of the expectation value and its uncertainty, we need a model of the underlying distribution:

# A model for timing results

Let's take a step back and think about how the run-time of multiple
invocations of the same code will be distributed. Having a qualitative
idea what the distribution of many (**MANY**) measurements looks like is
extremely important for estimating the expectation value and uncertainty
from a sample of few measurements.

In a perfect, deterministic, single-tasking computer, we will get N times the
exact same timing. In the real world, there are at least a million ways that
this assumption is broken on a small scale. For each run, the load of the
computer will be slightly different. The content of main memory and CPU
caches may differ. All of these small effects will make a given run a tiny
bit slower or faster than any other. Thankfully, this is a case where statistics (more precisely
the Central Limit Theorem) provides us with the *qualitative* result: The
measurements will be normally distributed (i.e. following a Gaussian
distribution) around some expectation value (which happens to be the mean in this case).
Good. Unfortunately, benchmarks are more evil than that. In addition to the small-scale
effects that smear the result, there are things that (at the given run time of the benchmark)
may be large enough to cause a large jump in run time. Assuming these are
comparatively rare and typically cause extraordinarily long run-times (as opposed to
extraordinarily low run-times), we arrive at an overall model of
having a central, smoothish normal distribution with a few outliers towards
long run-times.

So in this model, if we perform *N* measurements, almost all *N* times
will be close to the expectation value and a fraction will be significantly higher.
This is troublesome because the outliers create a bias in the uncertainty
estimation and the asymmetry of the overall distribution will bias a simple
calculation of the mean.

What we would like to report to the user is the mean and uncertainty of the main distribution while ignoring the outliers.

Before I go into the details of how we can account for the various complications, let me show you an example of a benchmark result that defies all attempts at automatically arriving at a quantitative result. You know. Just so you don't imagine you're safe if you follow my advice!

In this example, you can see several disjoint distributions, each with its own bit of jitter around it. Possibly, the differences are caused by page faults or CPU frequency changes. I can't tell and that's exactly the point of the example because I'd wager that neither can you!

## A robust estimation of the expectation value

Given the previously discussed model, we estimate the expectation value with the following algorithm:

Calculate the median of the whole distribution. The median is a fairly robust estimator of the expectation value with respect to outliers (assuming they're comparatively rare).

Calculate the median-absolute-deviation from the whole distribution (MAD, see wikipedia). The MAD needs rescaling to become a measure of variability. The MAD will be our initial guess for an uncertainty. Like the median, it is quite robust against outliers.

We use the median and MAD to remove the tails of our distribution. All timings that deviate from the median by more than

*X*times the MAD are rejected. This measure should cut outliers without introducing much bias both in symmetric and asymmetric source distributions.An alternative would be to use an ordinary truncated mean (that is the mean of all timings while disregarding the

*N*largest and*N*smallest results). But the truncated mean can produce a biased result in asymmetric source distributions. The resulting expectation value would be artificially increased.In summary: Using the median as the initial guess for the expectation value and the MAD as the guess for the variability keeps the bias down in the general case.

Finally, the use the mean of the truncated distribution as the expectation value and the MAD of the truncated distribution as a measure of variability. To get the uncertainty on the expectation value, we take

*MAD / sqrt(N)*where*N*is the number of remaining measurements.

## Conclusion

I hope I could convince you that interpreting less sophisticated benchmarks is a dangerous if not futile exercise. The reason this module exists is that not everybody is willing to go through such contortions to arrive at a reliable conclusion, but everybody loves benchmarking. So let's at least get the basics right. Do not compare raw timings of meaningless benchmarks but robust estimates of the run time of meaningless benchmarks instead.

## Disclaimer

This whole rant (and writing the program) was inspired by a recent thread in a certain mailing list. Neither the title nor the content of this post are intended as a slight to anybody involved in the discussion. I'm simply venting long-standing frustration.

This looks really cool, but there's an even deeper underlying problem with many benchmarks. I regularly see benchmarks posted that don't even measure the right thing.

That's not really a problem that can be fixed with code, though ;)

Cool! This was the most constructive frustration venting I've ever seen.

Hi Dave,

I completely agree that the most severe problem is that people benchmark the wrong thing most of the time. That's what I was driving at with

"It doesn't fundamentally solve the problem that in all likeliness your benchmark doesn't matter."

--Steffen

Maybe it does not fundamentally solve the problem of irrelevant benchmarks - but it certainly raises the awareness about the complexities hiding there.

Addendum: Code now available from CPAN as 'Dumbbench'.

This makes me want to make whisker plots of benchmarking results. I wonder if that's easier to do in Perl or if I should just feed the data to R.

Hi brian,

Check out the "boxPlot" branch from the github repository, install the dependencies, and run:

`perl -Ilib script/dumbbench -p 0.001 --code='local $i; $i++ for 1..1e6' --code='local $i; $i++ for 1..1.2e6' --code='local $i; $i++ for 1..1.3e6' --plot_timings`

Then you get something like this.

Unfortunately, this is giant overkill. Installing the SOOT module will auto-install a giant C++ library for data analysis. Sorry.