# statistics Archives

## The physicist's way out

Previously, I wrote about modeling the result of repeated benchmarks. It turns out that this isn't easy. Different effects are important when you benchmark run times of different magnitudes. The previous example ran for about 0.05 seconds. That's an eternity for computers. Can a simple model cover the result of such a benchmark as well as that of a run time on the order of microseconds? Is it possible to come up with a simple model for any case at all? The typical physicist's way of testing a model for data is to write a simulation. It's quite likely a model has some truth if we can generate fake data sets from the model that look like the original, real data. For reference, here is the real data that I want to reproduce (more or less):

So I wrote a toy Monte Carlo simulation. The code is available from the dumbbench github repository in the folder simulator. Recall that the main part of the model was that I assume a normally distributed measurement around the true run time with an added set of outliers which are biased to much longer run times. That is what this MC does: For every timing, draw a random number from a normal distribution around the true value and in a fraction of all cases, add an offset (again with an uncertainty) to make it an outlier. With some fine tuning of the parameters, I get as close as this:

Yes, I know it's not the same thing. Humans are excellent at telling things apart that aren't exactly equal. But don't give up on me just yet. What you see in the picture is three lines: The black is mostly covered by the others. It's the raw distribution of times in the Monte Carlo. The red curve is the set of timings that were accepted for calculating the expectation value by the Dumbbench algorithm. The blue timings were discarded as outliers.

The simulation reproduces quite a few properties fairly well by construction: The main distribution is in the right spot and has the right width if a bit narrow. The far outliers have about the same distribution. The one striking difference is that in the real data, the main distribution isn't really following a Gaussian. It's skewed. I could try to sample from a different distribution in the simulation, but let's keep the Gaussian for a while since that's an underlying assumption of the analysis. Here's the output of the simulation:

``````Detected a required 1 iterations per run.
timings:           346
good timings:      319
outlier timings:   27
true time:         5.e-2
before correction: 5.0005e-02 +/- 3.1e-05 (mean: 0.0506163970895954)
after correction:  4.9973e-02 +/- 2.9e-05 (mean: 0.0499712054670846)
``````

correction refers to the outlier rejection done by dumbbench. Clearly, it's not a huge deal in this case. Even the uncorrected mean would have been acceptable since the fraction of outliers is so low. But this was an optimal case. Long benchmark duration, but not so long that I couldn't conveniently accumulate some data. What if I want to benchmark `++\$i` and see if it's any faster than the post-increment `\$i++`? Let's ignore the comparison for now and just look at the data I get from benchmarking the post-increment. I run dumbbench with 100000 timings, skip the dry-run subtraction, and care neither about optimizing the absolute nor relative precision:

``````perl -Ilib bin/dumbbench -i 100000 --no-dry-run -a 0 -p 0.99999 --code='\$i++' --plot_timings
``````

``````Ran 121550 iterations (21167 outliers).
Rounded run time per iteration: 4.23978e-06 +/- 1.4e-10 (0.0%)
[disregard the errors on this one]
``````

Woah! Rats, what's that? This graph shows a lot of extra complications. Most prominently, the measurement of the time is done in discrete units. That's not terribly surprising since the computer has a finite frequency. The hi-res walltime clock seems to have a clock tick of about 30ns on this machine. Another thing to note is that my computer can certainly increment Perl variables more than a million times per second, so the timing is significantly off. This is because dumbbench will go through some extra effort to run the benchmark in a clean environment. There is also the overhead of taking the time before and after running the code. This is why normally, dumbbench will subtract a (user-configurable) dry run from the result and propagate the uncertainties for you. On the upside, the main distribution looks (overall) much more Gaussian than in the long-running benchmark. Let's add the discretization effect to our model and try to simulate this data:

Considering the simplicity of what I'm putting in, this isn't all that bad! Let's see how well dumbbench can recover the true time:

``````Detected a required 32 iterations per run.
timings:           120000
good timings:      92373
outlier timings:   27627
true time:         4.25e-6
before correction: 4.247000e-06 +/- 8.9e-11 (mean: 4.31880168333532e-06)
after correction:  4.24700e-06 +/- 1.0e-10 (mean: 4.25444989336903e-06)
``````

Again, the correction isn't important. But in this case, that is mostly due to the discretization of the measurement. If there's a lot of measurements at x and at x+1 but none in between, then the median can't get any closer. If you take a look at the mean before and after correction, you can see that the outlier rejection was indeed effective. It significantly reduced the bias of the mean.

From this little experiment, I deduce that while the simple model is clearly not perfect (remember the skew of the main distribution), it isn't entirely off and works more or less across radically different conditions. Furthermore, using the model to simulate benchmarks with known "true" time, I saw that the analysis produces a good estimate of the input. It's just a toy, but it's served its purpose. I'm more confident in the setup now than before -- even without diving very far into statistics.

## Hard data for timing distributions

In the previous article, I wrote about the pitfalls of benchmarking and dumbbench, a tool that is meant to make simple benchmarks more robust.

The most significant improvement over `time ./cmd` is that it actually comes with a moderately well motivated model of the time distribution of invoking `./cmd` many times. In data analysis, it is very important to know the underlying statistical distribution of your measurement quantity. I assume that most people remember from high school that you can calculate the mean and the standard deviation ("error") of data and use those two numbers as an estimate of the true value and a statement of the uncertainty. That is a reasonable thing to do if the measurement quantity has a normal (or Gaussian) distribution. Fortunately for everybody, normal distributions are very common because if you add up enough statistics, chances are that the result will be almost Gaussian (Central Limit Theorem).

Unfortunately for you, when you run a benchmark, you would like it to produce a good answer and finish before the heat death of the universe. That means the friendly Central Limit Theorem doesn't apply cleanly and we have to put a little more thought into the matter to extract more information. In the second half of the previous article, I suggested a simple recipe for analyzing benchmark data that mostly amounted to: The main distribution of timings is Gaussian, but there is a fraction of the data, the outliers, that have significantly increased run time. If we lose those, we can calculate mean and uncertainty. But I didn't show you actual data of a reasonable benchmark run. Let's fix that:

I dumbbench as follows:

``````dumbbench -p 0.001 --code='local \$i; \$i++ for 1..1e6' --code='local \$i; \$i++ for 1..1.1e6' --code='local \$i; \$i++ for 1..1.2e6' --plot_timings
``````

With `-p 0.001`, I'm saying that I want at most an uncertainty of 0.1%. It runs three benchmarks: code1, 2, and 3. They're all the same except that code 2 runs 10% more iterations than code 1 and code 3 runs 20% more iterations. I would expect the resulting run times to be related in a similar fashion. Here is the output of the run:

``````  Ran 544 iterations of the command.
Rejected 53 samples as outliers.
Rounded run time per iteration: 4.5851e-02 +/- 4.6e-05 (0.1%)
Ran 346 iterations of the command.
Rejected 25 samples as outliers.
Rounded run time per iteration: 5.0195e-02 +/- 5.0e-05 (0.1%)
Ran 316 iterations of the command.
Rejected 18 samples as outliers.
Rounded run time per iteration: 5.4701e-02 +/- 5.4e-05 (0.1%)
``````

A little calculation shows that code2 takes 9.5% longer than code1 and code3 19.3%. Fair enough. Since I installed the SOOT module, the `--plot_timings` option will pop up a bunch of windows with plots for my amusement. Here's the timing distributions for code1 and code2:

Clearly, the two look qualitatively similar, but note the slightly different scale on the x axis. There are good and bad news. The good news are that indeed, there is a main distribution and a bunch of outliers. Clearly, getting rid of the outliers would be a win. The implemented procedure does that fairly well, but it's a bit too strict. The bad news is that the main distribution isn't entirely Gaussian. A better fit may have been a convolution of a Gaussian and an exponential, but I digress.

Let me use the digression as an excuse for another, MJD-style. brian d foy's comment on the previous entry reminded me of a convenient non-parametric way of comparing samples. The box and whisker plot:

I don't think I could explain it better than the Wikipedia article linked above, but here's a summary: For each of the three benchmarks, the respective gray box includes exactly half of the data. That is, if you cut the distribution in three chunks: The lowest 25%, the mid 50%, and the upper 25%, then the box includes the mid part. The big black marker in the box is the median of the distribution. The "error bars" (whiskers) stretch from the end of the box (i.e. 25% of data from either side of the median) to the largest (or smallest) datum that is not an outlier. Here, outliers are defined as data that is further away from the box than 1.5 times the height of the box.

At one glance, we can see that the whiskers are asymmetric and there are a lot of outliers on one side. An effective way for quickly comparing several distributions.

Back on topic: The above example benchmarked fairly long running code. A lot of times, programmers idly wonder whether some tiny bit of code will be faster than another. This is much harder to benchmark since the shorter the benchmark run, the larger the effect of small disturbances. The best solution is to change your benchmark to take longer, of course. I'll try to write about the pain of benchmarking extremely short-duration pieces of code next time.

## Your benchmarks suck!

Virtually every programmer is obsessed with writing FAST code. Curiously, this extends even to those who prefer dynamic languages such as Perl over naturally faster, more low-level languages such as C. Few among us can resist the urge to micro-optimize and a cynical version of myself would claim that the best we can expect is that programmers prove effectiveness of their optimizations with benchmarks.

Wait!

Proof? What I should have written is that they attempt to demonstrate the effect of their optimization by timing it versus another variant. Arriving at a resilient conclusion from a benchmark is hard. It doesn't only take plenty of CPU time, it also takes plenty of brain cycles and experience. People will often publish the result of a simple

``````\$ time ./somecommand --options
real    0m2.005s
user    0m2.000s
sys     0m0.005s

\$ time ./somecommand --otheroptions
real    0m3.005s
user    0m3.000s
sys     0m0.005s
``````

and even if they don't draw a conclusion themselves, they are potentially misleading others. Unfortunately, this situation isn't easily fixable. People (usually) have neither the persistence nor the expertise to do much better. Since this is a pet-peeve of mine, I tried to create an almost-drop-in replacement for "time" in the above incantation that should, on average, produce more conclusive results. It's called dumbbench and is available on github only. I claim neither completeness nor correctness.

With dumbbench, you trade extra CPU cycles for a statement of the uncertainty on the result and some robustness of the result itself. It doesn't fundamentally solve the problem that in all likeliness your benchmark doesn't matter. You now do:

``````\$ dumbbench -- ./some-command --options
Ran 23 iterations of the command.
Rejected 3 samples as outliers.
Rounded run time per iteration: 9.519e-01 +/- 3.7e-03 (0.4%)
``````

Okay, I admit this is harder to read than the original, but not much. It ran the benchmark 23 times, did some statistics with the results, decided that three of the runs were bad, and then arrived at the conclusion that your code took 0.95 seconds to run. The uncertainty on that measurement is only 0.4%.

Even if you don't care about the details, rest assured that this measurement is likely more reliable and it will give others more clues how to interpret your results.

The following essay is taken from the dumbbench documentation and goes into more detail why benchmarks suck and how it tries to work around the inevitable. How it works and why it doesn't...

# Why it doesn't work and why we try regardless

Recall that the goal is to obtain a reliable estimate of the run-time of a certain operation or command. Now, please realize that this is impossible since the run-time of an operation may depend on many things that can change rapidly: Modern CPUs change their frequency dynamically depending on load. CPU caches may be invalidated at odd moments and page faults provide less fine-grained distration. Naturally, OS kernels will do weird things just to spite you. It's almost hopeless.

Since people (you, I, everybody!) insist on benchmarking anyway, this is a best-effort at estimating the run-time. Naturally, it includes estimating the uncertainty of the run time. This is extremely important for comparing multiple benchmarks and that is usually the ultimate goal. In order to get an estimate of the expectation value and its uncertainty, we need a model of the underlying distribution:

# A model for timing results

Let's take a step back and think about how the run-time of multiple invocations of the same code will be distributed. Having a qualitative idea what the distribution of many (MANY) measurements looks like is extremely important for estimating the expectation value and uncertainty from a sample of few measurements.

In a perfect, deterministic, single-tasking computer, we will get N times the exact same timing. In the real world, there are at least a million ways that this assumption is broken on a small scale. For each run, the load of the computer will be slightly different. The content of main memory and CPU caches may differ. All of these small effects will make a given run a tiny bit slower or faster than any other. Thankfully, this is a case where statistics (more precisely the Central Limit Theorem) provides us with the qualitative result: The measurements will be normally distributed (i.e. following a Gaussian distribution) around some expectation value (which happens to be the mean in this case). Good. Unfortunately, benchmarks are more evil than that. In addition to the small-scale effects that smear the result, there are things that (at the given run time of the benchmark) may be large enough to cause a large jump in run time. Assuming these are comparatively rare and typically cause extraordinarily long run-times (as opposed to extraordinarily low run-times), we arrive at an overall model of having a central, smoothish normal distribution with a few outliers towards long run-times.

So in this model, if we perform N measurements, almost all N times will be close to the expectation value and a fraction will be significantly higher. This is troublesome because the outliers create a bias in the uncertainty estimation and the asymmetry of the overall distribution will bias a simple calculation of the mean.

What we would like to report to the user is the mean and uncertainty of the main distribution while ignoring the outliers.

Before I go into the details of how we can account for the various complications, let me show you an example of a benchmark result that defies all attempts at automatically arriving at a quantitative result. You know. Just so you don't imagine you're safe if you follow my advice!

In this example, you can see several disjoint distributions, each with its own bit of jitter around it. Possibly, the differences are caused by page faults or CPU frequency changes. I can't tell and that's exactly the point of the example because I'd wager that neither can you!

## A robust estimation of the expectation value

Given the previously discussed model, we estimate the expectation value with the following algorithm:

1. Calculate the median of the whole distribution. The median is a fairly robust estimator of the expectation value with respect to outliers (assuming they're comparatively rare).

2. Calculate the median-absolute-deviation from the whole distribution (MAD, see wikipedia). The MAD needs rescaling to become a measure of variability. The MAD will be our initial guess for an uncertainty. Like the median, it is quite robust against outliers.

3. We use the median and MAD to remove the tails of our distribution. All timings that deviate from the median by more than X times the MAD are rejected. This measure should cut outliers without introducing much bias both in symmetric and asymmetric source distributions.

An alternative would be to use an ordinary truncated mean (that is the mean of all timings while disregarding the N largest and N smallest results). But the truncated mean can produce a biased result in asymmetric source distributions. The resulting expectation value would be artificially increased.

In summary: Using the median as the initial guess for the expectation value and the MAD as the guess for the variability keeps the bias down in the general case.

4. Finally, the use the mean of the truncated distribution as the expectation value and the MAD of the truncated distribution as a measure of variability. To get the uncertainty on the expectation value, we take MAD / sqrt(N) where N is the number of remaining measurements.

## Conclusion

I hope I could convince you that interpreting less sophisticated benchmarks is a dangerous if not futile exercise. The reason this module exists is that not everybody is willing to go through such contortions to arrive at a reliable conclusion, but everybody loves benchmarking. So let's at least get the basics right. Do not compare raw timings of meaningless benchmarks but robust estimates of the run time of meaningless benchmarks instead.

## Disclaimer

This whole rant (and writing the program) was inspired by a recent thread in a certain mailing list. Neither the title nor the content of this post are intended as a slight to anybody involved in the discussion. I'm simply venting long-standing frustration.

### About Steffen Mueller

Physicist turned software developer. Working on Infrastructure Development at Booking.com. Apparently the only person who thinks the perl internals aren't nearly as bad as people make them. See also: the Booking.com tech blog, CPAN.