## Hard data for timing distributions

In the previous article, I wrote about the pitfalls of benchmarking and dumbbench, a tool that is meant to make simple benchmarks more robust.

The most significant improvement over `time ./cmd` is that it actually comes with a moderately well motivated model of the time distribution of invoking `./cmd` many times. In data analysis, it is very important to know the underlying statistical distribution of your measurement quantity. I assume that most people remember from high school that you can calculate the mean and the standard deviation ("error") of data and use those two numbers as an estimate of the true value and a statement of the uncertainty. That is a reasonable thing to do if the measurement quantity has a normal (or Gaussian) distribution. Fortunately for everybody, normal distributions are very common because if you add up enough statistics, chances are that the result will be almost Gaussian (Central Limit Theorem).

Unfortunately for you, when you run a benchmark, you would like it to produce a good answer and finish before the heat death of the universe. That means the friendly Central Limit Theorem doesn't apply cleanly and we have to put a little more thought into the matter to extract more information. In the second half of the previous article, I suggested a simple recipe for analyzing benchmark data that mostly amounted to: The main distribution of timings is Gaussian, but there is a fraction of the data, the outliers, that have significantly increased run time. If we lose those, we can calculate mean and uncertainty. But I didn't show you actual data of a reasonable benchmark run. Let's fix that:
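That recipe can be sketched in a few lines of Python. Everything below is illustrative: the sample data is simulated, and the outlier cutoff (three robust standard deviations above the median, with the width estimated via the median absolute deviation) is my choice for the sketch, not dumbbench's exact criterion.

```python
import math
import random

def reject_outliers(times, n_sigma=3.0):
    """Drop samples far above the bulk of the distribution.

    Uses the median absolute deviation (MAD) as a robust width
    estimate; the one-sided 3-sigma cutoff is an illustrative
    choice, not dumbbench's exact criterion."""
    s = sorted(times)
    median = s[len(s) // 2]
    mad = sorted(abs(t - median) for t in s)[len(s) // 2]
    sigma = 1.4826 * mad  # MAD-to-Gaussian-sigma conversion factor
    return [t for t in times if t - median <= n_sigma * sigma]

def mean_and_error(times):
    """Mean and standard error of the mean of the kept samples."""
    n = len(times)
    mean = sum(times) / n
    var = sum((t - mean) ** 2 for t in times) / (n - 1)
    return mean, math.sqrt(var / n)

# Simulated benchmark: a Gaussian bulk plus occasional slow outliers.
random.seed(42)
samples = [random.gauss(0.0458, 0.0005) for _ in range(500)]
samples += [0.0458 + random.expovariate(1 / 0.005) for _ in range(30)]

kept = reject_outliers(samples)
mean, err = mean_and_error(kept)
print(f"{mean:.4e} +/- {err:.1e} ({len(samples) - len(kept)} rejected)")
```

Only the high side is clipped, because scheduling noise, cache misses and the like only ever make a run slower, never faster.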

I ran dumbbench as follows:

```
dumbbench -p 0.001 --code='local $i; $i++ for 1..1e6' --code='local $i; $i++ for 1..1.1e6' --code='local $i; $i++ for 1..1.2e6' --plot_timings
```

With `-p 0.001`, I'm saying that I want a relative uncertainty of at most 0.1%. It runs three benchmarks: code1, 2, and 3. They're all the same except that code 2 runs 10% more iterations than code 1, and code 3 runs 20% more. I would expect the resulting run times to be related in a similar fashion. Here is the output of the run:

```
Ran 544 iterations of the command.
Rejected 53 samples as outliers.
Rounded run time per iteration: 4.5851e-02 +/- 4.6e-05 (0.1%)
Ran 346 iterations of the command.
Rejected 25 samples as outliers.
Rounded run time per iteration: 5.0195e-02 +/- 5.0e-05 (0.1%)
Ran 316 iterations of the command.
Rejected 18 samples as outliers.
Rounded run time per iteration: 5.4701e-02 +/- 5.4e-05 (0.1%)
```

A little calculation shows that code2 takes 9.5% longer than code1, and code3 19.3% longer. Fair enough. Since I installed the SOOT module, the `--plot_timings` option pops up a bunch of windows with plots for my amusement. Here are the timing distributions for code1 and code2. Clearly, the two look qualitatively similar, but note the slightly different scales on the x axes. There is good news and bad news. The good news is that there is indeed a main distribution plus a bunch of outliers, so getting rid of the outliers should be a win. The implemented procedure does that fairly well, though it's a bit too strict. The bad news is that the main distribution isn't entirely Gaussian. A better fit might have been a convolution of a Gaussian and an exponential, but I digress.

Let me use that digression as an excuse for another one, MJD-style. brian d foy's comment on the previous entry reminded me of a convenient non-parametric way of comparing samples: the box-and-whisker plot. I don't think I could explain it better than the Wikipedia article linked above, but here's a summary. For each of the three benchmarks, the respective gray box includes exactly half of the data. That is, if you cut the distribution into three chunks, the lowest 25%, the middle 50%, and the upper 25%, then the box covers the middle part. The big black marker in the box is the median of the distribution. The "error bars" (whiskers) stretch from the ends of the box (i.e. 25% of the data on either side of the median) to the largest (or smallest) datum that is not an outlier. Here, outliers are defined as data that lie further from the box than 1.5 times the height of the box.
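The box-and-whisker quantities described above are easy to compute yourself. A minimal sketch (using the common linear-interpolation quartile definition; statistics packages differ slightly in this detail):

```python
def quartiles(data):
    """Lower quartile, median, upper quartile via linear
    interpolation (the common 'type 7' definition)."""
    s = sorted(data)
    def q(p):
        k = p * (len(s) - 1)
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return q(0.25), q(0.5), q(0.75)

def box_stats(data):
    """The quantities drawn by a box-and-whisker plot."""
    q1, med, q3 = quartiles(data)
    iqr = q3 - q1  # height of the box: the middle 50% of the data
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_lo": min(inside),  # smallest non-outlier datum
        "whisker_hi": max(inside),  # largest non-outlier datum
        "outliers": [x for x in data if x < lo_fence or x > hi_fence],
    }

stats = box_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
print(stats)  # the 20 lies beyond the upper fence: an outlier
```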

At a glance, we can see that the whiskers are asymmetric and that there are a lot of outliers on one side. It's an effective way to quickly compare several distributions.

Back on topic: The above example benchmarked fairly long running code. A lot of times, programmers idly wonder whether some tiny bit of code will be faster than another. This is much harder to benchmark since the shorter the benchmark run, the larger the effect of small disturbances. The best solution is to change your benchmark to take longer, of course. I'll try to write about the pain of benchmarking extremely short-duration pieces of code next time.

Visually, from the box-and-whisker plot, it looks like the median time (on a sufficiently large sample) might be a reasonable benchmark metric. I'd be curious to see if that holds up on other types of test code with a predictable relation in runtimes. (E.g. sorting N randomly distributed items for various values of N.)

I'm not discounting the value of full analysis of the shape of the distributions, but simple metrics are often very useful.

I'd also be interested in hearing your thoughts on a reasonably good statistical test to see if benchmarks are significantly different. (E.g. chi-squared on the actual result distributions?)

Every time I see your plots, I smell machine oil. Seriously. I'm taken back to the times I had to sit inside an accelerator control room and watch lights go on and off to ensure that particles were still going down the pipe. :)

Brian:

In this case, you don't really need a statistical test to check the significance of the difference between medians. A lack of overlap between the interquartile ranges (the dark shaded boxes) is sufficient to show a statistically significant difference, but that's a very conservative approach. A one-way analysis of variance would also be sufficient, but its results can become misleading for very large numbers of observations.

I'm a very big fan of the notched boxplot procedure (demo in R here). This provides a graphical estimate of the 95% confidence interval of the median and so is analogous to the T test. The original paper on notched boxplots and other creative uses of this very useful graphing procedure is here.

With a confidence interval on the sample median, we'd lose most of the problems associated with the parametric approach.
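For reference, the notch half-width used by R's notched boxplot is 1.58 * IQR / sqrt(n) around the median (McGill, Tukey and Larsen, 1978). A small sketch, with made-up samples in the spirit of the code1 and code2 timings above:

```python
import math

def median_notch(data):
    """Approximate 95% confidence interval of the median, as in
    R's notched boxplot: median +/- 1.58 * IQR / sqrt(n)."""
    s = sorted(data)
    n = len(s)
    def q(p):  # linear-interpolation quantile
        k = p * (n - 1)
        lo = int(k)
        hi = min(lo + 1, n - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    med = q(0.5)
    iqr = q(0.75) - q(0.25)
    half = 1.58 * iqr / math.sqrt(n)
    return med - half, med + half

# Made-up timing samples centred on the code1 and code2 results.
a = [0.0458 + 0.0001 * i for i in range(-20, 21)]
b = [0.0502 + 0.0001 * i for i in range(-20, 21)]
lo_a, hi_a = median_notch(a)
lo_b, hi_b = median_notch(b)
# Non-overlapping notches suggest significantly different medians.
print("notches overlap:", hi_a >= lo_b)
```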

There's a very easy way to determine whether a parametric approach has problems. It works because a non-parametric approach entails information loss: you no longer use the interval information in the data, only the ordinal information (i.e. we no longer assume that the difference between 1 and 2 is the same as the difference between 2 and 3, for example). So if we run a one-way ANOVA on a bunch of data and also run the corresponding Kruskal-Wallis test (the non-parametric equivalent) on the same data, and the p value for the parametric test is lower than the p value for the non-parametric test, then we can reasonably assume that we've met the parametric procedure's assumptions. Parametric procedures are by and large pretty forgiving.
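The ANOVA-versus-Kruskal-Wallis comparison can be sketched as follows. The data here is made up, and the p values themselves would come from the F and chi-squared distributions (e.g. via `scipy.stats.f_oneway` and `scipy.stats.kruskal`), so this sketch only computes the two test statistics:

```python
def one_way_anova_F(*groups):
    """F statistic of a one-way ANOVA: ratio of between-group to
    within-group variance."""
    all_x = [x for g in groups for x in g]
    N, k = len(all_x), len(groups)
    grand = sum(all_x) / N
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (N - k))

def kruskal_H(*groups):
    """Kruskal-Wallis H statistic, the rank-based (ordinal-only)
    analogue of the one-way ANOVA. Ties get midranks; the tie
    correction term is omitted for brevity."""
    all_x = sorted(x for g in groups for x in g)
    N = len(all_x)
    rank = {}
    i = 0
    while i < N:  # assign each distinct value its midrank
        j = i
        while j < N and all_x[j] == all_x[i]:
            j += 1
        rank[all_x[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    h = sum(len(g) * (sum(rank[x] for x in g) / len(g)) ** 2
            for g in groups)
    return 12 / (N * (N + 1)) * h - 3 * (N + 1)

# Made-up timings (in arbitrary units) for two code variants.
a = [4.5, 4.6, 4.4, 4.7, 4.5]
b = [5.0, 5.1, 4.9, 5.2, 5.0]
F = one_way_anova_F(a, b)
H = kruskal_H(a, b)
print(F, H)  # roughly F = 48.1, H = 6.82
```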