In the previous article, I wrote about the pitfalls of benchmarking and dumbbench, a tool that is meant to make simple benchmarks more robust.
The most significant improvement over time ./cmd is that it comes with a moderately well-motivated model of the distribution of run times you get from invoking ./cmd many times. In data analysis, it is very important to know the underlying statistical distribution of your measured quantity. I assume most people remember from high school that you can calculate the mean and the standard deviation ("error") of a data set and use those two numbers as an estimate of the true value and a statement of the uncertainty. That is a reasonable thing to do if the measured quantity has a normal (or Gaussian) distribution. Fortunately for everybody, normal distributions are very common: if you add up enough independent random contributions, chances are that the result will be almost Gaussian (the Central Limit Theorem).
Unfortunately for you, when you run a benchmark, you would like it to produce a good answer and finish before the heat death of the universe. That means the friendly Central Limit Theorem doesn't apply cleanly, and we have to put a little more thought into the matter to extract more information. In the second half of the previous article, I suggested a simple recipe for analyzing benchmark data that mostly amounted to this: the main distribution of timings is Gaussian, but a fraction of the data, the outliers, have significantly increased run times. If we discard those, we can calculate the mean and its uncertainty. But I didn't show you actual data from a reasonable benchmark run. Let's fix that:
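To make the recipe concrete, here is a minimal sketch in Python of that kind of robust analysis: reject samples far above the median, then compute the mean and its standard error on what remains. This is an illustration of the idea, not dumbbench's actual (Perl) implementation, and the cutoff of 3 robust deviations is an assumption I picked for the sketch:

```python
import statistics

def robust_mean(timings, cutoff=3.0):
    """Reject slow outliers, then report mean and its standard error.

    A sketch of the recipe described above, not dumbbench's exact
    algorithm; the cutoff of 3 MADs is a made-up choice.
    """
    med = statistics.median(timings)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(t - med) for t in timings)
    # Keep the main (roughly Gaussian) bulk of the distribution.
    kept = [t for t in timings if abs(t - med) <= cutoff * mad]
    mean = statistics.fmean(kept)
    # Standard error of the mean: stdev / sqrt(n).
    err = statistics.stdev(kept) / len(kept) ** 0.5
    return mean, err, len(timings) - len(kept)
```

Feed it a list of per-run timings and it returns the estimate, its uncertainty, and how many samples were rejected.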
I invoked dumbbench as follows:
dumbbench -p 0.001 --code='local $i; $i++ for 1..1e6' --code='local $i; $i++ for 1..1.1e6' --code='local $i; $i++ for 1..1.2e6' --plot_timings
With -p 0.001, I'm asking for a relative uncertainty of at most 0.1%. The command runs three benchmarks: code1, code2, and code3. They're all the same except that code2 runs 10% more iterations than code1 and code3 runs 20% more. I would expect the resulting run times to be related in a similar fashion. Here is the output of the run:
Ran 544 iterations of the command.
Rejected 53 samples as outliers.
Rounded run time per iteration: 4.5851e-02 +/- 4.6e-05 (0.1%)
Ran 346 iterations of the command.
Rejected 25 samples as outliers.
Rounded run time per iteration: 5.0195e-02 +/- 5.0e-05 (0.1%)
Ran 316 iterations of the command.
Rejected 18 samples as outliers.
Rounded run time per iteration: 5.4701e-02 +/- 5.4e-05 (0.1%)
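As a sanity check, the ratios of those central values (with the uncertainty of each ratio propagated from the two relative errors, assuming independent measurements) can be computed in a few lines. The numbers are copied from the output above:

```python
# Run times per iteration from the dumbbench output above, in seconds.
t1, e1 = 4.5851e-02, 4.6e-05
t2, e2 = 5.0195e-02, 5.0e-05
t3, e3 = 5.4701e-02, 5.4e-05

def slowdown(t, e, t_ref, e_ref):
    """Percent slowdown of t vs. t_ref, with the ratio's uncertainty
    propagated in quadrature from the two relative errors."""
    ratio = t / t_ref
    err = ratio * ((e / t) ** 2 + (e_ref / t_ref) ** 2) ** 0.5
    return (ratio - 1) * 100, err * 100

print("code2 vs code1: %.1f%% +/- %.2f%%" % slowdown(t2, e2, t1, e1))
print("code3 vs code1: %.1f%% +/- %.2f%%" % slowdown(t3, e3, t1, e1))
```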
A little calculation shows that code2 takes 9.5% longer than code1 and code3 takes 19.3% longer. Fair enough. Since I installed the SOOT module, the --plot_timings option will pop up a bunch of windows with plots for my amusement. Here are the timing distributions for code1 and code2:
Clearly, the two look qualitatively similar, but note the slightly different scale on the x axis. There is good news and bad news. The good news is that there is indeed a main distribution plus a bunch of outliers, so getting rid of the outliers would be a win. The implemented procedure does that fairly well, though it's a bit too strict. The bad news is that the main distribution isn't entirely Gaussian. A better fit might have been the convolution of a Gaussian with an exponential, but I digress.
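Incidentally, that "Gaussian convolved with an exponential" shape (an exponentially modified Gaussian) is easy to simulate, because the sum of a normal and an exponential random variable has exactly that distribution: a symmetric core plus a one-sided slow tail, much like intrinsic run time plus occasional scheduling delays. A toy sketch with made-up parameters:

```python
import random
import statistics

random.seed(42)

# Exponentially modified Gaussian: normal "core" (intrinsic run time with
# symmetric noise) plus an exponential tail (one-sided delays). All three
# parameters below are invented for illustration.
core_mu, core_sigma, tail_mean = 0.0458, 0.0004, 0.0008
samples = [random.gauss(core_mu, core_sigma) + random.expovariate(1 / tail_mean)
           for _ in range(10_000)]

# The one-sided tail drags the mean above the median: the same asymmetry
# that shows up in the timing plots.
print(statistics.fmean(samples) > statistics.median(samples))
```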
Let me use the digression as an excuse for another digression, MJD-style. brian d foy's comment on the previous entry reminded me of a convenient non-parametric way of comparing samples: the box and whisker plot:
I don't think I could explain it better than the Wikipedia article linked above, but here's a summary: for each of the three benchmarks, the respective gray box contains exactly half of the data. That is, if you cut the distribution into three chunks (the lowest 25%, the middle 50%, and the upper 25%), the box covers the middle part. The big black marker inside the box is the median of the distribution. The "error bars" (whiskers) stretch from the edges of the box (i.e. from 25% of the data on either side of the median) to the largest (or smallest) datum that is not an outlier. Here, outliers are defined as data points farther away from the box than 1.5 times the height of the box.
At a glance, we can see that the whiskers are asymmetric and that there are a lot of outliers on one side. It's an effective way to quickly compare several distributions.
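The same summary numbers the plot encodes can be computed directly. Here is a sketch of the rules just described (box at the quartiles, whiskers to the most extreme data within 1.5 box-heights, everything beyond that flagged as an outlier), with a simple interpolating quantile for illustration:

```python
def box_plot_summary(data):
    """Median, quartiles, whisker ends, and outliers per the rules above:
    the box spans Q1..Q3, whiskers reach the most extreme data within
    1.5 * IQR of the box, and everything beyond is an outlier."""
    xs = sorted(data)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between adjacent order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        frac = pos - lo
        if lo + 1 >= n:
            return xs[lo]
        return xs[lo] * (1 - frac) + xs[lo + 1] * frac

    q1, med, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1  # the "height of the box"
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    in_range = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    # median, box edges, whisker ends, outliers
    return med, q1, q3, in_range[0], in_range[-1], outliers
```

Run over benchmark timings, the asymmetric whiskers and the pile of high-side outliers fall straight out of the returned numbers.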
Back on topic: the above example benchmarked fairly long-running code. Programmers often idly wonder whether some tiny bit of code will be faster than another. That is much harder to benchmark: the shorter the benchmark run, the larger the relative effect of small disturbances. The best solution, of course, is to change your benchmark to take longer. I'll try to write about the pain of benchmarking extremely short-running pieces of code next time.