The physicist's way out
Previously, I wrote about modeling the result of repeated benchmarks. It turns out that this isn't easy. Different effects are important when you benchmark run times of different magnitudes. The previous example ran for about 0.05 seconds. That's an eternity for computers. Can a simple model cover the result of such a benchmark as well as that of a run time on the order of microseconds? Is it possible to come up with a simple model for any case at all? The typical physicist's way of testing a model for data is to write a simulation. It's quite likely a model has some truth if we can generate fake data sets from the model that look like the original, real data. For reference, here is the real data that I want to reproduce (more or less):
So I wrote a toy Monte Carlo simulation. The code is available from the dumbbench github repository in the folder simulator. Recall that the main part of the model was that I assume a normally distributed measurement around the true run time with an added set of outliers which are biased to much longer run times. That is what this MC does: For every timing, draw a random number from a normal distribution around the true value and in a fraction of all cases, add an offset (again with an uncertainty) to make it an outlier. With some fine tuning of the parameters, I get as close as this:
Yes, I know it's not the same thing. Humans are excellent at telling things apart that aren't exactly equal. But don't give up on me just yet. What you see in the picture is three lines: The black is mostly covered by the others. It's the raw distribution of times in the Monte Carlo. The red curve is the set of timings that were accepted for calculating the expectation value by the Dumbbench algorithm. The blue timings were discarded as outliers.
The simulation reproduces quite a few properties fairly well by construction: The main distribution is in the right spot and has the right width if a bit narrow. The far outliers have about the same distribution. The one striking difference is that in the real data, the main distribution isn't really following a Gaussian. It's skewed. I could try to sample from a different distribution in the simulation, but let's keep the Gaussian for a while since that's an underlying assumption of the analysis. Here's the output of the simulation:
Detected a required 1 iterations per run. timings: 346 good timings: 319 outlier timings: 27 true time: 5.e-2 before correction: 5.0005e-02 +/- 3.1e-05 (mean: 0.0506163970895954) after correction: 4.9973e-02 +/- 2.9e-05 (mean: 0.0499712054670846)
correction refers to the outlier rejection done by dumbbench. Clearly, it's not a huge deal in this case. Even the uncorrected mean would have been acceptable since the fraction of outliers is so low. But this was an optimal case. Long benchmark duration, but not so long that I couldn't conveniently accumulate some data. What if I want to benchmark
++$i and see if it's any faster than the post-increment
$i++? Let's ignore the comparison for now and just look at the data I get from benchmarking the post-increment. I run dumbbench with 100000 timings, skip the dry-run subtraction, and care neither about optimizing the absolute nor relative precision:
perl -Ilib bin/dumbbench -i 100000 --no-dry-run -a 0 -p 0.99999 --code='$i++' --plot_timings
Ran 121550 iterations (21167 outliers). Rounded run time per iteration: 4.23978e-06 +/- 1.4e-10 (0.0%) [disregard the errors on this one]
Woah! Rats, what's that? This graph shows a lot of extra complications. Most prominently, the measurement of the time is done in discrete units. That's not terribly surprising since the computer has a finite frequency. The hi-res walltime clock seems to have a clock tick of about 30ns on this machine. Another thing to note is that my computer can certainly increment Perl variables more than a million times per second, so the timing is significantly off. This is because dumbbench will go through some extra effort to run the benchmark in a clean environment. There is also the overhead of taking the time before and after running the code. This is why normally, dumbbench will subtract a (user-configurable) dry run from the result and propagate the uncertainties for you. On the upside, the main distribution looks (overall) much more Gaussian than in the long-running benchmark. Let's add the discretization effect to our model and try to simulate this data:
Considering the simplicity of what I'm putting in, this isn't all that bad! Let's see how well dumbbench can recover the true time:
Detected a required 32 iterations per run. timings: 120000 good timings: 92373 outlier timings: 27627 true time: 4.25e-6 before correction: 4.247000e-06 +/- 8.9e-11 (mean: 4.31880168333532e-06) after correction: 4.24700e-06 +/- 1.0e-10 (mean: 4.25444989336903e-06)
Again, the correction isn't important. But in this case, that is mostly due to the discretization of the measurement. If there's a lot of measurements at x and at x+1 but none in between, then the median can't get any closer. If you take a look at the mean before and after correction, you can see that the outlier rejection was indeed effective. It significantly reduced the bias of the mean.
From this little experiment, I deduce that while the simple model is clearly not perfect (remember the skew of the main distribution), it isn't entirely off and works more or less across radically different conditions. Furthermore, using the model to simulate benchmarks with known "true" time, I saw that the analysis produces a good estimate of the input. It's just a toy, but it's served its purpose. I'm more confident in the setup now than before -- even without diving very far into statistics.