Announcement for Sereal, a binary data serialization format, finally live!

It's been long in the making, but finally, I've gotten the Sereal announcement article into a shape that I felt somewhat comfortable publishing. Designing and implementing Sereal was a true team effort and we really hope to see non-Perl implementations of it in the future. We're virtually committed to finishing the Java decoder, at least for our data-warehousing infrastructure. Any help and cooperation is welcome, as are patches to improve the actual text of the specification (which is still kind of a weak point).

By the way, for those who worried about the lack of a comment system on the Booking.com dev blog before, we've added Disqus support.

But now, I'm just glad it's out there!

Booking.com dev blog goes live!

I'm proud to echo the announcement that the Booking.com dev blog has just gone live. Quoting the announcement:

Booking.com is an online hotel reservations company founded during the hey-days of the dot com era in the 90s. The product offering was initially limited to just the Dutch market. We grew rapidly to expand our offerings to include 240,000+ accommodations in 171 countries used by millions of unique visitors every month - numbers which continue to grow every single day. With such growth come interesting problems of scalability, design and localisation which we love solving every day.

The blog is kicked off with just a quick, humble article of mine on a debugging module that I published after needing the functionality at work. In a given code location, it allows you to find where in the code base the current set of signal handlers was set up. We plan to publish new content regularly and have a few interesting stories already lined up. So stay tuned!

New Data::Dumper release: 50% faster

Data::Dumper version 2.136 was just uploaded to CPAN. It's been over a year since the last stable release of the module. Generally, I just synchronize changes to the module from the Perl core to CPAN releases and do so very carefully with lots of development releases.

Recently, however, there was a reason to look at Data::Dumper performance critically. A very simple change meant a speed-up of the order of 50% on my test data set. In a nutshell, Data::Dumper used to track each and every value in the data structure just in case you were going to want to use the Seen functionality. That pertains to a tiny fraction of all Data::Dumper uses and everybody was having to pay for it. For example, if you're using the functional interface (like most), then you wouldn't even ever get access to that information, yet everything was being tracked instead of just things with high reference counts.

With Data::Dumper 2.136, the functional interface has become faster unconditionally. If you use the OO interface, you may be one of the few people that care about the old Seen feature. That means you have to opt in to the new optimization by setting the Sparseseen option of the object. If you do, the Seen hash will be useless. Alternatively, you can globally enable the optimization by setting $Data::Dumper::Sparseseen = 1.

At the same time, the new release ports several bug fixes from the Perl core. Unfortunately, some of those changes turned out to be incompatible with older versions of Perl. More specifically, it appears that there is one vstring-related change that breaks some vstring tests on 5.8. I don't currently have the time to investigate. If you are affected by this, why don't you step up and help restore full compatibility?

A big thanks to my employer, Booking.com, for letting me spend work time on this optimization.

The physicist's way out

Previously, I wrote about modeling the result of repeated benchmarks. It turns out that this isn't easy. Different effects are important when you benchmark run times of different magnitudes. The previous example ran for about 0.05 seconds. That's an eternity for computers. Can a simple model cover the result of such a benchmark as well as that of a run time on the order of microseconds? Is it possible to come up with a simple model for any case at all? The typical physicist's way of testing a model for data is to write a simulation. It's quite likely a model has some truth if we can generate fake data sets from the model that look like the original, real data. For reference, here is the real data that I want to reproduce (more or less):

slow benchmark

So I wrote a toy Monte Carlo simulation. The code is available from the dumbbench GitHub repository in the simulator folder. Recall that the main part of the model was that I assume a normally distributed measurement around the true run time, with an added set of outliers that are biased towards much longer run times. That is what this MC does: for every timing, draw a random number from a normal distribution around the true value and, in a fraction of all cases, add an offset (again with an uncertainty) to make it an outlier. With some fine tuning of the parameters, I get as close as this:
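As a rough illustration of that recipe, here is a minimal sketch in Python. It is not the simulator from the repository: the function name simulate_timings and every parameter value below are made up for the example, and the outlier offset is simply forced to be positive.

```python
import random
import statistics


def simulate_timings(n, true_time, sigma, outlier_fraction,
                     outlier_offset, outlier_sigma, seed=None):
    """Draw n fake timings: Gaussian core plus a fraction of slow outliers."""
    rng = random.Random(seed)
    timings = []
    for _ in range(n):
        # Normally distributed measurement around the true run time.
        t = rng.gauss(true_time, sigma)
        # In a fraction of all cases, add a positive offset (itself uncertain).
        if rng.random() < outlier_fraction:
            t += abs(rng.gauss(outlier_offset, outlier_sigma))
        timings.append(t)
    return timings


# All parameter values are illustrative guesses, not the tuned ones.
timings = simulate_timings(n=346, true_time=5e-2, sigma=5e-4,
                           outlier_fraction=0.08, outlier_offset=5e-3,
                           outlier_sigma=2e-3, seed=1)
print("mean:  ", statistics.mean(timings))
print("median:", statistics.median(timings))
```

Because the outliers only ever add time, the mean of such a sample is pulled upwards while the median stays close to the true value.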

slow toy MC

Yes, I know it's not the same thing. Humans are excellent at telling things apart that aren't exactly equal. But don't give up on me just yet. What you see in the picture is three lines: The black is mostly covered by the others. It's the raw distribution of times in the Monte Carlo. The red curve is the set of timings that were accepted for calculating the expectation value by the Dumbbench algorithm. The blue timings were discarded as outliers.

The simulation reproduces quite a few properties fairly well by construction: The main distribution is in the right spot and has roughly the right width, if a bit narrow. The far outliers have about the same distribution. The one striking difference is that in the real data, the main distribution isn't really following a Gaussian. It's skewed. I could try to sample from a different distribution in the simulation, but let's keep the Gaussian for a while since that's an underlying assumption of the analysis. Here's the output of the simulation:

Detected a required 1 iterations per run.
timings:           346
good timings:      319
outlier timings:   27
true time:         5.e-2
before correction: 5.0005e-02 +/- 3.1e-05 (mean: 0.0506163970895954)
after correction:  4.9973e-02 +/- 2.9e-05 (mean: 0.0499712054670846)

Here, "correction" refers to the outlier rejection done by dumbbench. Clearly, it's not a huge deal in this case. Even the uncorrected mean would have been acceptable since the fraction of outliers is so low. But this was an optimal case: a long benchmark duration, but not so long that I couldn't conveniently accumulate some data. What if I want to benchmark ++$i and see if it's any faster than the post-increment $i++? Let's ignore the comparison for now and just look at the data I get from benchmarking the post-increment. I run dumbbench with 100000 timings, skip the dry-run subtraction, and care about optimizing neither the absolute nor the relative precision:

perl -Ilib bin/dumbbench -i 100000 --no-dry-run -a 0 -p 0.99999 --code='$i++' --plot_timings

short benchmark distribution

Ran 121550 iterations (21167 outliers).
Rounded run time per iteration: 4.23978e-06 +/- 1.4e-10 (0.0%)
[disregard the errors on this one]

Whoa! Rats, what's that? This graph shows a lot of extra complications. Most prominently, the measurement of the time is done in discrete units. That's not terribly surprising since the computer has a finite frequency. The hi-res walltime clock seems to have a clock tick of about 30ns on this machine. Another thing to note is that my computer can certainly increment Perl variables more than a million times per second, so the timing is significantly off. This is because dumbbench will go through some extra effort to run the benchmark in a clean environment. There is also the overhead of taking the time before and after running the code. This is why normally, dumbbench will subtract a (user-configurable) dry run from the result and propagate the uncertainties for you. On the upside, the main distribution looks (overall) much more Gaussian than in the long-running benchmark. Let's add the discretization effect to our model and try to simulate this data:
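One simple way to model such a discretization effect is sketched below in Python. This is an assumption about how a ticking clock could be simulated, not a description of the actual simulator: the helper measure_with_clock, the 30ns tick and the 60ns spread are all illustrative.

```python
import math
import random

CLOCK_TICK = 30e-9  # assumed tick length, taken from the ~30ns seen above


def measure_with_clock(duration, rng, tick=CLOCK_TICK):
    """Return the duration as seen by a clock that only advances in ticks."""
    # The benchmark starts at a random phase within a clock tick.
    start = rng.uniform(0.0, tick)
    # The clock reports whole ticks only, at the start and at the end.
    ticks = math.floor((start + duration) / tick) - math.floor(start / tick)
    return ticks * tick


rng = random.Random(1)
# Illustrative numbers: true time 4.25e-6 s with a 60 ns Gaussian spread.
measured = [measure_with_clock(rng.gauss(4.25e-6, 60e-9), rng)
            for _ in range(10000)]
print("distinct values:", len(set(measured)))
print("mean:", sum(measured) / len(measured))
```

Every simulated timing is now a whole number of ticks, so a histogram of the values shows isolated spikes rather than a smooth curve, while the average over many timings still lands close to the true time.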

fast toy MC

Considering the simplicity of what I'm putting in, this isn't all that bad! Let's see how well dumbbench can recover the true time:

Detected a required 32 iterations per run.
timings:           120000
good timings:      92373
outlier timings:   27627
true time:         4.25e-6
before correction: 4.247000e-06 +/- 8.9e-11 (mean: 4.31880168333532e-06)
after correction:  4.24700e-06 +/- 1.0e-10 (mean: 4.25444989336903e-06)

Again, the correction isn't important. But in this case, that is mostly due to the discretization of the measurement. If there are a lot of measurements at x and at x+1 but none in between, then the median can't get any closer. If you take a look at the mean before and after correction, you can see that the outlier rejection was indeed effective. It significantly reduced the bias of the mean.

From this little experiment, I deduce that while the simple model is clearly not perfect (remember the skew of the main distribution), it isn't entirely off and works more or less across radically different conditions. Furthermore, using the model to simulate benchmarks with known "true" time, I saw that the analysis produces a good estimate of the input. It's just a toy, but it's served its purpose. I'm more confident in the setup now than before -- even without diving very far into statistics.

Hard data for timing distributions

In the previous article, I wrote about the pitfalls of benchmarking and about dumbbench, a tool that is meant to make simple benchmarks more robust.

The most significant improvement over time ./cmd is that it actually comes with a moderately well motivated model of the time distribution of invoking ./cmd many times. In data analysis, it is very important to know the underlying statistical distribution of your measurement quantity. I assume that most people remember from high school that you can calculate the mean and the standard deviation ("error") of data and use those two numbers as an estimate of the true value and a statement of the uncertainty. That is a reasonable thing to do if the measurement quantity has a normal (or Gaussian) distribution. Fortunately for everybody, normal distributions are very common because if you add up enough statistics, chances are that the result will be almost Gaussian (Central Limit Theorem).

Unfortunately for you, when you run a benchmark, you would like it to produce a good answer and finish before the heat death of the universe. That means the friendly Central Limit Theorem doesn't apply cleanly and we have to put a little more thought into the matter to extract more information. In the second half of the previous article, I suggested a simple recipe for analyzing benchmark data that mostly amounted to: The main distribution of timings is Gaussian, but there is a fraction of the data, the outliers, that have significantly increased run time. If we discard those, we can calculate mean and uncertainty. But I didn't show you actual data of a reasonable benchmark run. Let's fix that:
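To make the recipe concrete, here is a minimal Python sketch of one common way to do such a rejection: cut on the distance from the median, measured in units of the median absolute deviation (MAD). The function name reject_outliers, the 3-sigma cut and the sample timings are assumptions for this illustration, and the sketch makes no claim to match what dumbbench does internally.

```python
import math
import statistics


def reject_outliers(timings, n_sigma=3.0):
    """Split timings into (good, outliers) using a median/MAD cut."""
    med = statistics.median(timings)
    mad = statistics.median([abs(t - med) for t in timings])
    # For Gaussian data, 1.4826 * MAD estimates the standard deviation.
    sigma_est = 1.4826 * mad
    good = [t for t in timings if abs(t - med) <= n_sigma * sigma_est]
    outliers = [t for t in timings if abs(t - med) > n_sigma * sigma_est]
    return good, outliers


# Made-up timings in seconds: six well-behaved ones and two slow ones.
timings = [0.0500, 0.0501, 0.0499, 0.0502, 0.0498, 0.0500, 0.0571, 0.0630]
good, outliers = reject_outliers(timings)

mean = statistics.mean(good)
# Uncertainty of the mean: standard deviation over sqrt(number of timings).
uncertainty = statistics.stdev(good) / math.sqrt(len(good))
print("rejected:", outliers)
print("result: %.5e +/- %.1e" % (mean, uncertainty))
```

The median and the MAD are used for the cut because, unlike the mean and the standard deviation, they are barely moved by the outliers that the cut is supposed to find.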

I ran dumbbench as follows:

dumbbench -p 0.001 --code='local $i; $i++ for 1..1e6' --code='local $i; $i++ for 1..1.1e6' --code='local $i; $i++ for 1..1.2e6' --plot_timings

With -p 0.001, I'm saying that I want at most an uncertainty of 0.1%. It runs three benchmarks: code1, 2, and 3. They're all the same except that code 2 runs 10% more iterations than code 1 and code 3 runs 20% more iterations. I would expect the resulting run times to be related in a similar fashion. Here is the output of the run:

  Ran 544 iterations of the command.
  Rejected 53 samples as outliers.
  Rounded run time per iteration: 4.5851e-02 +/- 4.6e-05 (0.1%)
  Ran 346 iterations of the command.
  Rejected 25 samples as outliers.
  Rounded run time per iteration: 5.0195e-02 +/- 5.0e-05 (0.1%)
  Ran 316 iterations of the command.
  Rejected 18 samples as outliers.
  Rounded run time per iteration: 5.4701e-02 +/- 5.4e-05 (0.1%)

A little calculation shows that code2 takes 9.5% longer than code1 and code3 19.3%. Fair enough. Since I installed the SOOT module, the --plot_timings option will pop up a bunch of windows with plots for my amusement. Here are the timing distributions for code1 and code2:
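For reference, the little calculation can be spelled out in a few lines of Python, using the three results printed above. The uncertainty on each ratio is added here as an illustration and assumes that the three measurements are independent, so that relative uncertainties add in quadrature; the helper name ratio_with_uncertainty is made up.

```python
import math


def ratio_with_uncertainty(a, sigma_a, b, sigma_b):
    """Return a/b and its uncertainty, assuming independent measurements."""
    ratio = a / b
    # Relative uncertainties of a quotient add in quadrature.
    sigma = ratio * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return ratio, sigma


# Run times and uncertainties from the dumbbench output above.
code1 = (4.5851e-02, 4.6e-05)
code2 = (5.0195e-02, 5.0e-05)
code3 = (5.4701e-02, 5.4e-05)

r2, s2 = ratio_with_uncertainty(*code2, *code1)
r3, s3 = ratio_with_uncertainty(*code3, *code1)
print("code2 vs code1: +%.1f%% (+/- %.2f)" % ((r2 - 1) * 100, s2 * 100))
print("code3 vs code1: +%.1f%% (+/- %.2f)" % ((r3 - 1) * 100, s3 * 100))
```

The differences come out as +9.5% and +19.3%, slightly below the 10% and 20% increase in loop iterations.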

base benchmark

base benchmark + 10%

Clearly, the two look qualitatively similar, but note the slightly different scale on the x axis. There is good news and bad news. The good news is that indeed, there is a main distribution and a bunch of outliers. Clearly, getting rid of the outliers would be a win. The implemented procedure does that fairly well, but it's a bit too strict. The bad news is that the main distribution isn't entirely Gaussian. A better fit may have been a convolution of a Gaussian and an exponential, but I digress.

Let me use the digression as an excuse for another, MJD-style. brian d foy's comment on the previous entry reminded me of a convenient non-parametric way of comparing samples. The box and whisker plot:

box plot

I don't think I could explain it better than the Wikipedia article linked above, but here's a summary: For each of the three benchmarks, the respective gray box includes exactly half of the data. That is, if you cut the distribution into three chunks (the lowest 25%, the middle 50%, and the upper 25%), then the box includes the middle part. The big black marker in the box is the median of the distribution. The "error bars" (whiskers) stretch from the end of the box (i.e. 25% of data from either side of the median) to the largest (or smallest) datum that is not an outlier. Here, outliers are defined as data that are further away from the box than 1.5 times the height of the box.
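The numbers behind such a plot are easy to compute by hand. The following Python sketch does so for a made-up sample of ten timings in milliseconds. The function name box_plot_stats is hypothetical, and the quartile convention (the "inclusive" method of statistics.quantiles, Python 3.8 or later) is just one of several in use, so plotting tools may place the box edges slightly differently.

```python
import statistics


def box_plot_stats(data):
    """Compute box, median, whiskers and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    height = q3 - q1  # height of the box (interquartile range)
    low_fence = q1 - 1.5 * height
    high_fence = q3 + 1.5 * height
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "box": (q1, q3),
        "median": median,
        # Whiskers end at the most extreme data that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Made-up timings in milliseconds with one slow run.
timings_ms = [45, 46, 46, 47, 47, 48, 48, 49, 50, 62]
stats = box_plot_stats(timings_ms)
print(stats)
```

For this sample, the box spans 46.25 to 48.75 with the median at 47.5, the whiskers reach from 45 to 50, and the single slow run at 62 is flagged as an outlier.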

At a glance, we can see that the whiskers are asymmetric and there are a lot of outliers on one side. It's an effective way of quickly comparing several distributions.

Back on topic: The above example benchmarked fairly long running code. A lot of times, programmers idly wonder whether some tiny bit of code will be faster than another. This is much harder to benchmark since the shorter the benchmark run, the larger the effect of small disturbances. The best solution is to change your benchmark to take longer, of course. I'll try to write about the pain of benchmarking extremely short-duration pieces of code next time.