Measuring Rakudo performance

You can now create "statistics" about Rakudo performance (perhaps to augment your "lies" and "damn lies" ;-) using new functionality added to the 'tools/test_summary.pl' script. It appears under the title "Simple Relative Benchmarking" and needs at least two runs to give usable data. Please try it, and compare findings in #perl6.

With up to about 2% additional run time, test_summary gets Rakudo's Test.pm to emit per-test timestamps to microsecond resolution. test_summary then calculates the (real) execution time of each of the more than 30000 spectests and inserts the results into 'docs/test_summary.times' in JSON format. That file accumulates the results of the five most recent complete runs in FIFO fashion, occupying about 2.5 MB.
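
To illustrate the accumulation, here is a minimal Perl 5 sketch of that FIFO step, assuming the file decodes to a hash mapping test names to arrays of times; the real layout used by test_summary.pl may well differ:

    #!/usr/bin/perl
    # Sketch only: keep the five most recent times per test, dropping the
    # oldest first. The hash-of-arrays layout is an assumption, not
    # necessarily the file's real format.
    use strict;
    use warnings;
    use JSON;

    sub add_run {
        my ($file, $new_times) = @_;       # $new_times: { 'test name' => seconds }
        my $data = {};
        if (-e $file) {
            local $/;                      # slurp the whole file
            open my $in, '<', $file or die "cannot read $file: $!";
            $data = decode_json(<$in>);
            close $in;
        }
        for my $test (keys %$new_times) {
            my $runs = $data->{$test} //= [];
            push @$runs, $new_times->{$test};
            shift @$runs while @$runs > 5; # FIFO: only the five newest remain
        }
        open my $out, '>', $file or die "cannot write $file: $!";
        print {$out} encode_json($data);
        close $out;
    }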

The times make interesting reading, but with the usual caution not to conclude too much from benchmarks. Before these details were available, Rakudo developers would compare only the total execution time of an entire batch of tests. The total time was generally stable, and changes up or down of less than 3% were usually ignored. Bigger changes were mainly attributed to refactoring and optimization within Parrot. It was therefore surprising to discover that individual test timings were much more volatile than that, even with roughly the same total batch time.

With five samples to compare, some tests are very consistent in their execution times and others are very erratic. External factors such as CPU time spent in other processes are certainly responsible for some fluctuation (it's only "real" or wallclock time). But extended observation of more than 5 runs on an otherwise quiescent computer indicates that internally, execution sometimes "freezes" for up to a few seconds. These "spikes" in execution time seem to have nothing to do with the test itself, because if the run is repeated with identical software, different tests show the effect. There is a general ceiling for the duration of these random delays, and it seems to be unrelated to the typical time taken by the test.

There is also a fairly solid floor, below which test execution time never drops. This is only partly due to the overhead of logging the timestamps. You can see how much by exporting the PERL6_TEST_TIMES variable with any value, and then running a single test script directly with Rakudo. The difference between the first two consecutive timestamps at the top of the TAP output is the logging overhead.
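
Here is a small Perl 5 sketch of that check, assuming the timestamps appear as plain seconds.microseconds numbers near the top of the TAP stream (the actual format may differ, so treat the regex as a placeholder):

    #!/usr/bin/perl
    # Sketch only: run one spectest with timestamping enabled and report
    # the gap between the first two timestamps. The timestamp pattern
    # matched below is a guess; adjust it to whatever Test.pm actually
    # emits.
    use strict;
    use warnings;

    my $script = shift or die "usage: $0 <test-script>\n";
    $ENV{PERL6_TEST_TIMES} = 1;

    my @stamps;
    open my $tap, '-|', 'perl6', $script or die "cannot run perl6: $!";
    while (<$tap>) {
        push @stamps, $1 if /(\d+\.\d{6})/;   # assumed: seconds.microseconds
        last if @stamps == 2;
    }
    close $tap;

    printf "logging overhead: %.6f seconds\n", $stamps[1] - $stamps[0]
        if @stamps == 2;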

The major part of the minimum test execution time is probably spent in ok() and its siblings in Test.pm. They all end up calling proclaim(), which also conditionally outputs the timestamps. Some more work could be done in this area, perhaps logging additional data such as the minimum test execution time and the minimum executable startup time. By subtracting such minima from the test times, the differences between the remaining values would stand out more clearly.
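
As a rough illustration of that last idea, here is a Perl 5 sketch that subtracts the smallest observed time from every timing, using invented sample data and an assumed hash-of-arrays layout:

    #!/usr/bin/perl
    # Sketch only: treat the smallest time seen anywhere as fixed overhead
    # and subtract it from every timing. The test names and numbers below
    # are invented.
    use strict;
    use warnings;
    use List::Util qw(min);

    my %times = (
        'numeric.t test 3' => [ 0.084, 0.081, 0.132, 0.080, 0.083 ],
        'assign.t test 7'  => [ 0.091, 0.090, 0.089, 0.094, 0.090 ],
    );

    my $floor = min map { @$_ } values %times;
    for my $test (sort keys %times) {
        my @adjusted = map { sprintf '%.6f', $_ - $floor } @{ $times{$test} };
        print "$test  @adjusted\n";
    }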

In order to navigate so much data more easily, there are two reports in the POD at the end of test_summary.pl. They are written in Perl 5, because Rakudo became glacial when trying to do something like this. After you have run at least two test_summary batches, paste the report code into temporary scripts and view the output (also about 2.5 MB) in a pager or editor.

The first report addresses one of the main goals of this benchmarking effort: showing the change in execution time for each test between the last two test runs. The idea was to identify the tests that sped up or slowed down the most, presumably because of recent software changes. Alas, too much randomness in the timings has muddled the outcome so far.
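
For the curious, the core of that comparison can be sketched in Perl 5 along these lines (not the actual report code, and the assumed file layout may not match):

    #!/usr/bin/perl
    # Sketch only: per-test difference between the two most recent runs,
    # assuming the times file decodes to a hash of test name => array of
    # times with the newest run last (the real layout may differ).
    use strict;
    use warnings;
    use JSON;

    local $/;
    open my $fh, '<', 'docs/test_summary.times' or die "cannot open times file: $!";
    my $times = decode_json(<$fh>);
    close $fh;

    my @deltas;
    for my $test (keys %$times) {
        my @t = @{ $times->{$test} };
        push @deltas, [ $test, $t[-1] - $t[-2] ] if @t >= 2;
    }

    # Biggest slowdowns first, biggest speedups last.
    printf "%+10.6f  %s\n", @{$_}[1, 0]
        for sort { $b->[1] <=> $a->[1] } @deltas;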

The second report analyses how much the timings for each test vary. It extracts the minimum and maximum times, and sorts all the results by percentage difference. After five runs, the median test (almost number 16000 out of 32000) shows a difference of just over 25%. There are about 400 tests out of 32000 (about 1.2%) where the maximum time is over twice the minimum. Fortunately, at the other end the same number of tests have five timings within 2.8% of each other, showing good stability.
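
The spread measure itself is simple enough to show on a few invented numbers:

    #!/usr/bin/perl
    # Sketch only: the spread measure on one set of invented timings --
    # the percentage difference between the slowest and fastest of five
    # samples.
    use strict;
    use warnings;
    use List::Util qw(min max);

    my @timings = ( 0.081, 0.084, 0.080, 0.132, 0.083 );
    my ($lo, $hi) = ( min(@timings), max(@timings) );
    printf "spread: %.1f%%\n", 100 * ($hi - $lo) / $lo;   # prints "spread: 65.0%"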

Is it too hasty a conclusion to speculate that garbage collection is behind these random delays? Some memory usage profiling about a month ago pointed to that effect, and whilst some memory leaks have been plugged since, the main use of memory is basically unchanged.

Absolute figures are not as important as the differences (hence the term relative benchmarking). For what it's worth, these statistics come from a 1.6GHz Intel Atom processor with hyperthreading enabled, 1GB of memory and a 16GB solid state flash drive (basically an eeePC netbook) running Ubuntu Linux 10.04. A second system with more than twice the memory and speed corroborated the results.

This timing software was added to associate performance changes with updates in Rakudo and Parrot. So far it has failed to do that. The "lesson learned" instead is that the software being monitored is very complex and therefore difficult to pin down, performance-wise.

Some other Rakudo adopters may be able to extend or improve the work done to date. One idea not yet implemented is to filter out the irregular spikes in execution times as noise, and then rewrite the first report with cleaner data. It might be an interesting subject for one of moritz_++'s weekly challenges.
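
One way such a filter might look, sketched in Perl 5 on invented data, is to discard each test's single slowest sample on the assumption that at most one of the five timings is a spike:

    #!/usr/bin/perl
    # Sketch only: discard each test's single slowest timing as noise
    # before comparing runs. The test name and numbers are invented.
    use strict;
    use warnings;

    my %times = (
        'spiky.t test 12' => [ 0.081, 0.084, 0.080, 1.932, 0.083 ],
    );

    for my $test (sort keys %times) {
        my @kept = sort { $a <=> $b } @{ $times{$test} };
        pop @kept;                        # drop the slowest of the five samples
        print "$test  @kept\n";
    }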

Comments

Are these benchmark results available anywhere? Is there any sort of comparison between Rakudo and, e.g., Perl 5?

It's indeed hasty to suspect GC for random delays. Barring memory corruption (and randomized addresses as a form of OS-level protection) and any sources of randomness in the code executed (user input, random numbers, non-deterministic numbers of tests run), the GC profile of a program between runs is deterministic.

You could compare the Euler problems in Perl 6 and Perl 5. The former will be *much* slower though. I can't even run the Perl 6 spectests anymore due to it eating up all the resources on my puny machines.
