These benchmarks seem wrong...
Back in the fall of 2013 I began working on a project called C::Blocks. After some very long detours the project is finally coming to fruition. I recently took it for a spin on a benchmark from the benchmarksgame. The results? Let's just say I was very surprised.
I should say a word about C::Blocks before going into the details. This module uses Perl's keyword API. A few of its keywords let you declare C functions, typedefs, and struct definitions, while others make it easy to embed blocks of C code directly into your Perl script. The embedded blocks can use function and struct definitions provided elsewhere, and it's very easy to share functions and other declarations between modules. All of the C code is JIT-compiled by the very fast Tiny C Compiler. The purpose of these benchmarks is to see how code produced by the Tiny C Compiler performs against other implementations.
The full details of the benchmark, including a writeup, are included in the distribution's git repository on GitHub. The collection includes two implementations written using C::Blocks and four written using PDL. I compare against the benchmarksgame's Perl version in the writeup.
Happily, the C::Blocks versions run quickly, beating the Perl implementation by a factor of 10 for system sizes larger than 200.
The big surprise, and the reason I suspect the PDL benchmarks may be flawed, is that I cannot get a PDL implementation to do significantly better than the pure-Perl version. The PDL versions are not multi-threaded while the pure-Perl version is, but that should only account for a factor of 2 or 3. I would have expected the PDL versions to beat the pure-Perl version handily, at least for some range of system sizes, and to be competitive with the C::Blocks versions.
In these benchmarks, the PDL versions don't get anywhere close to as fast as the C::Blocks versions. OK, for the small 200x200 case the fastest PDL implementation is within a factor of 5 of the slowest C::Blocks implementation, but factors of 30 are more typical.
Can you produce a faster PDL implementation? Let me know! I'd like to see it and compare it against my C::Blocks versions.
P.S. If you want to install C::Blocks, you should work with the version on GitHub. I haven't released the latest changes to CPAN just yet. I know it'll install on Ubuntu with perlbrew, and on Windows with Strawberry Perl v5.22. Feel free to file bug reports on GitHub if you try it on something else and it doesn't work out.
From Chris Marshall, who had trouble signing in to blogs.perl.org:
Hi David, nice to see work again on C::Blocks. The mandelbrot algorithm is likely a pathological case for PDL-2.015 for a number of reasons:
(1) PDL computations are basically memory to memory without things like register operations and often the cache is broken for large data,
(2) even the optimized algorithms appear to have a large number of PDL creation operations---very expensive,
(3) PDL can be very efficient in compute and memory for *uniform* computations but here the memory breaks the cache and introduces page faults.
One way to get comparable performance with the current PDL implementation is pretty much the same way you've done in C::Blocks---implement a tight loop that is operation and memory efficient. (Hint: See the documentation for PDL::PP)
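To make the "tight loop" idea concrete, here is a minimal sketch in plain C of a per-pixel Mandelbrot escape test. It is not the code from the C::Blocks distribution or from any PDL::PP implementation. The function names, the 4.0 escape threshold on |z|^2, and the mapping of the grid onto x in [-1.5, 0.5) and y in [-1, 1) are assumptions chosen for illustration. The point is that the whole iteration for one pixel lives in a handful of local doubles, with no intermediate arrays allocated along the way.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if c = cr + i*ci stays within
 * |z|^2 <= 4 for max_iter iterations of z -> z*z + c, else 0.
 * All state is held in local doubles; nothing is allocated. */
int stays_bounded(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; i++) {
        double new_zr = zr * zr - zi * zi + cr;
        double new_zi = 2.0 * zr * zi + ci;
        zr = new_zr;
        zi = new_zi;
        if (zr * zr + zi * zi > 4.0)
            return 0;   /* escaped: not in the set */
    }
    return 1;           /* still bounded after max_iter steps */
}

/* Hypothetical driver: counts the bounded points on an n x n grid.
 * The grid mapping below is an assumption made for this sketch. */
long count_inside(int n, int max_iter)
{
    assert(n > 0);
    long inside = 0;
    for (int y = 0; y < n; y++) {
        double ci = 2.0 * y / n - 1.0;
        for (int x = 0; x < n; x++) {
            double cr = 2.0 * x / n - 1.5;
            inside += stays_bounded(cr, ci, max_iter);
        }
    }
    return inside;
}
```

Calling count_inside(200, 50) would walk a 200x200 grid with at most 50 iterations per point. A whole-array formulation of the same computation instead makes a pass over the full grid for every arithmetic step, which is roughly the memory-to-memory cost described in points (1) through (3).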