My work on fannkuch led me to recommend the optimization for reversing an array onto itself that was added in Perl 5.12 (e.g. @a = reverse @a), so I think that looking at what's slow to figure out what we could make faster in Perl is still a worthwhile exercise.
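For anyone who hasn't seen the idiom, here is a minimal sketch of what "reversing an array onto itself" looks like. The helper reverse_in_place is hypothetical and not from the fannkuch code; it only illustrates the swap-based behaviour one would expect from an in-place reversal, and says nothing about how Perl 5.12 implements the optimization internally.

```perl
use strict;
use warnings;

# The idiom from the comment above: assign the reversed list back
# to the same array.
my @a = (1 .. 5);
@a = reverse @a;

# Hypothetical helper (an assumption for illustration only): reverse the
# referenced array by swapping elements from both ends toward the middle.
sub reverse_in_place {
    my ($array) = @_;
    my ($i, $j) = (0, $#{$array});
    while ($i < $j) {
        @{$array}[$i, $j] = @{$array}[$j, $i];
        $i++;
        $j--;
    }
    return $array;
}

my @b = (1 .. 5);
reverse_in_place(\@b);

print "@a\n";   # 5 4 3 2 1
print "@b\n";   # 5 4 3 2 1
```

Both arrays end up in the same order; the only difference is whether a temporary reversed list has to be built first.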
Sure, maybe I did, but then, maybe - just maybe - I didn't.
Let me be clearer about what I did, so people don't think my measurements were somehow influenced by phases of the moon. I verified my results with a friend who happens to run x86_64 Debian; my machine ran 32-bit at the time. Also, perlbrew made it trivial for me to test across threads/no-threads/10.x/12.x/14.x/etc. The differences between the perls were insignificant, in contrast with what stayed more or less constant: they all ran at about 2x (give or take, depending on threads/no-threads, but around 2x) the speed of the python3 implementation. I generated the input with fasta 5000000 and ... obviously gave the same input to all programs; I surely did not make a larger file for the python3 program :). And sure, it all must've gone very wrong somewhere, because on the regex-dna python3 vs perl comparison, perl is said to be 2x slower, which is, well, about the reverse of what my friend and I both got. Just as you so bluntly expect me to have made a mistake somewhere, I expect that there's a bug in the shootout measurement code and that anyone who runs this particular benchmark should get results similar to mine. All with a "maybe" and a smiley. :)
I also over-micro-optimized a bit. Not stringifying in print only saves a few microseconds, and that is outside the loop.
I'm more worried that all my compiled versions are slower than the pure perl versions.
Especially the perlcc -O typed versions. Not ready for prime time yet.
Some more micro-optimization:
multi + thread is usually not used on linux/bsd.
If something like threads is wanted, then usually Coro is used.
Is this the default on Debian?
The fastest is usually multi without ithreads; some distros also ship with neither ithreads nor multi.
Only windows uses ithreads to emulate fork.
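To answer the "is this the default" question on any given machine, you can ask the perl binary itself how it was built. A small sketch, assuming only the core Config module; build_flag is a hypothetical helper name introduced here for illustration.

```perl
use strict;
use warnings;
use Config;

# %Config holds the options the running perl was configured with.
# Hypothetical helper: return the build-time value for a key, or the
# string 'undef' when the option was not enabled.
sub build_flag {
    my ($key) = @_;
    return defined $Config{$key} ? $Config{$key} : 'undef';
}

for my $key (qw(useithreads usemultiplicity)) {
    printf "%-16s %s\n", $key, build_flag($key);
}
```

The same information is available from the command line with perl -V:useithreads and perl -V:usemultiplicity.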
I compared
http://shootout.alioth.debian.org/u32q/measurements.php?lang=perl
vs.
http://shootout.alioth.debian.org/u32/measurements.php?lang=perl
benchmark              multi   non-multi   (CPU-secs, lower number is faster)
binary-trees          761.37      662.98
fannkuch-redux      3,210.13    2,756.98
fasta                   2.38        2.18
k-nucleotide          239.10      226.75
mandelbrot          4,080.26    3,911.21
meteor-contest         63.30       55.92
n-body              1,733.08    1,563.23
pidigits                6.34        6.04
regex-dna              34.97       30.59
reverse-complement      4.88        4.72
spectral-norm       1,094.59      939.56
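To make the gap easier to read, the CPU-second figures above can be turned into a relative overhead per benchmark. This is only a sketch over the numbers quoted in the table; the helper overhead_pct is a name made up here, and the multi/non-multi labels are taken from the table as posted.

```perl
use strict;
use warnings;

# CPU seconds copied from the table above: [benchmark, multi, non-multi].
my @rows = (
    ['binary-trees',        761.37,  662.98],
    ['fannkuch-redux',     3210.13, 2756.98],
    ['fasta',                 2.38,    2.18],
    ['k-nucleotide',        239.10,  226.75],
    ['mandelbrot',         4080.26, 3911.21],
    ['meteor-contest',       63.30,   55.92],
    ['n-body',             1733.08, 1563.23],
    ['pidigits',              6.34,    6.04],
    ['regex-dna',            34.97,   30.59],
    ['reverse-complement',    4.88,    4.72],
    ['spectral-norm',      1094.59,  939.56],
);

# Hypothetical helper: extra CPU time of the multi build, as a
# percentage of the non-multi time.
sub overhead_pct {
    my ($multi, $plain) = @_;
    return 100 * ($multi - $plain) / $plain;
}

for my $row (@rows) {
    my ($name, $multi, $plain) = @$row;
    printf "%-20s %5.1f%%\n", $name, overhead_pct($multi, $plain);
}
```

Every row comes out positive, so the multi column is slower across the board, by somewhere between roughly 3% and 17% depending on the benchmark.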
Why I thought that multi is faster (but it is not)
If I compare the assembly code for both versions, I see that the multi version uses fast stack access for PL_op, compared to slow heap access for PL_op in the non-multi version.
Compare
http://cpansearch.perl.org/src/RURBAN/Jit-0.04_09/i386thr.c
against
http://cpansearch.perl.org/src/RURBAN/Jit-0.04_09/i386.c
Thanks for updating the benchmarks!