Optimizing compiler benchmarks (part 4)
nbody - More optimizations
In the first part I showed some problems and possibilities of the B::C compiler and the B::CC optimizing compiler with a regexp example, which was a poor candidate for optimization.
In the second part I got 2 times faster run-times with the B::CC compiler on the nbody benchmark, which does a lot of arithmetic.
In the third part I got 4.5 times faster run-times with perl-level AELEMFAST optimizations, and discussed optimizing array accesses via no autovivification or types.
Optimizing array accesses showed the need for autovivification detection in B::CC and better stack handling for more ops and datatypes, especially aelem and helem.
But first let's study some easier goals to accomplish. If we look at the generated C source for a simple arithmetic function like pp_sub_offset_momentum, we immediately spot more possibilities.
static
CCPP(pp_sub_offset_momentum)
{
SV *sv, *src, *dst, *left, *right;
NV rnv0, lnv0, d1_px, d2_py, d3_pz, d4_mass, d7_tmp, d10_tmp, d13_tmp, d15_tmp, d17_tmp, d19_tmp, d21_tmp, d23_tmp, d25_tmp, d27_tmp, d29_tmp, d31_tmp, d33_tmp, d35_tmp, d37_tmp, d40_tmp, d42_tmp, d44_tmp;
PERL_CONTEXT *cx;
MAGIC *mg;
I32 oldsave, gimme;
dSP;
lab_2a41220:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* only needed once */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[1]); /* not needed at all */
d1_px = 0.00;
lab_2a41370:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[2]); /* not needed at all */
d2_py = 0.00;
lab_2a50a00:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[3]); /* not needed at all */
d3_pz = 0.00;
lab_2a50b30:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[4]); /* not needed at all */
lab_2a50cc0:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
PUSHs(AvARRAY(MUTABLE_AV(PL_curpad[5]))[0]); /* no autovivification */
sv = POPs;
MAYBE_TAINT_SASSIGN_SRC(sv); /* not needed */
SvSetMagicSV(PL_curpad[4], sv); /* i.e. PL_curpad[4] = sv; */
...
We can study the expanded macros with:
cc_harness -DOPT -E -O2 -onbody.perl-2.perl-1.i nbody.perl-2.perl.c
TAINT_NOT does (PL_tainted = (0)). It is needed only once, because nothing in between changes PL_tainted. We can also ignore taint checks entirely by setting -fomit_taint.
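As a minimal hand-written sketch (not actual B::CC output; the variable names are only borrowed from the listing above), this is what hoisting the taint reset to the function entry would look like:

#include "EXTERN.h"
#include "perl.h"

/* Today: every inlined nextstate resets the taint flag again. */
static void taint_reset_per_state(pTHX)
{
    NV d1_px, d2_py;
    TAINT_NOT;       /* roughly PL_tainted = 0 */
    d1_px = 0.00;
    TAINT_NOT;       /* redundant: nothing in between sets PL_tainted */
    d2_py = 0.00;
}

/* Hoisted: one reset at entry suffices for a chain of pure-arithmetic
 * blocks, as long as no op in between can raise the taint flag.       */
static void taint_reset_hoisted(pTHX)
{
    NV d1_px, d2_py;
    TAINT_NOT;
    d1_px = 0.00;
    d2_py = 0.00;
}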
perl -MO=Concise,offset_momentum nbody.perl-2a.perl
main::offset_momentum:
42 <1> leavesub[1 ref] K/REFC,1 ->(end)
- <@> lineseq KP ->42
1 <;> nextstate(main 141 (eval 5):4) v ->2
4 <2> sassign vKS/2 ->5
2 <$> const(NV 0) s ->3
3 <0> padsv[$px:141,145] sRM*/LVINTRO ->4
...
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; is the second part of the inlined code for nextstate and resets the stack pointer. Since we keep track of the stack ourselves, we can omit most of these resets in nextstate.
FREETMPS is also part of nextstate; calling it after each basic block is optimized with -O1, and -O2 would free the temps only after each loop. Whether FREETMPS is needed at all, i.e. whether any locals are used in the function, is not checked yet.
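A rough hand-written sketch (not actual compiler output) of the difference between the per-basic-block FREETMPS emitted today and the per-loop freeing that -O2 aims for, assuming a loop body that works only on C doubles and creates no mortals of its own:

#include "EXTERN.h"
#include "perl.h"

/* Per basic block, as inlined from nextstate today. */
static void free_tmps_per_block(pTHX_ int n)
{
    int i;
    for (i = 0; i < n; i++) {
        FREETMPS;    /* head of block 1: usually nothing to free */
        /* ... arithmetic on C doubles ... */
        FREETMPS;    /* head of block 2: again nothing to free */
    }
}

/* Per loop: release any accumulated temps once, after the loop ends. */
static void free_tmps_per_loop(pTHX_ int n)
{
    int i;
    for (i = 0; i < n; i++) {
        /* ... arithmetic on C doubles ... */
    }
    FREETMPS;
}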
SAVECLEARSV(PL_curpad[1-4]) is part of padsv with LVINTRO, but it is unneeded here, since it is in the context of an sassign. So the value of the lexical does not need to be cleared before it is set. And by the way, the setter of the lexical is already optimized to a temporary.
MAYBE_TAINT_SASSIGN_SRC(sv) is part of sassign and can be omitted with -fomit_taint; and since we are at TAINT_NOT anyway, we can leave it out.
SvSetMagicSV(PL_curpad[4], sv) is also part of the optimized sassign op, just not yet optimized enough, since sv cannot have any magic here. A type declaration for the padsv would have allowed the faster equivalent SvNVX(PL_curpad[4]) = SvNV(sv); with the value then put on the stack.
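For illustration, a minimal sketch of the cheaper assignment such a type declaration would enable, assuming the pad slot is known to hold a plain, non-magical NV. This is hand-written, and the SvNOK_only call is my addition to keep the SV's flags consistent, not something the current compiler emits:

#include "EXTERN.h"
#include "perl.h"

/* Generic sassign as emitted today: full SV copy plus set-magic. */
static void assign_generic(pTHX_ SV *dst, SV *src)
{
    SvSetMagicSV(dst, src);
}

/* Typed path: copy only the numeric slot.  SvNV(src) still handles
 * get-magic and string-to-number conversion on the right-hand side,
 * while SvNVX(dst) writes the NV slot of the target directly.        */
static void assign_typed_nv(pTHX_ SV *dst, SV *src)
{
    SvNVX(dst) = SvNV(src);
    SvNOK_only(dst);     /* mark dst as containing a valid NV */
}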
We can easily test the cost of the redundant sections above (the repeated TAINT_NOT, stack resets, FREETMPS, SAVECLEARSV and MAYBE_TAINT_SASSIGN_SRC calls) by NOP'ing them and measuring: with them 4m53.073s, without them 4m23.265s, i.e. 30 seconds or ~10% faster. This is now in the typical range of p5p micro-optimizations and not considered high priority for now.
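One rough way to reproduce that measurement without editing every occurrence (a testing hack only, not something the compiler should emit) is to redefine the macros in the generated .c file after the perl headers are included; the explicit sp resets still have to be removed by hand:

/* added after the perl header includes in the generated nbody C file */
#undef  TAINT_NOT
#define TAINT_NOT        NOOP    /* drop the repeated taint resets       */
#undef  FREETMPS
#define FREETMPS         NOOP    /* drop the per-block freeing of temps  */
#undef  SAVECLEARSV
#define SAVECLEARSV(sv)  NOOP    /* drop the clear-on-scope-exit saves   */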
Let's check out more stack optimizations instead.
I added a new B::Stackobj::Aelem object to B::Stackobj to track aelemfast accesses to array indices and to do the PUSH/POP optimizations on them.
The generated code now looks like:
lab_116f270:
TAINT_NOT;
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
FREETMPS;
rnv0 = d9_mag; lnv0 = SvNV(AvARRAY((AV*)PL_curpad[25])[1]); /* multiply */
d3_mm2 = lnv0 * rnv0;
lab_116be90:
TAINT_NOT;
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
FREETMPS;
d5_dx = SvNV(PL_curpad[5]);
rnv0 = d3_mm2; lnv0 = d5_dx; /* multiply */
d29_tmp = lnv0 * rnv0;
SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) = SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) - d29_tmp;
Lvalue assignments need SvNVX, rvalues can keep SvNV. The multiply op targeting AvARRAY((AV*)PL_curpad[28])[0] has the OPf_MOD flag, since its first argument is modified.
nextstate with TAINT, FREETMPS and sp reset is still not optimized.
Performance went from 4m53.073s to 3m58.249s, i.e. 55s or 18.7% faster, which is much better than the nextstate optimizations. Shaving another 30s on top of that would give 3m30s, still slower than Erlang, Racket or C#. And my goal was 2m30s.
But there's still a lot to optimize (loop unrolling, aelem, helem, ...), and adding the no autovivification check was also costly. Several dependent packages were added to the generated code, such as autovivification, Tie::Hash::NamedCapture, mro, Fcntl, IO, Exporter, Cwd, File::Spec, Config, FileHandle, IO::Handle, IO::Seekable, IO::File, Symbol, Exporter::Heavy, ... But you don't see this cost in the binary size, nor in the run-time.
I also tested the fannkuchredux benchmark, which was created in 1994 for a badly optimizing LISP compiler and also relies heavily on array accesses.
Uncompiled with N=10 I got 16.093s, compiled 9.1222s, almost 2x faster (1.76x). And this code has the same aelem problem as nbody, so loop unrolling to aelemfast and better direct accessors with no autovivification would lead to a ~4x faster run-time.
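For illustration, a hand-written C sketch (not compiler output) of why a constant index is so much cheaper, assuming a plain lexical array with no magic, no holes, and a loop bound known at compile time:

#include "EXTERN.h"
#include "perl.h"

/* aelem with a runtime index goes through av_fetch(), which
 * bounds-checks and handles missing elements; the real aelem op
 * additionally has to consider magic and autovivification.        */
static NV sum_aelem(pTHX_ AV *av, I32 n)
{
    NV sum = 0.0;
    I32 i;
    for (i = 0; i < n; i++) {
        SV **svp = av_fetch(av, i, FALSE);
        if (svp)
            sum += SvNV(*svp);
    }
    return sum;
}

/* After unrolling with a known n == 3, every access has a constant
 * index (the aelemfast case) and can read the AvARRAY slots directly. */
static NV sum_unrolled(pTHX_ AV *av)
{
    SV **ary = AvARRAY(av);
    return SvNV(ary[0]) + SvNV(ary[1]) + SvNV(ary[2]);
}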
nextstate optimisations
nextstate and its COP brother dbstate are mainly used to store the line number of the current op for debugging. I already wrote an oplines patch to move the line info into all OPs, which reduced the need for 90% of the nextstate ops and would overcome the problem we are facing here:
PL_op = &curcop_list[0];
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* rarely needed */
FREETMPS; /* rarely needed, only with TMPs */
oplines is not yet usable: it only reduces the number of nextstate ops, and I haven't yet written the changes to warnings and error handling that would be needed to find the current cop on warn or die, so that the file name can still be displayed together with the line number.
A different strategy would be to create simpler state COPs, without the TAINT check, without the stack reset and without FREETMPS. Something like state, state_t, state_s, state_f, state_ts, state_sf, state_tsf == nextstate.
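A minimal sketch of what two of these specialized handlers could look like; the names pp_state and pp_state_s are hypothetical, just following the naming above, and this is not existing perl core code:

#include "EXTERN.h"
#include "perl.h"

/* Hypothetical bare "state": only record the current COP, with no
 * taint reset, no stack reset and no FREETMPS.                      */
static OP *pp_state(pTHX)
{
    PL_curcop = (COP*)PL_op;
    return PL_op->op_next;
}

/* Hypothetical "state_s": keep only the stack reset. */
static OP *pp_state_s(pTHX)
{
    dSP;
    PL_curcop = (COP*)PL_op;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    PUTBACK;
    return PL_op->op_next;
}

/* "state_tsf" would keep all three parts and thus be equivalent to
 * today's pp_nextstate.                                             */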
TBC...
A general 10% speed-up (run-time, not start-up) would very much be considered a significant gain by a significant number of people on p5p. That includes at least Yves Orton and me, plus several others that I feel less confident to name.
Specifically, if it's a generally applicable change, I am quite confident that if somebody were willing to do the actual work (including shepherding it through peer review and getting it applied, with my help), I could find some financial sponsorship for the effort.
10% is a big fucking deal, if it's an overall speedup on a macroscopic, meaningful program.
I concur that 10% would be a big deal to some, but my current results of 550% are not?
Why on earth are you proposing to omit taint checks at all, when
1. optimizing TAINT checks would gain 10% (while keeping the feature), and
2. optimizing other parts gained 550%?
I am confused.
I never said your results aren't a big deal. I don't know where you got that from.
What I did say was that speeding up Perl by several percent overall in the space of about one hour of hacking is great.
By the way, something that wasn't entirely clear to me in your blog posts: To achieve the end results, did you have to resort to any manual changes to the generated code?
Any news on "To achieve the end results, did you have to resort to any manual changes to the generated code?"?
Steffen: I haven't implemented types and the simple state optimizations yet,
only stack optimizations for lexical arrays, -fno-magic and -fno-autovivification for B::CC.
That is where the 6.5x faster figure comes from, done in a few days.
I will not start on types until the status of acceptance becomes clearer to me. So far I have heard nothing. And const is more important than types. Both will cost me several months, and both are huge language changes.
nextstate will bring about 10%, so oplines will be a better path.
Will Braswell volunteered to implement loop unrolling, which will gain 100% with known loop sizes and some array accesses in the loop.
When the B::CC implementation looks stable, we can bring it back to op.c.
Hi Reini,
thanks for getting back to me. Though I do have to admit that I'm not sure whether you've answered my question. Might just be me misreading: Did you have to edit the generated code or not?
Separate topic: Out of curiosity, how do you intend to represent the unrolled loops within the context of the current perl5 VM (op.c), at what time (which compilation phase, or even at run time), and what kind of loop could that pertain to? I understand that you're focussing on B::CC for now, so you probably don't have a ready-made answer for all of the above. That's fine, I'm just trying to get a feel of where you'd be going technically.
Cheers,
Steffen
-funroll-loops is currently being implemented by Will and me for B::CC only. See https://github.com/wbraswell/perl-compiler/commits/unroll-loops
This is done at compile-time by analyzing the optree, and changing the emitted C code, as this was the easiest to do without adding more dependencies to the compiler.
With B::C the optimization will be implemented as XS code, because I need to change some ops and copy the whole body without B::Generate.
For op.c the code will be quite simple, but first we need to get it stable and develop some heuristics for when it is worthwhile and which cases to skip.
Thanks for clarifying and good luck in the endeavor!