Optimizing compiler benchmarks (part 4)
nbody - More optimizations
In the first part I showed some problems and possibilities of the B::C compiler and the B::CC optimizing compiler with a regexp example, which was a poor candidate for optimization.
In the second part I got 2 times faster run-times with the B::CC compiler on the nbody benchmark, which does a lot of arithmetic.
In the third part I got 4.5 times faster run-times with perl-level AELEMFAST optimizations, and discussed optimizing array accesses via no autovivification or types.
Optimizing array accesses showed the need for autovivification detection in B::CC and better stack handling for more ops and datatypes, especially aelem and helem.
But first let's study some easier goals to accomplish. If we look at the generated C source for a simple arithmetic function like pp_sub_offset_momentum, we immediately spot more possibilities.
static
CCPP(pp_sub_offset_momentum)
{
SV *sv, *src, *dst, *left, *right;
NV rnv0, lnv0, d1_px, d2_py, d3_pz, d4_mass, d7_tmp, d10_tmp, d13_tmp, d15_tmp, d17_tmp, d19_tmp, d21_tmp, d23_tmp, d25_tmp, d27_tmp, d29_tmp, d31_tmp, d33_tmp, d35_tmp, d37_tmp, d40_tmp, d42_tmp, d44_tmp;
PERL_CONTEXT *cx;
MAGIC *mg;
I32 oldsave, gimme;
dSP;
lab_2a41220:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* only needed once */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[1]); /* not needed at all */
d1_px = 0.00;
lab_2a41370:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[2]); /* not needed at all */
d2_py = 0.00;
lab_2a50a00:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[3]); /* not needed at all */
d3_pz = 0.00;
lab_2a50b30:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
SAVECLEARSV(PL_curpad[4]); /* not needed at all */
lab_2a50cc0:
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
FREETMPS; /* only needed once */
PUSHs(AvARRAY(MUTABLE_AV(PL_curpad[5]))[0]); /* no autovivification */
sv = POPs;
MAYBE_TAINT_SASSIGN_SRC(sv); /* not needed */
SvSetMagicSV(PL_curpad[4], sv); /* i.e. PL_curpad[4] = sv; */
...
We can study the expanded macros with:
cc_harness -DOPT -E -O2 -onbody.perl-2.perl-1.i nbody.perl-2.perl.c
TAINT_NOT does (PL_tainted = (0)). It is needed only once, because nothing in between changes PL_tainted. We can also ignore taint checks entirely by setting -fomit_taint.
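As a minimal hand-written sketch (not actual B::CC output; the variable names are only borrowed from the listing above), this is what hoisting the taint reset to the function entry would look like:

#include "EXTERN.h"
#include "perl.h"

/* Today: every inlined nextstate resets the taint flag again. */
static void taint_reset_per_state(pTHX)
{
    NV d1_px, d2_py;
    TAINT_NOT;       /* roughly PL_tainted = 0 */
    d1_px = 0.00;
    TAINT_NOT;       /* redundant: nothing in between sets PL_tainted */
    d2_py = 0.00;
}

/* Hoisted: one reset at entry suffices for a chain of pure-arithmetic
 * blocks, as long as no op in between can raise the taint flag.       */
static void taint_reset_hoisted(pTHX)
{
    NV d1_px, d2_py;
    TAINT_NOT;
    d1_px = 0.00;
    d2_py = 0.00;
}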
perl -MO=Concise,offset_momentum nbody.perl-2a.perl
main::offset_momentum:
42 <1> leavesub[1 ref] K/REFC,1 ->(end)
- <@> lineseq KP ->42
1 <;> nextstate(main 141 (eval 5):4) v ->2
4 <2> sassign vKS/2 ->5
2 <$> const(NV 0) s ->3
3 <0> padsv[$px:141,145] sRM*/LVINTRO ->4
...
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; is the second part of the inlined code for nextstate and resets the stack pointer. Since we keep track of the stack ourselves, we can omit most of these resets in nextstate.
FREETMPS is also part of nextstate; calling it after each basic block is optimized with -O1, and -O2 would free the temps only after each loop. Whether FREETMPS is needed at all, i.e. whether any locals are used in the function, is not checked yet.
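A rough hand-written sketch (not actual compiler output) of the difference between the per-basic-block FREETMPS emitted today and the per-loop freeing that -O2 aims for, assuming a loop body that works only on C doubles and creates no mortals of its own:

#include "EXTERN.h"
#include "perl.h"

/* Per basic block, as inlined from nextstate today. */
static void free_tmps_per_block(pTHX_ int n)
{
    int i;
    for (i = 0; i < n; i++) {
        FREETMPS;    /* head of block 1: usually nothing to free */
        /* ... arithmetic on C doubles ... */
        FREETMPS;    /* head of block 2: again nothing to free */
    }
}

/* Per loop: release any accumulated temps once, after the loop ends. */
static void free_tmps_per_loop(pTHX_ int n)
{
    int i;
    for (i = 0; i < n; i++) {
        /* ... arithmetic on C doubles ... */
    }
    FREETMPS;
}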
SAVECLEARSV(PL_curpad[1-4]) is part of padsv with LVINTRO, but it is unneeded here, since it is in the context of an sassign. So the value of the lexical does not need to be cleared before it is set. And by the way, the setter of the lexical is already optimized to a temporary.
MAYBE_TAINT_SASSIGN_SRC(sv) is part of sassign and can be omitted with -fomit_taint; and since we are at TAINT_NOT anyway, we can leave it out.
SvSetMagicSV(PL_curpad[4], sv) is also part of the optimized sassign op, just not yet optimized enough, since sv cannot have any magic here. A type declaration for the padsv would have allowed the faster equivalent SvNVX(PL_curpad[4]) = SvNV(sv); with the value then put on the stack.
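For illustration, a minimal sketch of the cheaper assignment such a type declaration would enable, assuming the pad slot is known to hold a plain, non-magical NV. This is hand-written, and the SvNOK_only call is my addition to keep the SV's flags consistent, not something the current compiler emits:

#include "EXTERN.h"
#include "perl.h"

/* Generic sassign as emitted today: full SV copy plus set-magic. */
static void assign_generic(pTHX_ SV *dst, SV *src)
{
    SvSetMagicSV(dst, src);
}

/* Typed path: copy only the numeric slot.  SvNV(src) still handles
 * get-magic and string-to-number conversion on the right-hand side,
 * while SvNVX(dst) writes the NV slot of the target directly.        */
static void assign_typed_nv(pTHX_ SV *dst, SV *src)
{
    SvNVX(dst) = SvNV(src);
    SvNOK_only(dst);     /* mark dst as containing a valid NV */
}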
We can easily test the cost of the redundant sections above (the repeated TAINT_NOT, stack resets, FREETMPS, SAVECLEARSV and MAYBE_TAINT_SASSIGN_SRC calls) by NOP'ing them and measuring: with them 4m53.073s, without them 4m23.265s, i.e. 30 seconds or ~10% faster. This is now in the typical range of p5p micro-optimizations and not considered high priority for now.
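One rough way to reproduce that measurement without editing every occurrence (a testing hack only, not something the compiler should emit) is to redefine the macros in the generated .c file after the perl headers are included; the explicit sp resets still have to be removed by hand:

/* added after the perl header includes in the generated nbody C file */
#undef  TAINT_NOT
#define TAINT_NOT        NOOP    /* drop the repeated taint resets       */
#undef  FREETMPS
#define FREETMPS         NOOP    /* drop the per-block freeing of temps  */
#undef  SAVECLEARSV
#define SAVECLEARSV(sv)  NOOP    /* drop the clear-on-scope-exit saves   */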
Let's check out more stack optimizations instead.
I added a new B::Stackobj::Aelem object to B::Stackobj to track aelemfast accesses to array indices and to do the PUSH/POP optimizations on them.
The generated code now looks like:
lab_116f270:
TAINT_NOT;
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
FREETMPS;
rnv0 = d9_mag; lnv0 = SvNV(AvARRAY((AV*)PL_curpad[25])[1]); /* multiply */
d3_mm2 = lnv0 * rnv0;
lab_116be90:
TAINT_NOT;
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
FREETMPS;
d5_dx = SvNV(PL_curpad[5]);
rnv0 = d3_mm2; lnv0 = d5_dx; /* multiply */
d29_tmp = lnv0 * rnv0;
SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) = SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) - d29_tmp;
Lvalue assignments need SvNVX, rvalues can keep SvNV. The multiply op targeting AvARRAY((AV*)PL_curpad[28])[0] has the OPf_MOD flag, since its first argument is modified.
nextstate with TAINT, FREETMPS and sp reset is still not optimized.
Performance went from 4m53.073s to 3m58.249s, i.e. 55s or 18.7% faster, which is much better than the nextstate optimizations. Shaving another 30s on top of that would give 3m30s, still slower than Erlang, Racket or C#. And my goal was 2m30s.
But there's still a lot to optimize (loop unrolling, aelem, helem, ...), and adding the no autovivification check was also costly. Several dependent packages were added to the generated code, such as autovivification, Tie::Hash::NamedCapture, mro, Fcntl, IO, Exporter, Cwd, File::Spec, Config, FileHandle, IO::Handle, IO::Seekable, IO::File, Symbol, Exporter::Heavy, ... But you don't see this cost in the binary size, nor in the run-time.
I also tested the fannkuchredux benchmark, which was created in 1994 for a badly optimizing LISP compiler and also relies heavily on array accesses.
Uncompiled with N=10 I got 16.093s, compiled 9.1222s, almost 2x faster (1.76x). And this code has the same aelem problem as nbody, so loop unrolling to aelemfast and better direct accessors with no autovivification would lead to a ~4x faster run-time.
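For illustration, a hand-written C sketch (not compiler output) of why a constant index is so much cheaper, assuming a plain lexical array with no magic, no holes, and a loop bound known at compile time:

#include "EXTERN.h"
#include "perl.h"

/* aelem with a runtime index goes through av_fetch(), which
 * bounds-checks and handles missing elements; the real aelem op
 * additionally has to consider magic and autovivification.        */
static NV sum_aelem(pTHX_ AV *av, I32 n)
{
    NV sum = 0.0;
    I32 i;
    for (i = 0; i < n; i++) {
        SV **svp = av_fetch(av, i, FALSE);
        if (svp)
            sum += SvNV(*svp);
    }
    return sum;
}

/* After unrolling with a known n == 3, every access has a constant
 * index (the aelemfast case) and can read the AvARRAY slots directly. */
static NV sum_unrolled(pTHX_ AV *av)
{
    SV **ary = AvARRAY(av);
    return SvNV(ary[0]) + SvNV(ary[1]) + SvNV(ary[2]);
}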
nextstate optimisations
nextstate and its COP brother dbstate are mainly used to store the line number of the current op for debugging. I already wrote an oplines patch to move the line info into all OPs, which reduced the need for 90% of the nextstate ops and would overcome the problem we are facing here:
PL_op = &curcop_list[0];
TAINT_NOT; /* only needed once */
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* rarely needed */
FREETMPS; /* rarely needed, only with TMPs */
oplines is not yet usable: it only reduces the number of nextstate ops, and I haven't yet written the changes to warnings and error handling that would be needed to find the current cop on warn or die, so that the file name can still be displayed together with the line number.
A different strategy would be to create simpler state COPs, without the TAINT check, without the stack reset and without FREETMPS. Something like state, state_t, state_s, state_f, state_ts, state_sf, state_tsf == nextstate.
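A minimal sketch of what two of these specialized handlers could look like; the names pp_state and pp_state_s are hypothetical, just following the naming above, and this is not existing perl core code:

#include "EXTERN.h"
#include "perl.h"

/* Hypothetical bare "state": only record the current COP, with no
 * taint reset, no stack reset and no FREETMPS.                      */
static OP *pp_state(pTHX)
{
    PL_curcop = (COP*)PL_op;
    return PL_op->op_next;
}

/* Hypothetical "state_s": keep only the stack reset. */
static OP *pp_state_s(pTHX)
{
    dSP;
    PL_curcop = (COP*)PL_op;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    PUTBACK;
    return PL_op->op_next;
}

/* "state_tsf" would keep all three parts and thus be equivalent to
 * today's pp_nextstate.                                             */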
TBC...
A general 10% speed-up (run-time, not start-up) would very much be considered a significant gain by a significant number of people on p5p. That includes at least Yves Orton and me, plus several others that I feel less confident to name.
Specifically, if it's a generally applicable change, I am quite confident that if somebody were willing to do the actual work (including shepherding it through peer review and getting it applied, with my help), I could find some financial sponsorship for the effort.
10% is a big fucking deal, if it's an overall speedup on a macroscopic, meaningful program.
I concur that 10% would be a big deal to some, but my current results of 550% are not?
Why on earth are you proposing to omit taint checks at all, when
1. optimizing TAINT checks would gain 10% (while keeping the feature), and
2. optimizing other parts gained 550%?
I am confused.
I never said your results aren't a big deal. I don't know where you got that from.
What I did say was that speeding up Perl by several percent overall in the space of about one hour of hacking is great.
By the way, something that wasn't entirely clear to me in your blog posts: To achieve the end results, did you have to resort to any manual changes to the generated code?
Any news on "To achieve the end results, did you have to resort to any manual changes to the generated code?"?
Steffen: I haven't implemented types and the simple state optimizations yet,
only stack optimizations for lexical arrays, -fno-magic and -fno-autovivification for B::CC.
That is where the 6.5x faster figure comes from, done in a few days.
I will not start on types until the status of acceptance becomes clearer to me. So far I have heard nothing. And const is more important than types. Both will cost me several months, and both are huge language changes.
nextstate will bring about 10%, so oplines will be a better path.
Will Braswell volunteered to implement loop unrolling, which will gain 100% with known loop sizes and some array accesses in the loop.
When the B::CC implementation looks stable, we can bring it back to op.c.
Hi Reini,
thanks for getting back to me. Though I do have to admit that I'm not sure whether you've answered my question. Might just be me misreading: Did you have to edit the generated code or not?
Separate topic: Out of curiosity, how do you intend to represent the unrolled loops within the context of the current perl5 VM (op.c), at what time (which compilation phase, or even at run time), and what kind of loop could that pertain to? I understand that you're focussing on B::CC for now, so you probably don't have a ready-made answer for all of the above. That's fine, I'm just trying to get a feel of where you'd be going technically.
Cheers,
Steffen
-funroll-loops is currently being implemented by Will and me for B::CC only. See https://github.com/wbraswell/perl-compiler/commits/unroll-loops
This is done at compile-time by analyzing the optree, and changing the emitted C code, as this was the easiest to do without adding more dependencies to the compiler.
With B::C the optimization will be implemented as XS code, because I need to change some ops and copy the whole body without B::Generate.
For op.c the code will be quite simple, but first we need to get it stable and develop some heuristics for when it is worthwhile and which cases to skip.
Thanks for clarifying and good luck in the endeavor!