September 2012 Archives

My perl5 TODO list

Below is a formal list of possible optimizations, which most would agree on. We had these discussion in 2001 with damian were perl6 and perl5i took off. I'd like to work on these for perl5 core and need decisions. Most p5p hackers seem to be informed about the general possibilities and directions, but not all. We'd need this to improve general perl5 performance, and also help static compilation.[1]

We had this before, so I'd like to keep it formal. So each proposal gets a perl6-like name, and replies should change the subject to that name. I choose PDD for "Perl Design Draft".

Beforehand: "compiler" means op.c not B::C. compile-time and run-time should be obvious.

PDD01 const / readonly lexicals

The CONST op currently is a SVOP, holding a global gvsv. A CONST op might hold lexicals also, a PADOP type. The more constants the compiler knows at compile-time the better it can optimize. The following datatypes need to be represented as const:

  • PADSV (lexicals and esp. function arguments)

  • "PDD02 final classes - const @ISA"

  • "PDD03 immutable classes - const %class::"

Esp. readonly function arguments need to be parsed into lexical consts, but "my const $i" or "my $i:ro" also. I have no opinion on "my $i is ro", but it would be the best choice. See "PDD05 Function and method signatures"

Datatypes:

SVt_READONLY already is good enough to hold this information in the data. But the compiler does not want to optimize on datatypes, the information needs to represented as OP. Just for the special cases @ISA and stashes it is not needed.

So either add a mixin svop+padop type for CONST decriminated by OPpCONST_PAD 1, add a CONST flag to PADSV,

or add a new CONSTPAD op, replacing PADSV/const which needs to be added into all current CONST checks in the compiler.

CONST with OPpCONST_PAD flag:

Pro: Easier and faster for the compiler.

Contra: The logic for the new OP type which is a union of SVOP and PADOP needs to be added for all accessors. B and its libraries, but also XS walkers.

PADSV with OPpPAD_CONST flag

Pro: Does not break libraries

Contra: CONST checks need to check PADSV's also.

CONSTPAD:

Pro: Does not break libraries

Contra: CONST checks need to check CONSTPAD also.

Personally I lean against CONSTPAD.

Keywords: (how to parse)

The following variants are being considered: lexicals and globals:

my const $i; my const ($i, $j) = (0, 1);   (as const keyword upfront)
my $i :ro;
my $i is ro;

See "PDD05 Function and method signatures"

sub call (const $i) {}
sub call ($i:ro) {}
sub call ($i is ro) {}

See "PDD02 final classes - const @ISA"

const our @ISA = ('MyBase');
our @ISA :ro = ('MyBase');
our @ISA is ro = ('MyBase');
class MyClass is final {
  our @ISA = ('MyBase');
}
class MyClass (extends => ('MyBase'), is_final => 1) {}

See "PDD03 immutable classes - const %class::"

const package MyClass { } and const package MyClass;
const %MyClass::;
class MyClass is immutable {}
class MyClass (is_immutable => 1) {}

No keyword. immutable should be the new default for the class keyword, old-style packages stay mutable.

Keyword discussion:

The type qualifier const, which creates CONST/CONSTPAD op and sets the SVf_READONLY flag can be represented either as new keyword "const", which looks most natural, but is hardest to parse. Larry opposed it initially, because it looked to C++ish. But nowadays it looks best.

The attribute it would be easiest to parse, as a MYTERM also parses and handles attributes, The MYTERM type just needs be extended for signatures. It also looks natural.

The perl6-like type trait is harder to parse, and a bit unnatural for lexicals.

The Moose style hash attributes only work for classes, not for lexicals and sigs.

PDD02 final classes - const @ISA

A const isa is commonly known as "final" keyword. The class is not extendable, the compiler can do compile-time method resolution, i.e. convert a method to a function.

Pro: Compile-time method resolution

If the compiler knows at compile-time for each method, that all isa's until the method is found are const and also those classes are immutable (const), the method can be converted to a function. That would be a huge performance win, esp. with classes with favor methods over hash accessors.

Note that the accessor typo problem could also be solved with const hashes of the object representation, but nobody is using that yet. A const class (const %classname::) not, as this is independent of the underlying object representation, which is usually a blessed hash.

Function calls are slow, and method calls even 10% minimum slower. (10% for immediately found methods, for a deeper search the run-time costs are higher)

Contra:

I hope the "final" problem is known from java. Since the compiler needs to know in advance the inheritances it is not possible to extend and override methods of final classes. One cannot extend java strings. Thanks to Michael Schwern for the discussion.

Solutions:

  1. (Reini): Define the following convention. No additional keywords needed. Libraries may use final, but finalization is defered until the application is processed, and all libraries (use statements) are already loaded. So mocking is still possible, but the default is to use compile-time method resolution. Schwern sees a problem in that scheme which I haven't understood yet.

  2. (Larry): Libraries may use final, but the application with a #pragma final has the final word.

See also pddtypes.pod

PDD03 immutable classes - const %class::

Classes should default to immutable, packages keep the dynamic behaviour unless a package is declared as const. (Damian)

Some might know from Moose that immutable classes makes it 20x faster, even if not all possible optimizations are yet done.

PDD04 Types

They are already parsed for lexicals, just not for named arguments. The 3 coretypes int, num, string need to be reserved. p5-mop will probably define more. bool needs to be added probably also.

Type conventions in core are needed to

  1. talk to other languages, like json, perl6 or java,

  2. to specify the wanted behavior for methods acting on types, such as smartmatch or multi-methods, or

  3. for special performance purposes, e.g. int loop counters, int arithmetic, smaller and faster typed arrays or hashes, or to enforce compile-time method resolution.

See pddtypes.pod and perltypes.pod I had an old version at my blog and at YAPC

An initial benefit would be natively typed arrays and hashes in core, with const hashes even optimizable hashes (so called "perfect hashes"). Further type checks and optimizers are left to modules.

PDD04.1 CHECK_SCALAR_ATTRIBUTES

Compile-time attribute hook for our three types to be able to use attributes for my declarations.

Note: Attributes still suffer from an over-architectured and broken Attribute::Handler implementation which evals the attribute value.

our $name:help(print the name); will call eval "print the name";

Without fixing this, attributes will have no chance to be accepted. The syntax is nice, and it is already parsed.

PDD05 Function and method signatures

The current prototype syntax explictly allows named arguments. There are several implementations already.

But there are several decisions required.

In order to optimize function and method calls, we need to define type qualifiers, and eventually return types, even if they are not used yet.

New syntax allows changing the semantics.

Lets follow perl6:

  • is bind (default) vs is copy (old semantics)

  • is ro (default) vs is rw (old semantics)

  • allow passing types and attributes to functions. attributes allow user-define hooks as now, just on function entries, not on variable declarations.

Optional arguments are defined by specifying defaults.

If we do not follow perl6 syntax with "is", we need attributes to specify ":rw" and possibly "\$" to specify bind (by reference).

e.g. sub myadder (\$i, $num = 1) { $i =+ $num }
or   sub myadder ($i:rw, $num = 1) { $i =+ $num }

bind ro is by far the fastest calling convention. optimizable and checkable by the compiler. copy is the safe way, rw uses the old $_[n] semantics.

I outlined my proposal in pddtypes.pod

Q: Do function args and return values keep constness?

A: Only function args by ref. This is current behaviour and makes sense.

PDD06 Function return types

Any optimizer needs to stop if a function return type is not known. We don't even know if any value is returned at all, so we have to check @_ at every LEAVESUB, though the parser knows the context information already. By optionally declaring return types, a type checker and optimizer can kick in. Esp. for coretypes like int, num, str or void or a const qualifier.

There exist old and wildly different syntaxes for return types, but they are unused. Use the perl6 syntax, which is c-like.

Q What about libraries declaring their return values constant? I cannot change them then and have to copy them?

A: No. Return values so far are not const. Only if you declare a function to return a const it will be so.

PDD07 Compile-time entersub optimizations

Calling a function via ENTERSUB and cleaning up at LEAVESUB is by far the slowest part of perl.

We can check our functions for the following situations: exceptions, jumps out, lexicals, locals, function calls, recursive calls.

If none of these occur, the function can be inlined.

We also need to check for tail calls and arguments. (signatures)

If no exceptions or no locals occur the parts in ENTERSUB and LEAVESUB which deal with that can be skipped.

We need to store the context and possible return type in ENTERSUB and LEAVESUB to speed up @_ handling.

We need to seperate XS calls from ENTERSUB.

PDD08 Compile-time op-arg optimizations

Our current optree resolves op argument types (the compile-time op flags and also the POP'ed flags) at run-time. For the cases the op itself specifies the behavior or the argument type can be compile-time deferred (lvalue, context, magic, ...), an optimized op version should be used.

Promote type pessimization to all affected ops, and use optimized ops for non-pessimized. Similar to i_opt (integer constant folding) if all operands are non-magic IVs.

The biggest blocker are functions borders. Without named arguments passed as bind (alias), each function must optimize from scratch and looses all information.

PDD09 Compile-time function inlining

See "PDD07 Compile-time entersub optimizations". entersub (and leavesub) needs to hold compiler information about the function, which requires waiting for parsing all embedded functions.

Even functions with arguments can be inlined, for safe versions with arguments by copy, and destructive arguments by bind. They just need a scope block.

PDD10 Compile-time method resolution

We can easily change run-time method calls at compile-time to function calls. What is left is a decision on "PDD02 final classes - const @ISA" and "PDD03 immutable classes - const %class::"

Outlined here how-perl-calls-subs-and-methods and further refined at "Compile-time type optimizations" in perltypes

PDD11 Compile-time method inlining

This just does method resolution (change to functions) and then does function inlining.

PDD12 Run-time method caching

This is trivial as there are already isa change hooks. METHOD_NAMED and METHOD just need a check a global method or object cache.

PDD13 Multi

multi needs types. (As smartmatch needs types to work reliably.)

As for the syntax multi can be implemented traditionally where the compiler generates the different methods per types automatically, or the perl6 way, with a seperate keyword. I see no problem with the first approach. This would need no new keyword.

PDD14 MOP

The current MOP discussion and opinion is mainly about the new class and method keywords, but a MOP has nothing to do with that. Also not with Moose or a new object system. A MOP allows the definition of new behaviour for classes, methods, attributes, types, roles, inheritance and so on. How they are initialized, the layout, the behavior. A definition of alternate object systems. It is mainly proposed to overcome a Moose problem with anonymous packages, to seperate classes from stashes.

Introducing a MOP is good if the current object system is not good enough. The current object system is not good enough for Moose, and should be improved. There need to be two seperate discussions. One about what improvements Moose needs from the traditional stash based objects (global vs lexical namespaces - anon Packages), and the second about the MOP itself.

I have no opinion on the mop. Just this: Why bother with a mop before some basic langauge features are not yet decided upon? Moose does not even use types properly yet. This smells for premature hooks. But pmichaud is highly convinced that a p5 mop is a good thing.

PDD20 no vivify

Something like autovivification needs to get added to improve the optree. As shown in optimizing-compiler-benchmarks-part-3 disabling vivification of arrays but also hashes will lead to compile-time optimizations and dramatic performance improvements, similar to const arrays or hashes, but even better.

PDD21 no magic

Similar to no vivify or const lexicals, a lexical no magic pragma can lead to compile-time optimizations and dramatic performance improvements.

PDD22 slimmer nextstate

Slimmer nextstate op variants can be optimized at compile-time, which do not: reset PL_taint, the stack pointer and FREETMPS.

PDD23 loop unrolling

As shown in optimizing-compiler-benchmarks-part-2 AELEMFAST is about 2 times faster than the generic AELEM, but it needs to know the index at compile-time. This is easy to do for loops.

Unroll loops with known size and lots of AELEM into AELEMFAST accesses automatically.

PDD30 Alternative parser

The worst part of perl is the parser. It is a hack, it is fast, but changing and esp. adding rules in a sane manner is hard, because the parser deviates in too many ways from a lexer/tokenizer seperation. For adding new syntax you usually cannot just add the syntax rules to perly.y

Second generating a traditional AST which generates a better optree (better optimizable, or emit jit or emit native code) is worthwile.

PDD31 Alternative vm

Our VM is a stack machine, which handles the stack on the heap. There are no typed alternatives.

There are integer optimized opts, but they are rarely used, "use integer" and "my int" can overcome this, but overflow behaviour needs to be defined. Either slow promotion to number or fast integer wrap, unsigned or signed. With "my int" this behaviour can be changed.

The VM is simple and easy to XS, but has major problems. An alternative VM could be based on parrot or vmkit or simply reuse the existing ops, with a different compiler and different stack handling.

A c-stack based compiler could arrange the optree as a natively compiled or jit'ed C program. Before each op call the op arguments (0-2 SV pointers) are put on the stack, lexicals also as in native closures, and functions are called natively via cdecl or stdcall, depending on if we need varargs.

By using LLVM even a register based (fastcall) layout can be arranged.

PDD32 Jit

A jit could solve the run-time decisions for dynamic cases, which are not solvable at compile-time. But the vm should be JIT friendly. The current VM is quite jit-friendly, but the ops itself are too dynamic. There need to be pre-compiled optimized alternatives for certain ops with known argument types.

To be practical I'm thinking of adding labels with a naming scheme to most ops, where a JIT or LLVM could hook into.

Just some random examples from pp.c, to give you an idea.

PP(pp_pos)
{
    dVAR; dSP; dPOPss;

    if (PL_op->op_flags & OPf_MOD || LVRET) {
      pp_pos_mod:
      SV * const ret = sv_2mortal(newSV_type(SVt_PVLV));
      sv_magic(ret, NULL, PERL_MAGIC_pos, NULL, 0);
      LvTYPE(ret) = '.';
      LvTARG(ret) = SvREFCNT_inc_simple(sv);
      PUSHs(ret);    /* no SvSETMAGIC */
      RETURN;
    }
    else {
      if (SvTYPE(sv) >= SVt_PVMG && SvMAGIC(sv)) {
        pp_pos_mg:
        const MAGIC * const mg = mg_find(sv, PERL_MAGIC_regex_global);
        if (mg && mg->mg_len >= 0) {
          dTARGET;
          I32 i = mg->mg_len;
          if (DO_UTF8(sv))
            sv_pos_b2u(sv, &i);
          PUSHi(i);
          RETURN;
        }
      }
      RETPUSHUNDEF;
    }
}

PP(pp_refgen)
{
    dVAR; dSP; dMARK;
    if (GIMME != G_ARRAY) {
      pp_refgen_gimme_not_array:
      if (++MARK <= SP)
        *MARK = *SP;
      else
        *MARK = &PL_sv_undef;
      *MARK = refto(*MARK);
      SP = MARK;
      RETURN;
    }
    pp_refgen_gimme_array:
    EXTEND_MORTAL(SP - MARK);
    while (++MARK <= SP)
      *MARK = refto(*MARK);
    RETURN;
}

Footnotes:

  1. "Ertl and Gregg analyze the performance of the following interpreters: Gforth, OCaml, Scheme48, Yap, Perl, Xlisp. While Gforth, OCaml, Scheme48 and Yap are categorized as efficient interpreters, Perl and Xlisp benchmarks are used for comparison purposes as inefficient interpreters.

While efficient interpreters perform with a slowdown by a factor of 10 when compared to an optimizing native code compiler, inefficient interpreters have a slowdown by a factor of 1000."

M. Anton Ertl and David Gregg. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 5:1­25, November 2003. Cited on pages 6 and 7. https://students.ics.uci.edu/~sbruntha/cgi-bin/download.py?key=thesis

Optimizing compiler benchmarks (part 1)

Since my goal is to improve the compiler optimizer (staticly with B::CC, but also the perl compiler in op.c) I came to produce these interesting benchmarks.

I took the regex-dna example from "The Computer Language Benchmarks Game" at shootout.alioth.debian.org/

$ time perl t/regex-dna.pl <t/regexdna-input
agggtaaa|tttaccct 0
[cgt]gggtaaa|tttaccc[acg] 3
a[act]ggtaaa|tttacc[agt]t 9
ag[act]gtaaa|tttac[agt]ct 8
agg[act]taaa|ttta[agt]cct 10
aggg[acg]aaa|ttt[cgt]ccct 3
agggt[cgt]aa|tt[acg]accct 4
agggta[cgt]a|t[acg]taccct 3
agggtaa[cgt]|[acg]ttaccct 5

101745
100000
133640

real    0m**0.130s**  /(varying from 0.125 to 0.132)/
user    0m0.120s
sys     0m0.008s

t/regexdna-input contains 100KB 1600 lines of DNA code, which is used to match DNA 8-mers and substitute nucleotides for IUB codes.

$ wc t/regexdna-input 
1671   1680 101745 t/regexdna-input

Perl behaves pretty good in this benchmark, it is actually the fastest scripting language. But the compiler should do better, and I had some ideas to try out for the optimizing compiler. So I thought.

First the simple and stable B::C compiler with -O3:

$ perlcc -O3 -o regex-dna-c -S t/regex-dna.pl
$ time ./regex-dna-c <t/regexdna-input
agggtaaa|tttaccct 0
[cgt]gggtaaa|tttaccc[acg] 3
a[act]ggtaaa|tttacc[agt]t 9
ag[act]gtaaa|tttac[agt]ct 8
agg[act]taaa|ttta[agt]cct 10
aggg[acg]aaa|ttt[cgt]ccct 3
agggt[cgt]aa|tt[acg]accct 4
agggta[cgt]a|t[acg]taccct 3
agggtaa[cgt]|[acg]ttaccct 5

101745
100000
133640

real    0m**0.285s**
user    0m0.272s
sys     0m0.004s

0.130s vs 0.285s compiled? What's going on? B::C promises faster startup-time and equal run-time. With -S we keep the intermediate C source to study it. Let's try B::CC, via -O. Here you don't need a -O3 as B::CC already contains all B::C -O3 optimizations

$ perlcc -O -o regex-dna-cc t/regex-dna.pl
$ time ./regex-dna-cc <t/regexdna-input
...
real    0m**0.267s**
user    0m0.256s
sys     0m0.008s

Hmm? Let's see what's going on with -v5.

$ perlcc -O3 -v5 -S -oregex-dna-c -v5 t/regex-dna.pl

script/perlcc: Compiling t/regex-dna.pl
script/perlcc: Writing C on regex-dna-c.c
script/perlcc: Calling /usr/local/bin/perl5.14.2d-nt -Iblib/arch -Iblib/lib -MO=C,-O3,-Dsp,-v,-oregex-dna-c.c t/regex-dna.pl
Starting compile
 Walking tree
 done main optree, walking symtable for extras
 Prescan 0 packages for unused subs in main::
 %skip_package: B::Stackobj B::Section B::FAKEOP B::C B::C::Section::SUPER B::C::Flags
 B::Asmdata O DB B::CC Term::ReadLine B::Shadow B::C::Section B::Bblock B::Pseudoreg
 B::C::InitSection B::C::InitSection::SUPER
 descend_marked_unused: 
...
%INC and @INC:
 Delete unsaved packages from %INC, so run-time require will pull them in:
 Deleting IO::Handle from %INC
 Deleting XSLoader from %INC
 Deleting B::C::Flags from %INC
 Deleting B::Asmdata from %INC
 Deleting Tie::Hash::NamedCapture from %INC
 Deleting B::C from %INC
 Deleting SelectSaver from %INC
 Deleting IO::Seekable from %INC
 Deleting base from %INC
 Deleting Config from %INC
 Deleting B from %INC
 Deleting Fcntl from %INC
 Deleting IO from %INC
 Deleting Symbol from %INC
 Deleting O from %INC
 Deleting Carp from %INC
 Deleting mro from %INC
 Deleting File::Spec::Unix from %INC
 Deleting FileHandle from %INC
 Deleting Exporter::Heavy from %INC
 Deleting strict from %INC
 Deleting Exporter from %INC
 Deleting vars from %INC
 Deleting Errno from %INC
 Deleting File::Spec from %INC
 Deleting IO::File from %INC
 Deleting DynaLoader from %INC
 %include_package: warnings warnings::register
 %INC: warnings.pm warnings/register.pm
 amagic_generation = 1
 Writing output
 Total number of OPs processed: 323
 NULLOP count: 8

%include_package contains: warnings warnings::register. These two cost a lot of time. Carp is also a nice example of code bloat for the static compiler.

Let's try without:

$ perlcc -O3 -Uwarnings -Uwarnings::register -S -oregex-dna-c1  t/regex-dna.pl
$ wc regex-dna-c.c
2293  16084 128953 regex-dna-c.c
$ wc regex-dna-c1.c
1201  7488 57236 regex-dna-c1.c

128953 down to 57236 bytes. Double size with warnings. So lot of startup-time overhead.

$ perlcc -O -O2 -Uwarnings -Uwarnings::register -S -oregex-dna-cc1 t/regex-dna.pl

$ time ./regex-dna-c1 <t/regexdna-input
...
real    0m**0.284s**
user    0m0.271s
sys     0m0.004s

$ time ./regex-dna-cc1 <t/regexdna-input
...
real    0m**0.266s**
user    0m0.255s
sys     0m0.008s

Not much gain by stripping warnings, since the main part is run-time, startup-time is usually 0.010 (uncompiled) to 0.001 (compiled).

Wait, what perl is perlcc calling at all? Hopefully the same as perl. Nope. As it turns out perlcc was compiled debugging, and comparing debugging perls with non-debugging explains double run-time. You see it with -v in the output above /usr/local/bin/perl5.14.2d-nt, which is my naming perlall-derived convention for debugging non-threaded.

Recompiling the compiler with normal perl, and re-testing:

$ perl -S perlcc -O3 -Uwarnings -Uwarnings::register -S -oregex-dna-c1  t/regex-dna.pl
$ perl -S perlcc -O -O2 -Uwarnings -Uwarnings::register -S -oregex-dna-cc1  t/regex-dna.pl

$ time ./regex-dna-c1 <t/regexdna-input
...
real    0m0.127s
user    0m0.124s
sys     0m0.000s

$ time ./regex-dna-cc1 <t/regexdna-input
...
real    0m0.121s
user    0m0.120s
sys     0m0.008s

0.130s vs 0.127s (compiled) vs 0.121s (optimizing compiled) makes now sense. But not much room to improve here, as the regex engine already has a pretty good DFA (not the fastest as re::Engine::RE2 would be faster) but is not optimizable by the optimizing compiler.

Better optimize numbers. Tomorrow. I want to improve stack smashing in B::CC. Getting rid of copying intermediate C values from the C stack and back to the perl heap.

See the arithmetic part 2

Reading binary floating-point numbers (numbers part2)

As explained in my previous blog post about parrot and numbers parrot writes floating-pointing numbers in the native format and reads foreign floating-point numbers in different formats.

What kind of floating-point formats exist?

I'm only studying the commonly used base 2 encodings, usually called double. Base 10 encodings decimal32, decimal64 and decimal128 also exist.

IEEE-754 defines half-precision (binary16), float (binary32), double (binary64) and quad float (binary128). It does not define the most popular format long double, esp. not the intel extended precision format, which you normally would associate with long double. There is a IEEE-754 long double but this only works on sparc64 and s390.

And since IEEE-754 long double is almost never used and hard to implement in silicon, other architectures deviated wildly also.

  • Intel uses 80-bit (10 byte) for its 12 or 16-byte long double. A complete different representation to IEEE-754.

  • Powerpc uses two 8-byte doubles for its 16-byte long double. The result is the sum of the two. double-double.

  • MIPS uses a different binary format to represent NaN, and Inf

  • I'm not so sure yet about AIX/S390 NaN.

  • sparc64 implement IEEE-754 quad float (binary128) properly, it can store it in %q registers, but the arithmetic is done in SW.

I am choosing little-endian representation here, but big-endian uses the same algorithm and code. When reading different endianness, just byteswap it before you are doing the conversion.

4-byte single FLOAT / IEEE-754 binary32

This is single precision, and only used in tiny machines. It is not even fast to compute, unless your HW is optimized to do float. double is normally faster than float, since double is HW supported.

It uses 4 byte, 32 bit. 8 bits for the exponent, and 23 bits for the mantissa. It can preserve 7-9 decimal digits.

   sign    1 bit  31
   exp     8 bits 30-23     bias 127
   frac   23 bits 22-0

+[3]----+[2]----+[1]----+[0]----+
S|  exp  |   fraction           |
+-------+-------+-------+-------+
1|<--8-->|<---23 bits---------->|
<-----------32 bits------------->

The so-called significand is the 23 fraction bits, plus an implicit leading bit which is always 1 unless the exponent bits are all 0. So the total precision is 24 bit, log10(2**24) ≈ 7.225 decimal digits

s: significand

e=0x0,  s=0:  => +-0.0
e=0xff, s=0:  => +-Inf
e=0xff, s!=0: => NaN

It is particularly odd that the sign + exponent does not align to the first byte, the exponent overlaps to the last bit of the second byte. So you have to mask off the exp and fraction.

A simple conversion to double is best done as compiler cast. Every compiler can do float and double.

cvt_num4_num8(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&f, src, 4);
    d = (double)f;
    ROUND_TO(d, double, 7);
    memcpy(dest, &d, 8);
}

This is a problematic case. double has more precision than float, so the result needs to be rounded to 7-9 digits.

8-byte DOUBLE float / IEEE-754 binary64

This is double precision, the most popular format. It uses 8 byte, 64 bit. 11 bits for the exponent, and 52 bits for the mantissa. It can preserve 15-16 decimal digits, DBL_DIG.

   sign    1 bit  63
   exp    11 bits 62-52     bias 1023
   frac   52 bits 51-0      (53 bit precision implicit)

+[7]----+[6]----+[5]----+[4]----+[3]----+[2]----+[1]----+[0]----+
S|   exp   |                  fraction                          |
+-------+-------+-------+-------+-------+-------+-------+-------+
1|<---11-->|<---------------------52 bits---------------------->|
<---------------------------64 bits----------------------------->

Precision: log10(2**53) ≈ 15.955 decimal digits. The first fraction bit is assumed to be 1 unless the exponent is 0x7ff

s: significand

e=0x0,   s=0:  => +-0.0
e=0x7ff, s=0:  => +-Inf
e=0x7ff, s!=0: => NaN

3ff0 0000 0000 0000   = 1
4000 0000 0000 0000   = 2
8000 0000 0000 0000   = -0
7ff0 0000 0000 0000   = Inf
fff0 0000 0000 0000   = -Inf
3df5 5555 5555 5555   ~ 1/3

Read into single float:

cvt_num8_num4(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&f, src, 8);
    f = (float)d;
    memcpy(dest, &d, 4);
}

No rounding problems.

Read into native long double:

cvt_num8_numld(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    memcpy(&d, src, 8);
    ld = (long double)d;
    memcpy(dest, &ld, sizeof(long double));
}

No rounding problems, as the compiler cast should handle that. Note that "native long double" can be up to 6 different binary representations, i386, amd64+i64, ppc, mips, aix or sparc64+s390.

80bit, 10-byte intel extended precision LONG DOUBLE

This is stored as 12-byte on i386 or 16 byte on x86_64 and itanium, however internally the format is still the old x87 extended precision 10 byte.

It uses 10 byte, 80 bit. 15 bits for the exponent, and 63 bits for the mantissa. It can preserve 17-19 decimal digits, LDBL_DIG.

   padding 2 or 4 byte (i386/x86_64)
   sign    1 bit  79
   exp    15 bits 78-64     bias 16383
   intbit  1 bit  63        set if normalized
   frac   63 bits 62-0

+[11]---+[10]---+[9]----+[8]----+[7]----+[6] ...+[1]----+[0]----+
|   unused      |S|     Exp     |i|          Fract              |
+-------+-------+-------+-------+-------+--- ...+-------+-------+
|<-----16------>|1|<-----15---->|1|<---------63 bits----------->|
<-------------->|<----------------80 bits----------------------->

Precision: log10(2**63) ≈ 18.965 decimal digits. The first fraction bit is here explicitly used, not hidden as before.

s: significand = frac. Looks like the Norwegian Fjord designer also helped out here. Note that this was a private clean-room design, not design by committee.

e=0x0,    i=0, s=0:  => +-0.0
e=0x0,    i=0, s!=0: => denormal
e=0x0,    i=1:       => pseudo denormal (read, but not generated)
e=0x7fff, bits 63,62=00, s=0  => old +-Inf, invalid since 80387
e=0x7fff, bits 63,62=00, s!=0 => old NaN, invalid since 80387
e=0x7fff, bits 63,62=01:      => old NaN, invalid since 80387
e=0x7fff, bits 63,62=10, s=0  => +-Inf
e=0x7fff, bits 63,62=10, s!=0 => NaN
e=0x7fff, bits 63,62=11, s=0  => silent indefinite NaN (internal Inf, 0/0, ...)
e=0x7fff, bits 63,62=11, s!=0 => silent NaN

i=0: => denormal (read, but not generated)
i=1: => normal

Reading this number into a double is tricky:

cvt_num10_num8(unsigned char *dest, const unsigned char *src)
{
    int expo, i, sign;
    memset(dest, 0, 8);
    /* exponents 15 -> 11 bits */
    sign = src[9] & 0x80;
    expo = ((src[9] & 0x7f)<< 8 | src[8]);
    if (expo == 0) {
      nul:
        if (sign)
            dest[7] |= 0x80;
        return;
    }
    expo -= 16383;       /* - bias long double */
    expo += 1023;        /* + bias for double */
    if (expo <= 0)       /* underflow */
        goto nul;
    if (expo > 0x7ff) {  /* inf/nan */
        dest[7] = 0x7f;
        dest[6] = src[7] == 0xc0 ? 0xf8 : 0xf0 ;
        goto nul;
    }
    expo <<= 4;
    dest[6] = (expo & 0xff);
    dest[7] = (expo & 0x7f00) >> 8;
    if (sign)
        dest[7] |= 0x80;
    /* long double frac 63 bits => 52 bits
       src[7] &= 0x7f; reset intbit 63 */
    for (i = 0; i < 6; ++i) {
        dest[i+1] |= (i==5 ? src[7] & 0x7f : src[i+2]) >> 3;
        dest[i] |= (src[i+2] & 0x1f) << 5;
    }
    dest[0] |= src[1] >> 3;
}

No rounding problems.

Reading an intel long double into a IEEE-754 quad double (__float128) is similar, but the difference counts.

cvt_num10_num16(unsigned char *dest, const unsigned char *src)
{

    int expo, i;
    memset(dest, 0, 16);
    dest[15] = src[9]; /* sign + exp */
    dest[14] = src[8];
    expo = ((src[9] & 0x7f)<< 8 | src[8]);
    expo -= 16383;
    /* On Intel expo 0 is allowed */
    if (expo <= 0)     /* underflow */
        return;
    if (expo > 0x7ff)  /* overflow, inf/nan */
        return;
    /* shortcut the zero mantissa check */
#if __WORDSIZE == 64
    if (*(const uint64_t*)src != 0x8000000000000000LU)
#else
    if (*(const uint32_t*)src || *(const uint32_t*)&src[4] != 0x80000000U)
#endif
    {
      for (i = 13; i > 5; i--) {
          dest[i] |= ((i==13 ? src[7] & 0x7f : src[i-5]) << 1)
                  | (src[i-6] & 0x7f) >> 7;
      }
    }
    ROUND_TO((__float128)*dest, __float128, 18);
}

Need to properly round the result into the Intel LDBL_DIG precision (18). Cut off the rest.

Powerpc 16-byte LONG DOUBLE aka "double-double"

With -mlong-double-128 a long double on ppc32 or ppc64 is stored in 16 bytes. It simple stores two double numbers, a "head" and a "tail", one after another. The head being rounded to the nearest double and the tail containing the rest.

It stores 106 bits significants (2*53), but the range is limited to the double format, 11-bit. It can preserve 31 decimal digits, LDBL_DIG. log10(2**106) ≈ 31.909 decimal digits

Reading such a number is trivial, just return the sum of the two 8-byte doubles. There is only one special case: -0.0

cvt_num16ppc_num8(unsigned char *dest, const unsigned char *src)
{
    double d1, d2;
    long double ld;
    memcpy(&d1, src, 8);
    memcpy(&d2, src+8, 8);
    ld = (d2 == -0.0 && d1 == 0.0) ? -0.0 : d1 + d2;
    d1 = (double)ld;
    memcpy(dest, &d1, 8);
}

Converting a foreign floating-point number to this format is also trivial. You don't have to care about splitting up the number into two, you can simply cast it.

cvt_num10_num16ppc(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    cvt_num10_num8((unsigned char *)&d, src);
ld = (long double)d;
    ROUND_TO(ld, long double, 18);
    memcpy(dest, &ld, 16);
}

You just have to take care of proper rounding, as with every number read into a more precise format. I hardcoded 18 here as a ppc does not know about intel long doubles.

16-byte quadruple double / IEEE-754 binary128

This is a very rare format, native long double on sparc64 and S/390 CPUs, and SW simulated since GCC 4.6 as __float128. I have no idea why Intel did not adopt that yet. Sparc has %q0 registers since V8 (1992), but a sparc64 has no direct math support in HW. S/390 G5 supports it since 1998.

It uses 16 byte, 128 bit. 15 bits for the exponent as with the intel 80bit long double, and 112 bits for the mantissa. It can preserve 34 decimal digits, FLT128_DIG.

   sign   1  bit 127
   exp   15 bits 126-112   bias 16383
   frac 112 bits 111-0     (113 bits precision implicit)

+[15]---+[14]---+[13]---+[12]---+[11]---+[10]---+[9] .. +[0]----+
S|      exp     |             fraction                          |
+-------+-------+-------+-------+-------+-------+--- .. +-------+
1|<-----15----->|<---------------------112 bits---------------->|
<--------------------------128 bits----------------------------->

Precision: log10(2**113) ≈ 34.016 decimal digits. The first fraction bit is assumed to be 1 unless the exponent is 0x7fff.

s: significand

e=0x0,    s=0:  => +-0.0
e=0x7fff, s=0:  => +-Inf
e=0x7fff, s!=0: => NaN

3fff 0000 0000 0000 0000 0000 0000 0000   = 1
c000 0000 0000 0000 0000 0000 0000 0000   = -2
7fff 0000 0000 0000 0000 0000 0000 0000   = Inf
3ffd 5555 5555 5555 5555 5555 5555 5555   ≈  1/3

Since this format uses the same exponents as an intel long double this conversion is trivial. Just the intel normalization bit must be set.

cvt_num16_num10(unsigned char *dest, const unsigned char *src)
{
    memset(dest, 0, sizeof(long double));
    /* simply copy over sign + exp */
    dest[8] = src[15];
    dest[9] = src[14];
    /* and copy the rest */
    memcpy(&dest[0], &src[0], 8);
    dest[7] |= 0x80;  /* but set integer bit 63 */
}

No rounding problems.

Maybe I'll find time to find the remaining deviations for MIPS and AIX long double formats. I do not care about old non-IEEE IBM extended precision formats, such as on S/360 and S/370 machines.

Further Hacks

If you are having fun with this post you seriously want to look at this hack: http://blog.quenta.org/2012/09/0x5f3759df.html which uses the float format for a fast inverse square root function.

This is the function from Quake McCarmack fame that bit-casts a floating-point value to an int, does simple integer arithmetic, and then bit-casts the result back:

int i = * (int*)&x; // evil floating point bit level hacking
i = 0x5f3759df - (i >> 1); // what the fuck?
x = * (float*)&i;
x = x * (1.5F - (x * 0.5F * x * x);

The history and correct code is here (turns out 0x5f375a86 is better) and more tricks are here

native_pbc in parrot revived (numbers part1)

The design for parrot, the vm (virtual machine) under rakudo (perl6), envisioned a platform and version compatible, fast, binary format for scripts and modules. Something perl5 was missing. Well, .pbc and .pmc from ByteLoader serves this purpose, but since it uses source filters it is not that fast.

Having a binary and platform independent compiled format can skip the parsing compiling and optimizing steps each time a script or module is loaded.

Version compatiblity was broken with the 1.0 parrot release, that's why I left the project in protest a few years ago. Platform compatibility is still a goal but seriously broken, because the tests were disabled, and nobody cared.

Since I have to wait in perl5 land until p5p can decide and discuss on a syntax for the upcoming improvements which I can then implement in the type optimizers and the static B::CC compiler, I went back to parrot. p5p needs a few years to understand the most basic performance issues first. The basic obstacles in parrot were gone, parrot is almost bug free and has most features rakudo needs, but is lacking performance.

Platform compatibility

So I tried to enable platform compatibility again. I wrote and fixed most of the native_pbc code several years ago until 1.0, and only a little bit of bitrot crept in. Platform-compatible means, any platform can write a pbc and any other platform should be able to read this format. Normally such a format would require a fixed format, not so .pbc. The pbc format is optimized for native reads and writes, so all integers, pointers and numbers are stored in native format, and when you try to open such a file on a different platform converters will try to read those 3 types. integers and pointers can be 4 or 8 byte, little or big endian. This is pretty easy to support.

The problem comes with numbers. Supported until now was double and the intel specific long double format. The main problem is that the intel long double format is a tricky and pretty non-standard format. It has 80 bits, which is 10 bytes, but the numbers are stored with padding bytes, 12 byte on 32-bit and 16 byte on 64-bit. 2 bytes or 6 bytes padding. Here Intel got padding right but in the normal compiler ABI Intel is the only processor which does not care about alignment. Which leads to slow code, and countless alignment problems with fast SSE integer code. Most other processors require stricter alignment to be able to access ints and numbers faster. Intel code is also not easy to compile on better processors, because they fail on unaligned access. You cannot just access every single byte in a stream at will. At least you should not.

As it turns out sparc64 and s390 (AIX) uses for long double the IEEE-754 standard quad double 16-byte binary format, which is the best so far we can get, GCC since 4.6 supports the same format as __float128 (via the quadmath library), and finally powerpc has its own third long double format with -mlong-double-128, which is two normal 8-byte double one after another, and the result is the sum of the two, "head" and "tail". It's commonly called ppc "double-double". For smaller devices the typical format is single float, 4 bytes. Thanksfully in IEEE-754 standard format. All compilers can at least read and write it. But when it comes to va_arg() accessing ... arguments from functions, gcc fails to accept float.

So after rewriting the test library I still found some bugs in the code.

So I fixed a lot of those old bugs, esp. various intel long double confusions: with the padding bytes, 12 or 16 bytes, and a special normalize bit at 63, which is always 1 when a valid number was written to disc. So when reading such a number this bit is not part of the mantissa. Documentation for these formats was also wrong. And I added support for all missing major number formats to parrot, float, double, long double in various variants: FLOATTYPE_10 for intel, FLOATTYPE_16PPC for the powerpc double-double format, and finally FLOATTYPE_16 for IEEE-754 quadmath, i.e. __float128 or sparc64/s390 long double.

sparc64

The biggest obstacle for progress was always the lack of a UltraSparc to test the last number format. As it turns out a simple darwin/ppc Powerbook G4 was enough to generate all needed formats, together with a normal Intel multilib linux. My colleague Erin Schoenhals gave me her old powerbook for $100. The Powerbook could generate float, double, long double which is really a 16ppc double-double and gcc 4.6 could generate __float128, which is the same format as a 64bit sparc long double.

Good enough tests

One important goal was a stable test suite, that means real errors should be found, invalid old .pbc files should be skipped (remember, pbc is not version compatible anymore) and numbers only differing in natural precison loss while converting a number should be compared intelligently. Interestingly there does not even exist a good perl5 Test::More or Test::Builder numcmp method to compare floating point numbers in the needed precision. There is a Test::Number::Delta on CPAN, but this was not good enough. It only uses some epsilon, not the number of valid precision digits, and the test is also numerically not stable enough. And really, number comparisons should be in the standard. I added a Test::Builder::numcmp method locally. It works on lines of strings, but could be easily changed to take an arrayref and single number also.

Expected precision loss

So what is the expected precision loss when reading e.g. a float with intel long double? A float claims to hold 7 digits without loss, FLT_DIG, so such a conversion should keep 7 digits precision, and the test only needs to test the 7 first digits. The precision holds 24 bit, log10(2**24) ≈ 7.225 decimal digits. So 123456789.0 stored as float, converted to long double needs to be compared with something like /^1234567\d\*/ if done naively. It can be 123456755.0 or any other number between 123456700.0 and 123456799.4. Better round the last significant digit.

But first at all, what is the right precision to survive a number -> string -> number round trip? Numbers need to be sprintf-printed precise enough and need to be restored from strings precise enough. Printing more digits than supported will lead to unprecise numbers when being read back, and the same when printing not enough digits. The C library defines various defines for this number: FLT_DIG=7, DBL_DIG=16, LDBL_DIG=18, FLT128_DIG=34. But better than trusting your C library vendor is a configure probe, now in auto::format. So parrot outsmarts perl5 now by testing for the best and most precise sprintf format to print numbers. As experimentally found out, this number is usually one less than the advertised *_DIG definition. double uses %.15g, not %.16g, float uses %.6g, and so on. But this might vary on the used CPU and C library. Before parrot used hardcoded magic numbers. And wrongly.

One might say, why bother? Simply stringify it when exporting it. Everything is already supported in your c library. Two counter arguments:

  1. Fixed-width size. Native floats are easily stored in fixed-width records, strings not. Accessing the x-th float on disc is significantly faster with fixed size, and native floats are also significantly smaller than strings.

  2. Precision loss: With stringification you'll loose precision. In my configure probe I verified that we always loose the last digit. The previous code in imcc had this loss hardcoded, 15 instead of 16.

Storage

parrot's Configure also checks now for the native floattype and its size. Before a pbc header only checked the size of a number, now the type is different from the size. The size of long double can be 10, 12, or 16 and can mean completely different binary representations.

As next improvement, parrot used to store the parrot version triple in the ops library header inside the pbc format. But whenever a ops library changed, the other version number needs to be changed, the PBC_COMPAT version number, or simply the bytecode version. This needs to be done for format changes and a change of native ops. Because parrot stores and accesses ops only by index, not by name, and sorts its ops on every change. This was my main critic when I left parrot with 1.0. Because it was never thought this way. Old ops should be readable by newer parrots, just newer ops cannot not be understood. So new ops need to be added to the end.

So now the bytecode version is stored in the ops library header, and newer parrot versions with the same bytecode version can still read old pbc files. Older bytecode versions not yet, as it needs to revert the policy change from v1.0, back to pre-v1.0.

mk_native_pbc

The script to generate the native pbc on every PBC_COMPAT change was pretty immature. I wrote it several years ago. I rewrote it, still as shell script, but removed all bashisms, and enabled generating and testing all supported floting point formats in one go with custom perl Configure options tools/dev/mk_native_pbc [--my-config-options...], or when called with tools/dev/mk_native_pbc --noconf just generate and test the current configuration.

Tests again

As it turns out the tested numbers were also horrible. Someone went the easy way and tested only some exponents in the numbers, but the mantissas were always blank zeros. Numbers can be signed (there's one to two sign bits in the format), there can be -0.0, -Inf, Inf, NaN, and the mantissa is sometimes tricky to convert between various formats. The new number test has a now some such uncommon numbers to actually test the converters and expected precision loss.

Too much?

With the 5 types - 4 (float), 8 (double), 10 (intel long double), 16ppc, and 16 (float128) - and little<->big endian, there is combinatorial explosion in the number of converters. So I removed 50% of them by converting endian-ness beforehand, some of the easy conversion are best done by compiler casts whenever the compiler supports both formats, 16ppc conversions are pretty trivial to do, so there are only a few tricky conversions left. Mainly with the intel long double format. The 5*4 converters are still linked function pointers, assigned at startup-time. So it's maintainable and fast.

Optimizations

More optimizations were done by using more than single byte operations, such as builtin native bswap operations (also a new probe), and int16_t, int32_t and int64_t copy and compare ops. perl5 is btw. also pretty unoptimized in this regard. Lots of unaligned single-byte accesses. The worst of all scripting languages as measured by AddressSanitizer. A typical register is 32bit or 64 bit wide, the whole width should be used whenever possible. For the beginning the perl5 hash function is only fast on 32bit cpus. Fast checks could trade speed for size, not to bitmask every single bit. Maybe combine the most needed bits into an aligned short. But as long as there are unhandled really big optimization goals (functions, method calls, types, const) these micro optimizations just stay in my head.

Code on https://github.com/parrot/parrot/commits/native_pbc2

In a followup post I'll explain for the general community reading binary representations of numbers. Reading foreign floats would even deserve a new C library.

About Reini Urban

user-pic Working at cPanel on B::C (the perl-compiler), p2, types, parrot, B::Generate, cygwin perl and more guts (LLVM, jit, optimizations).