Reading binary floating-point numbers (numbers part2)

As explained in my previous blog post about parrot and numbers, parrot writes floating-point numbers in its native format and reads foreign floating-point numbers in their different formats.

What kind of floating-point formats exist?

I'm only studying the commonly used base 2 encodings here. The base 10 encodings decimal32, decimal64 and decimal128 also exist.

IEEE-754 defines half-precision (binary16), float (binary32), double (binary64) and quad float (binary128). It does not define the most popular long double format, especially not the intel extended precision format, which you would normally associate with long double. There is an IEEE-754 long double, binary128, but it is only the native long double on sparc64 and s390.

And since the IEEE-754 long double is almost never used and hard to implement in silicon, other architectures deviated wildly as well.

  • Intel uses 80 bits (10 bytes) for its 12- or 16-byte long double. A completely different representation from IEEE-754.

  • Powerpc uses two 8-byte doubles for its 16-byte long double; the value is the sum of the two: "double-double".

  • MIPS uses a different binary format to represent NaN and Inf.

  • I'm not so sure yet about AIX/S390 NaN.

  • sparc64 implements IEEE-754 quad float (binary128) properly and can store it in its %q registers, but the arithmetic is done in SW.

I am choosing the little-endian representation here, but big-endian uses the same algorithm and code. When reading a different endianness, just byteswap before doing the conversion.

4-byte single FLOAT / IEEE-754 binary32

This is single precision, only used on tiny machines. It is not even fast to compute unless your HW is optimized for float; double is normally faster than float, since double is what the HW supports.

It uses 4 bytes, 32 bits: 8 bits for the exponent and 23 bits for the mantissa. It can preserve 7-9 decimal digits.

   sign    1 bit  31
   exp     8 bits 30-23     bias 127
   frac   23 bits 22-0

+[3]----+[2]----+[1]----+[0]----+
S|  exp  |   fraction           |
+-------+-------+-------+-------+
1|<--8-->|<---23 bits---------->|
<-----------32 bits------------->

The so-called significand is the 23 fraction bits plus an implicit leading bit, which is always 1 unless the exponent bits are all 0. So the total precision is 24 bits: log10(2**24) ≈ 7.225 decimal digits.

s: significand

e=0x0,  s=0:  => +-0.0
e=0xff, s=0:  => +-Inf
e=0xff, s!=0: => NaN

It is particularly odd that the sign + exponent do not align to a byte boundary: the exponent overlaps into the top bit of the second-highest byte. So you have to mask off the exponent and the fraction.

A simple conversion to double is best done as a compiler cast; every compiler can handle float and double.

void
cvt_num4_num8(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&f, src, 4);
    d = (double)f;
    ROUND_TO(d, double, 7);
    memcpy(dest, &d, 8);
}

This is a problematic case. double has more precision than float, so the result needs to be rounded to 7-9 digits.

8-byte DOUBLE float / IEEE-754 binary64

This is double precision, the most popular format. It uses 8 bytes, 64 bits: 11 bits for the exponent and 52 bits for the mantissa. It can preserve 15-17 decimal digits (DBL_DIG is 15).

   sign    1 bit  63
   exp    11 bits 62-52     bias 1023
   frac   52 bits 51-0      (53 bit precision implicit)

+[7]----+[6]----+[5]----+[4]----+[3]----+[2]----+[1]----+[0]----+
S|   exp   |                  fraction                          |
+-------+-------+-------+-------+-------+-------+-------+-------+
1|<---11-->|<---------------------52 bits---------------------->|
<---------------------------64 bits----------------------------->

Precision: log10(2**53) ≈ 15.955 decimal digits. The first significand bit is an implicit 1 unless the exponent is 0.

s: significand

e=0x0,   s=0:  => +-0.0
e=0x7ff, s=0:  => +-Inf
e=0x7ff, s!=0: => NaN

3ff0 0000 0000 0000   = 1
4000 0000 0000 0000   = 2
8000 0000 0000 0000   = -0
7ff0 0000 0000 0000   = Inf
fff0 0000 0000 0000   = -Inf
3fd5 5555 5555 5555   ~ 1/3

Read into single float:

void
cvt_num8_num4(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&d, src, 8);
    f = (float)d;
    memcpy(dest, &f, 4);
}

No rounding problems.

Read into native long double:

void
cvt_num8_numld(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    memcpy(&d, src, 8);
    ld = (long double)d;
    memcpy(dest, &ld, sizeof(long double));
}

No rounding problems, as the compiler cast handles that. Note that "native long double" can be one of up to 6 different binary representations: i386, amd64/ia64, ppc, mips, aix or sparc64/s390.

80-bit, 10-byte intel extended precision LONG DOUBLE

This is stored as 12 bytes on i386 or 16 bytes on x86_64 and itanium; internally, however, the format is still the old x87 extended precision 10-byte format.

It uses 10 bytes, 80 bits: 15 bits for the exponent, and an explicit integer bit plus 63 bits for the mantissa. It can preserve 18-21 decimal digits (LDBL_DIG is 18).

   padding 2 or 4 byte (i386/x86_64)
   sign    1 bit  79
   exp    15 bits 78-64     bias 16383
   intbit  1 bit  63        set if normalized
   frac   63 bits 62-0

+[11]---+[10]---+[9]----+[8]----+[7]----+[6] ...+[1]----+[0]----+
|   unused      |S|     Exp     |i|          Fract              |
+-------+-------+-------+-------+-------+--- ...+-------+-------+
|<-----16------>|1|<-----15---->|1|<---------63 bits----------->|
<-------------->|<----------------80 bits----------------------->

Precision: log10(2**63) ≈ 18.965 decimal digits. The first significand bit is explicit here, not hidden as in the other formats.

s: significand = frac. Looks like the Norwegian Fjord designer also helped out here. Note that this was a private clean-room design, not design by committee.

e=0x0,    i=0, s=0:  => +-0.0
e=0x0,    i=0, s!=0: => denormal
e=0x0,    i=1:       => pseudo denormal (read, but not generated)
e=0x7fff, bits 63,62=00, s=0  => old +-Inf, invalid since 80387
e=0x7fff, bits 63,62=00, s!=0 => old NaN, invalid since 80387
e=0x7fff, bits 63,62=01:      => old NaN, invalid since 80387
e=0x7fff, bits 63,62=10, s=0  => +-Inf
e=0x7fff, bits 63,62=10, s!=0 => NaN
e=0x7fff, bits 63,62=11, s=0  => silent indefinite NaN (internal Inf, 0/0, ...)
e=0x7fff, bits 63,62=11, s!=0 => silent NaN

i=0: => denormal (read, but not generated)
i=1: => normal

Reading this number into a double is tricky:

void
cvt_num10_num8(unsigned char *dest, const unsigned char *src)
{
    int expo, i, sign;
    memset(dest, 0, 8);
    /* exponents 15 -> 11 bits */
    sign = src[9] & 0x80;
    expo = ((src[9] & 0x7f)<< 8 | src[8]);
    if (expo == 0) {
      nul:
        if (sign)
            dest[7] |= 0x80;
        return;
    }
    expo -= 16383;       /* - bias long double */
    expo += 1023;        /* + bias for double */
    if (expo <= 0)       /* underflow */
        goto nul;
    if (expo > 0x7ff) {  /* inf/nan */
        dest[7] = 0x7f;
        dest[6] = src[7] == 0xc0 ? 0xf8 : 0xf0 ;
        goto nul;
    }
    expo <<= 4;
    dest[6] = (expo & 0xff);
    dest[7] = (expo & 0x7f00) >> 8;
    if (sign)
        dest[7] |= 0x80;
    /* long double frac 63 bits => 52 bits
       src[7] &= 0x7f; reset intbit 63 */
    for (i = 0; i < 6; ++i) {
        dest[i+1] |= (i==5 ? src[7] & 0x7f : src[i+2]) >> 3;
        dest[i] |= (src[i+2] & 0x1f) << 5;
    }
    dest[0] |= src[1] >> 3;
}

No rounding problems.

Reading an intel long double into an IEEE-754 quad float (__float128) is similar, but the differences matter.

void
cvt_num10_num16(unsigned char *dest, const unsigned char *src)
{
    int expo, i;
    __float128 f;
    memset(dest, 0, 16);
    dest[15] = src[9]; /* sign + exp, same 15 bits and bias 16383 */
    dest[14] = src[8];
    expo = ((src[9] & 0x7f)<< 8 | src[8]);
    expo -= 16383;
    /* On Intel expo 0 is allowed */
    if (expo <= 0)     /* underflow */
        return;
    /* overflow cannot happen: both formats have 15 exponent bits,
       and inf/nan (expo 0x7fff) keeps its shifted fraction below */
    /* shortcut the zero mantissa check */
#if __WORDSIZE == 64
    if (*(const uint64_t*)src != 0x8000000000000000LU)
#else
    if (*(const uint32_t*)src || *(const uint32_t*)&src[4] != 0x80000000U)
#endif
    {
        /* shift the 63 fraction bits left by 1 into quad bits 111-49;
           the explicit integer bit 63 is shifted out, and the low
           49 quad fraction bits stay 0 */
        for (i = 13; i > 6; i--)
            dest[i] = (unsigned char)((src[i-6] << 1) | (src[i-7] >> 7));
        dest[6] = (unsigned char)(src[0] << 1);
    }
    /* round to the intel LDBL_DIG precision (18), cut off the rest */
    memcpy(&f, dest, 16);
    ROUND_TO(f, __float128, 18);
    memcpy(dest, &f, 16);
}

Need to properly round the result into the Intel LDBL_DIG precision (18). Cut off the rest.

Powerpc 16-byte LONG DOUBLE aka "double-double"

With -mlong-double-128 a long double on ppc32 or ppc64 is stored in 16 bytes. It simply stores two double numbers, a "head" and a "tail", one after the other: the head is the value rounded to the nearest double, and the tail contains the rest.

It stores a 106-bit significand (2*53), but the range is limited to that of the double format, with its 11-bit exponent. It can preserve 31 decimal digits, LDBL_DIG: log10(2**106) ≈ 31.909 decimal digits.

Reading such a number is trivial: just return the sum of the two 8-byte doubles. There is only one special case: -0.0.

void
cvt_num16ppc_num8(unsigned char *dest, const unsigned char *src)
{
    double d1, d2;
    long double ld;
    memcpy(&d1, src, 8);
    memcpy(&d2, src+8, 8);
    /* d1 == -0.0 would also be true for +0.0: test the raw sign bit */
    ld = (d1 == 0.0 && d2 == 0.0 && (src[7] & 0x80)) ? -0.0 : d1 + d2;
    d1 = (double)ld;
    memcpy(dest, &d1, 8);
}

Converting a foreign floating-point number to this format is also trivial. You don't have to care about splitting the number into head and tail, you can simply cast it.

void
cvt_num10_num16ppc(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    cvt_num10_num8((unsigned char *)&d, src);
    ld = (long double)d;
    ROUND_TO(ld, long double, 18);
    memcpy(dest, &ld, 16);
}

You just have to take care of proper rounding, as with every number read into a more precise format. I hardcoded 18 here, as a ppc does not know about intel long doubles.

16-byte quadruple precision float / IEEE-754 binary128

This is a very rare format: the native long double on sparc64 and S/390 CPUs, and SW-emulated as __float128 since GCC 4.6. I have no idea why Intel has not adopted it yet. Sparc has had %q registers since V8 (1992), but sparc64 has no direct math support for it in HW; the S/390 G5 has supported it in HW since 1998.

It uses 16 bytes, 128 bits: 15 bits for the exponent, as with the intel 80-bit long double, and 112 bits for the mantissa. It can preserve 33-36 decimal digits (FLT128_DIG is 33).

   sign   1  bit 127
   exp   15 bits 126-112   bias 16383
   frac 112 bits 111-0     (113 bits precision implicit)

+[15]---+[14]---+[13]---+[12]---+[11]---+[10]---+[9] .. +[0]----+
S|      exp     |             fraction                          |
+-------+-------+-------+-------+-------+-------+--- .. +-------+
1|<-----15----->|<---------------------112 bits---------------->|
<--------------------------128 bits----------------------------->

Precision: log10(2**113) ≈ 34.016 decimal digits. The first significand bit is an implicit 1 unless the exponent is 0.

s: significand

e=0x0,    s=0:  => +-0.0
e=0x7fff, s=0:  => +-Inf
e=0x7fff, s!=0: => NaN

3fff 0000 0000 0000 0000 0000 0000 0000   = 1
c000 0000 0000 0000 0000 0000 0000 0000   = -2
7fff 0000 0000 0000 0000 0000 0000 0000   = Inf
3ffd 5555 5555 5555 5555 5555 5555 5555   ≈  1/3

Since this format uses the same exponent encoding as an intel long double, the conversion is simple: copy the sign and exponent over, shift the fraction into place and set the explicit intel integer bit.

void
cvt_num16_num10(unsigned char *dest, const unsigned char *src)
{
    int i;
    memset(dest, 0, sizeof(long double));
    /* simply copy over sign + exp, both use bias 16383 */
    dest[9] = src[15];
    dest[8] = src[14];
    /* quad fraction bits 111-49 => intel fraction bits 62-0,
       i.e. shift the fraction right by one bit */
    dest[7] = 0x80 | (src[13] >> 1);  /* set the explicit integer bit 63 */
    for (i = 6; i >= 0; i--)
        dest[i] = (unsigned char)((src[i+7] << 7) | (src[i+6] >> 1));
    /* the low 49 quad fraction bits are truncated */
}

No rounding problems.

Maybe I'll find time to work out the remaining deviations in the MIPS and AIX long double formats. I do not care about the old non-IEEE IBM extended precision formats, such as on the S/360 and S/370 machines.

Further Hacks

If you had fun with this post you seriously want to look at this hack: http://blog.quenta.org/2012/09/0x5f3759df.html which uses the float format for a fast inverse square root function.

This is the function of Quake/John Carmack fame that bit-casts a floating-point value to an int, does simple integer arithmetic, and then bit-casts the result back:

float xhalf = 0.5f * x;
int i = * (int*)&x;             // evil floating point bit level hacking
i = 0x5f3759df - (i >> 1);      // what the fuck?
x = * (float*)&i;
x = x * (1.5f - xhalf * x * x); // one Newton-Raphson iteration

The history and correct code is here (turns out 0x5f375a86 is better) and more tricks are here


About Reini Urban

Working at cPanel on B::C (the perl-compiler), p2, types, parrot, B::Generate, cygwin perl and more guts (LLVM, jit, optimizations).