<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Reini Urban</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/rurban/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/rurban//39</id>
    <updated>2013-05-08T20:42:48Z</updated>
    <subtitle>compiling, debugging, generating, optimizing (breaking stuff)</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>Reverse debugging with gdb</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/05/reverse-debugging-with-gdb.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4659</id>

    <published>2013-05-08T20:27:26Z</published>
    <updated>2013-05-08T20:42:48Z</updated>

    <summary>I finally got gdb&apos;s reverse debugging to work. That means I can step back in time, to the previous lines and back to the callers, not only up the backtrace. It should have worked since version 7.0 but...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>I finally got gdb's reverse debugging to work.
That means I can step back in time, to the previous lines and back to the callers,
not only up the backtrace.</p>

<p>It should have worked since version 7.0, but I only got it working today.
The first few times I was annoyed by this error:</p>

<pre><code>Breakpoint 1, ...
(gdb) target record
(gdb) rn
Target child does not support this command.
</code></pre>

<p>Hmm... I knew nothing about target "child" and I <a href="http://sourceware.org/gdb/news/reversible.html">read</a> that those should work fine: Native i386-linux ('target record'), Native amd64-linux ('target record')</p>

<p>This happened because I used the gdb command <strong>run</strong>.
I have now changed my <strong>.gdbinit</strong> to use <strong>start</strong> and <strong>continue</strong> with proper breakpoints, and <strong>rn</strong> (reverse-next, aka previous) works fine. I feel stupid.</p>

<p>$ cat .gdbinit</p>

<pre><code>set breakpoint pending on
start
b potion_vm
b potion_jit_proto
continue
target record-core
b core/vm.c:508
continue

define pd
  call potion_dump(P, $arg0)
end
</code></pre>

<p>$ make -s bin/potion-s &amp;&amp; gdb-7.6 --args bin/potion-s -Dctv -B test/closures/default1.pn
    ...</p>

<pre><code>(gdb) rn
</code></pre>

<p>and now I can go backwards from my breakpoint.</p>

<p>Note that <em>record-full</em> slows things down a lot for me, even on a tiny program, so I only use <em>record-core</em>, and only after skipping the initialization. So I set a first breakpoint, turn on recording there and continue from that point.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Using kcachegrind on potion</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/04/using-kcachegrind-on-potion.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4623</id>

    <published>2013-04-28T15:55:58Z</published>
    <updated>2013-04-28T17:15:14Z</updated>

    <summary>cachegrind gives you information on the callstack and callcount, dependencies and efficiency. You can easily see hotspots in your code. I use it to check the JIT and objmodel efficiency of potion, which is the vm for p2. See my...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>cachegrind gives you information on the callstack and callcount, dependencies and efficiency. You can easily see hotspots in your code.</p>

<p>I use it to check the JIT and objmodel efficiency of potion, which is the vm for <a href="http://perl11.org/p2/">p2</a>.</p>

<p>See my first post today <a href="http://blogs.perl.org/users/rurban/2013/04/install-kachegrind-on-macosx-with-ports.html">Install kcachegrind on MACOSX with ports</a> if you are on a Mac.</p>

<h2>cachegrind</h2>

<p>The first run with:</p>

<pre><code>$ make bin/potion-s
$ valgrind --tool=callgrind -v --dump-every-bb=10000000 bin/potion-s example/binarytrees.pn
Ctrl-C
</code></pre>

<p>generates this sample</p>

<pre><code> $ open qcachegrind
</code></pre>

<p>Open one of the generated callgrind.out.pid.num files.</p>

<p><a href="http://blogs.perl.org/users/rurban/assets_c/2013/04/qcachegrind-1246.html" onclick="window.open('http://blogs.perl.org/users/rurban/assets_c/2013/04/qcachegrind-1246.html','popup','width=1284,height=699,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blogs.perl.org/users/rurban/assets_c/2013/04/qcachegrind-thumb-560x304-1246.jpg" width="560" height="304" alt="qcachegrind.jpg" class="mt-image-none" style="" /></a></p>

<p>This code creates a lot of objects (potion_object_new). potion_object_new is called by the JIT, which is not instrumented because we did not use --dump-instr=yes (with it, click on machine code in the lower right pane), and it spends most of its time allocating memory in the GC (potion_gc_alloc). 20-30% of the run time is spent in the GC (potion_gc_mark_major and the other gc calls).</p>

<h2>Instruments</h2>

<p>Compare that to the Apple XCode tool Instruments:</p>

<pre><code>$ open /Applications/Xcode.app/Contents/Applications/Instruments.app
</code></pre>

<ul>
<li>Select the Instrument "Time Profile"</li>
<li>Select the target "bin/potion-s"</li>
<li>Adjust the target settings with "Edit Active Target"</li>
</ul>

<p>I used the args "-B example/binarytrees-list.pn" to run the bytecode vm, not the default jit,
and I needed to set the working directory, because the example is under the root.</p>

<p><a href="http://blogs.perl.org/users/rurban/assets_c/2013/04/Instruments-1248.html" onclick="window.open('http://blogs.perl.org/users/rurban/assets_c/2013/04/Instruments-1248.html','popup','width=1123,height=782,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blogs.perl.org/users/rurban/assets_c/2013/04/Instruments-thumb-560x389-1248.jpg" width="560" height="389" alt="Instruments.jpg" class="mt-image-none" style="" /></a></p>

<p>This is a different sample, without the JIT, and using arrays (tuples) instead of objects. This way you can see any possible object overhead. And we see that most of the time in tuple_push (add an array value) is spent not in alloc but in realloc, an area where the GC should shine.
But realloc apparently causes a lot of GCs. Of the 39% in realloc, 30% triggers a full mark &amp; sweep phase (mark_major), not a small mark_minor phase. But we are dealing with fresh objects here, not hot objects.</p>

<h2>cachegrind graphviz</h2>

<p>With the graphviz extension the call graph is easier to see.
See the "Call Graph" tab in the lower right pane, which creates this graph.</p>

<p><a href="http://blogs.perl.org/users/rurban/assets_c/2013/04/obj_new_callgraph1-1254.html" onclick="window.open('http://blogs.perl.org/users/rurban/assets_c/2013/04/obj_new_callgraph1-1254.html','popup','width=1152,height=708,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blogs.perl.org/users/rurban/assets_c/2013/04/obj_new_callgraph1-thumb-560x344-1254.jpg" width="560" height="344" alt="obj_new_callgraph1.jpg" class="mt-image-none" style="" /></a></p>

<p>Here it is easier to see that, for all the object_new allocations, the GC needs 30% of the time in a major phase and 10% in a minor (shorter) phase.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Install kcachegrind on MacOSX with ports</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/04/install-kachegrind-on-macosx-with-ports.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4622</id>

    <published>2013-04-28T15:17:00Z</published>
    <updated>2013-04-29T15:19:26Z</updated>

    <summary>Well, you don&apos;t want to install kcachegrind with port. $ sudo port search cachegrind kcachegrind @0.4.6 (devel) KCachegrind - Profiling Visualization Because building KDE takes hours, and you won&apos;t need it other than for cachegrind. But there&apos;s a Qt variant...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Well, you don't want to install kcachegrind with port.</p>

<pre><code>$ sudo port search cachegrind
kcachegrind @0.4.6 (devel)
    KCachegrind - Profiling Visualization
</code></pre>

<p>That is because building KDE takes hours, and you won't need it for anything other than cachegrind.
But there's a Qt variant that comes with kcachegrind, called <strong>qcachegrind</strong>.
Maybe ports wants to use this variant. Or not, because <em>kdelibs3</em> is listed as a dependency.</p>

<pre><code>$ sudo port info kcachegrind
kcachegrind @0.4.6, Revision 1 (devel)
Variants:             universal

Description:          KCachegrind visualizes traces generated by profiling, including a tree map and a call
                      graph visualization of the calls happening. It's designed to be fast for very large
                      programs like KDE applications.
Homepage:             http://kcachegrind.sourceforge.net/

Library Dependencies: kdelibs3
Platforms:            darwin
License:              unknown
Maintainers:          nomaintainer@macports.org
</code></pre>

<pre><code>sudo port install kcachegrind
---&gt;  Computing dependencies for kcachegrind
Error: Unable to execute port: Can't install qt3 because conflicting ports are installed: qt4-mac
</code></pre>

<p>So there's an artificial conflict, qt4-mac is better than qt3, and you can easily build qcachegrind with qt4-mac.</p>

<p>port deps:</p>

<pre><code>sudo port install qt4-mac graphviz
</code></pre>

<p>This took only one minute.</p>

<p>Go to http://kcachegrind.sourceforge.net/html/Download.html
and download the source tarball kcachegrind-0.7.4.tar.gz.</p>

<pre><code>tar xfz ~/Downloads/kcachegrind-0.7.4.tar.gz
cd kcachegrind-0.7.4/
cd qcachegrind/
less README
qmake -spec 'macx-g++'
make
mv qcachegrind.app /Applications/
open /Applications/qcachegrind.app
</code></pre>

<p>And it works.</p>

<p>Only gprof and gcc profiling via -pg do not work. But that is another story. So far I use XCode Instruments with the Time Profiler.
See the next post <a href="http://blogs.perl.org/users/rurban/2013/04/using-kcachegrind-on-potion.html">Using kcachegrind on potion</a>.</p>

<p>On Linux I did the same, only with qmake without arguments, and then:</p>

<pre><code>sudo cp qcachegrind /usr/local/bin/
</code></pre>
]]>
        

    </content>
</entry>

<entry>
    <title>Idea - costmodel for B::Stats </title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/04/idea---costmodel-for-bstats.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4586</id>

    <published>2013-04-18T18:29:16Z</published>
    <updated>2013-04-19T13:01:01Z</updated>

    <summary>Co-workers often ask me, what is faster. This or this? Of course you can benchmark the real speed, but theoretically you can look at the optrees and predict what will be faster....</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Co-workers often ask me which is faster: this or that?</p>

<p>Of course you can benchmark the real speed, 
but theoretically you can look at the optrees and predict what will be faster.</p>
]]>
        <![CDATA[<p>E.g. accessing hash keys directly:</p>

<pre><code>$h-&gt;{k}              # ops: helem rv2hv rv2sv gv const
</code></pre>

<p>vs a lexical reference to the value:</p>

<pre><code>my $v = \$h-&gt;{k}     # ops: rv2sv padsv
</code></pre>

<p>If this is in a tight loop, and you want to change a lot of hash elems, 
the answer will be interesting.</p>

<pre><code>$ alias p=perl
$ p -MO=Concise -e'$h={1=&gt;0}; $h-&gt;{1}++; print $h-&gt;{1}'
</code></pre>

<p>vs</p>

<pre><code>$ p -MO=Concise -e'$h={1=&gt;0};my $x=\$h-&gt;{1}; $$x++; print $h-&gt;{1}, $$x'
</code></pre>

<p>1st variant directly: (<strong>helem rv2hv rv2sv gv const</strong>)</p>

<pre><code>p  &lt;@&gt; leave[t1] vKP/REFC -&gt;(end)
1     &lt;0&gt; enter -&gt;2
2     &lt;;&gt; nextstate(main 1 -e:1) v -&gt;3
9     &lt;2&gt; sassign vKS/2 -&gt;a
7        &lt;1&gt; srefgen sK/1 -&gt;8
-           &lt;1&gt; ex-list lKRM -&gt;7
6              &lt;@&gt; anonhash sKRM/1 -&gt;7
3                 &lt;0&gt; pushmark s -&gt;4
4                 &lt;$&gt; const(IV 1) s -&gt;5
5                 &lt;$&gt; const(IV 0) s -&gt;6
-        &lt;1&gt; ex-rv2sv sKRM*/1 -&gt;9
8           &lt;$&gt; gvsv(*h) s -&gt;9
a     &lt;;&gt; nextstate(main 1 -e:1) v -&gt;b
g     &lt;1&gt; preinc[t2] vK/1 -&gt;h
         vvvvvvvvvvvvvvvvv
f        &lt;2&gt; helem sKRM/2 -&gt;g
d           &lt;1&gt; rv2hv[t1] sKR/1 -&gt;e
c              &lt;1&gt; rv2sv sKM/DREFHV,1 -&gt;d
b                 &lt;$&gt; gv(*h) s -&gt;c
e           &lt;$&gt; const(IV 1) s -&gt;f
         ^^^^^^^^^^^^^^^^^^^^^^^^
h     &lt;;&gt; nextstate(main 1 -e:1) v -&gt;i
o     &lt;@&gt; print vK -&gt;p
i        &lt;0&gt; pushmark s -&gt;j
n        &lt;2&gt; helem sK/2 -&gt;o
l           &lt;1&gt; rv2hv[t3] sKR/1 -&gt;m
k              &lt;1&gt; rv2sv sKM/DREFHV,1 -&gt;l
j                 &lt;$&gt; gv(*h) s -&gt;k
m           &lt;$&gt; const(IV 1) s -&gt;n
</code></pre>

<p>2nd variant by ref: (<strong>rv2sv padsv</strong>)</p>

<pre><code>x  &lt;@&gt; leave[$x:1,end] vKP/REFC -&gt;(end)
1     &lt;0&gt; enter -&gt;2
2     &lt;;&gt; nextstate(main 1 -e:1) v -&gt;3
9     &lt;2&gt; sassign vKS/2 -&gt;a
7        &lt;1&gt; srefgen sK/1 -&gt;8
-           &lt;1&gt; ex-list lKRM -&gt;7
6              &lt;@&gt; anonhash sKRM/1 -&gt;7
3                 &lt;0&gt; pushmark s -&gt;4
4                 &lt;$&gt; const(IV 1) s -&gt;5
5                 &lt;$&gt; const(IV 0) s -&gt;6
-        &lt;1&gt; ex-rv2sv sKRM*/1 -&gt;9
8           &lt;$&gt; gvsv(*h) s -&gt;9
a     &lt;;&gt; nextstate(main 1 -e:1) v -&gt;b
i     &lt;2&gt; sassign vKS/2 -&gt;j
g        &lt;1&gt; srefgen sK/1 -&gt;h
-           &lt;1&gt; ex-list lKRM -&gt;g
f              &lt;2&gt; helem sKRM/2 -&gt;g
d                 &lt;1&gt; rv2hv[t2] sKR/1 -&gt;e
c                    &lt;1&gt; rv2sv sKM/DREFHV,1 -&gt;d
b                       &lt;$&gt; gv(*h) s -&gt;c
e                 &lt;$&gt; const(IV 1) s -&gt;f
h        &lt;0&gt; padsv[$x:1,end] sRM*/LVINTRO -&gt;i
j     &lt;;&gt; nextstate(main 2 -e:1) v -&gt;k
m     &lt;1&gt; preinc[t3] vK/1 -&gt;n
         vvvvvvvvvvvvvvvvv
l        &lt;1&gt; rv2sv sKRM/1 -&gt;m
k           &lt;0&gt; padsv[$x:1,end] sM/96 -&gt;l
         ^^^^^^^^^^^^^^^^^^^^^^^^
n     &lt;;&gt; nextstate(main 2 -e:1) v -&gt;o
w     &lt;@&gt; print vK -&gt;x
o        &lt;0&gt; pushmark s -&gt;p
t        &lt;2&gt; helem sK/2 -&gt;u
r           &lt;1&gt; rv2hv[t4] sKR/1 -&gt;s
q              &lt;1&gt; rv2sv sKM/DREFHV,1 -&gt;r
p                 &lt;$&gt; gv(*h) s -&gt;q
s           &lt;$&gt; const(IV 1) s -&gt;t
v        &lt;1&gt; rv2sv sK/1 -&gt;w
u           &lt;0&gt; padsv[$x:1,end] s -&gt;v
</code></pre>

<p>Well, 5 ops vs. 2, plus some assignment. But which ops are slow and which are fast? This depends a bit on the flags, but still.</p>
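
<p>Measuring is the direct way to answer that. A minimal sketch with the core Benchmark module, assuming the two variants from above inside a loop of 1000 increments. The loop count and the labels <em>direct</em> and <em>by_ref</em> are arbitrary choices for illustration:</p>

<pre><code>use strict;
use warnings;
use Benchmark qw(cmpthese);

# $$h{1} is the same element access as the arrow form used above.
our $h = +{ 1, 0 };

my %variants = (
    # look up the hash element on every increment
    'direct', sub { $$h{1}++ for 1 .. 1000 },
    # take a reference to the value once, then increment through it
    'by_ref', sub { my $x = \$$h{1}; $$x++ for 1 .. 1000 },
);

# Run each variant for at least one CPU second and print a comparison table.
my $rows = cmpthese(-1, \%variants);
</code></pre>

<p>The numbers depend on the perl version and the machine, so treat them as a check on the prediction from the optree, not as a replacement for it.</p>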

<p>This leads to the idea of writing a B or Devel module, like B::Speed or Devel::Speed, similar to B::Stats or Devel::Size, which applies a costmodel to each op for its run-time cost and returns a number estimating how fast the code will be at run-time.
Or it could just be an option for <strong>B::Stats</strong>, which counts the ops at compile-, end- and run-time.</p>

<p>Every compiler optimization step needs to know about the costs for each op, 
so it could be useful for B::CC.
The numbers can be taken from profiled code, or dtrace probes.</p>

<pre><code>p -MB::Stats -e'$h={1=&gt;0}; $h-&gt;{1}++; print $h-&gt;{1}'
</code></pre>

<p>vs  </p>

<pre><code>p -MB::Stats -e'$h={1=&gt;0};my $x=\$h-&gt;{1}; $$x++; print $$x'
</code></pre>
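
<p>As a rough sketch of the cost model idea: give every op a relative cost and sum the costs over the ops of each variant. The op lists are the ones marked in the two optrees above. The cost numbers are invented placeholders, not measurements, and <em>op_cost</em> is a hypothetical helper, not part of B::Stats.</p>

<pre><code>use strict;
use warnings;

# Hypothetical relative run-time cost per op. Real numbers would have to
# come from profiled code or dtrace probes.
my %cost = qw(helem 5 rv2hv 2 rv2sv 2 gv 1 const 1 padsv 1);

# Sum the costs of a list of op names; unknown ops count as 1.
sub op_cost {
    my $sum = 0;
    for my $op (@_) {
        $sum += exists $cost{$op} ? $cost{$op} : 1;
    }
    return $sum;
}

my @direct = qw(helem rv2hv rv2sv gv const);
my @by_ref = qw(rv2sv padsv);

printf "direct: %d\n", op_cost(@direct);
printf "by ref: %d\n", op_cost(@by_ref);
</code></pre>

<p>With these placeholder costs the direct variant sums to 11 and the reference variant to 3. A real model would also have to weigh the op flags.</p>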
]]>
    </content>
</entry>

<entry>
    <title>parser updates</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/04/parser-updates.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4558</id>

    <published>2013-04-12T17:59:43Z</published>
    <updated>2013-04-12T18:51:23Z</updated>

    <summary>I have worked on a new perl11 vm, p2, over the last few months. In some perl11 meetings we identified several problems with the current architecture, and came to similar results as the parrot discussions a decade before. Not only is the...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>I have worked on a new perl11 vm, p2, over the last few months. In some perl11 meetings we identified several problems with the current architecture, and came to similar results as the parrot discussions a decade before. </p>

<p>Not only is the VM (the bytecode interpreter) horribly designed, as previously observed by <a href="http://www.jilp.org/vol5/v5paper12.pdf">Gregg &amp; Ertl 2003</a>, the parser is also an untangleable and unmaintainable beast. And any future VM should be able to parse and run perl5 and perl6 together; that's why we reserved <em>use v6;</em> and <em>use v5;</em>.</p>

<p>Any new perl vm, such as parrot, nqp with the jvm or other backends, niecza or p2, needs to be able to parse both. perl6 cannot afford to leave perl5 aside, even if it's a much nicer language.
That's why parrot came up first with the PGE based parser framework, which made it super easy for other languages to target parrot in the first years.</p>

<p>Parrot's PGE library is based on peg - <strong>Parsing Expression Grammar</strong>, a new parser language, different from the old <em>yacc</em> or hand-written recursive descent parsers.
The only peg C library is <a href="http://piumarta.com/software/peg/">peg/leg</a> by Ian Piumarta, which is also used by <strong>p2</strong>, based on some extensions by <em>why the lucky stiff</em>, renamed to "<strong>greg</strong>" and subsequently used for other little languages also.
_why advanced it from 0.1.9 to 0.2.2 with potion, I advanced it to 0.2.3 for p2,
Amos Wenger advanced it to 0.4.3 for his <a href="https://github.com/nddrylliog/greg">ooc language</a> by adding error blocks and fixing some bugs, and today I advanced it to <a href="https://github.com/perl11/potion/commits/p2">0.4.5</a>.</p>

<p>This is only greg, the parser generator, not the p5 or p6 syntax itself.</p>

<p>Larry's perl6/std with the viv metacompiler contains the canonical <a href="https://github.com/perl6/std/blob/master/STD.pm6">Perl6 grammar</a> and now also a <a href="https://github.com/perl6/std/blob/master/STD_P5.pm6">Perl5 grammar</a>. Written in perl6, interpreted and compiled in perl5 (via viv).</p>

<p>Flavio Glock wrote hand-written p5 and p6 parsers for perlito, and those parsers really show off, as they look much nicer, more readable and more maintainable than table-driven parsers such as ours, whether based on yacc (perl5) or peg (std, p6, p2).</p>

<p>Using a PGE library means that you can interpret the parser statemachine at run-time, which easily allows parser extensions, so-called macros.
Using a standalone parser generator, such as yacc, marpa or greg/peg/leg, just generates C code for the parser statemachine, and extending such a statemachine dynamically is not yet a solved problem. Ian Piumarta and his <a href="http://piumarta.com/software/cola/">idst</a> crew work on the basis of such interpreted but efficient parsers, e.g. by jitting the statemachine as done in maru. That is something like a jitted regex engine, just a bit more advanced, as a regex is just a small subset of a general parser.</p>

<p>Building on that, I believe that the new regex engine should be built into the VM, in the way that the <a href="http://www.inf.puc-rio.br/~roberto/lpeg/">LPeg library</a> for Lua is a general PEG-based matcher that can also be used to implement the simpler pcre library.
I can really feel the pain of normal programmers who need to use the old-style perl, grep and sed regular expressions, while they could use a richer language as in LPeg or lisp matchers.</p>

<p>I am of the opinion that an extensible parser needs to be LR-based, like yacc, and not based on PEG and its ordered rules. Only with LR can you easily add rule alternatives without destroying the fragile order of evaluation in a PEG.
With a PEG you'd need to add the position of the new macro rule manually.</p>
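
<p>A small sketch of why the order matters. PEG alternatives are an ordered choice: the first alternative that matches wins and the rest are never tried. The function <em>ordered_choice</em> and the two keyword rules are hypothetical, chosen only to show the effect:</p>

<pre><code>use strict;
use warnings;

# Try the alternatives in order and commit to the first one that matches
# at the start of the input, as a PEG ordered choice does.
sub ordered_choice {
    my ($input, @alternatives) = @_;
    for my $alt (@alternatives) {
        return $alt if index($input, $alt) == 0;
    }
    return;
}

# The new rule "ifdef" is appended after the existing rule "if".
my $appended = ordered_choice('ifdef X', qw(if ifdef));
# The new rule is placed before "if" by hand.
my $placed   = ordered_choice('ifdef X', qw(ifdef if));

print "appended: $appended\n";   # the older rule "if" shadows the new one
print "placed:   $placed\n";     # the new rule "ifdef" gets its chance
</code></pre>

<p>Appending the new alternative leaves it unreachable for this input. It only matches once it is placed before the shorter rule, which is the manual positioning mentioned above.</p>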

<p>So using greg is effectively a dead end, and at some point I'd need to start extending yacc, adding a yacc library and a yacc run-time. That means something like hooking the generated statemachine into my vm, or jitting the states. The Java-based parser generators, like antlr, have a huge advantage there.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>no indirect considered harmful</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2013/02/no-indirect-considered-harmful.html" />
    <id>tag:blogs.perl.org,2013:/users/rurban//39.4382</id>

    <published>2013-02-26T19:47:27Z</published>
    <updated>2013-02-26T20:25:32Z</updated>

    <summary>Several p5p members argue that using indirect method call syntax is considered harmful. I argue that using indirect method call syntax is the best and sometimes only way to extend the language without changing core or the parser rules. method_name...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Several p5p members argue that using indirect method call syntax is considered harmful.</p>

<p>I argue that using indirect method call syntax is the best and sometimes only way to extend the language without changing core or the parser rules.</p>

<pre><code>method_name ClassName @args;
method_name $obj @args;
</code></pre>

<p>vs</p>

<pre><code>ClassName-&gt;method_name(@args);
$obj-&gt;method_name(@args);
</code></pre>

<p>E.g. mst argues in <a href="http://shadow.cat/blog/matt-s-trout/indirect-but-still-fatal/">"new Foo bad, 'no indirect' good"</a> that the parser is too dynamic in deciding if something is a valid method call or not.</p>

<p>He gives three examples:</p>

<p>(1) Is &lt;something&gt; a valid class name? If so, this is a method call.</p>

<pre><code>use Foo::Bar;
new Foo::Bar @args; # calls Foo::Bar-&gt;new(@args)
</code></pre>

<p>(2) Is bareword a known subroutine name? If so, this is a sub call.</p>

<pre><code>sub wotsit { ... }
wotsit { foo =&gt; 'bar', baz =&gt; 'quux' }; # calls wotsit({ ... })
</code></pre>

<p>(3) Stuff it, I'm guessing it's a method call.</p>

<pre><code>wotsit { foo =&gt; 'bar', baz =&gt; 'quux' }; # tries to call a method on a hashref
</code></pre>

<p>BOOM!</p>

<p>The problem is that those p5p hackers who are new to perl don't understand why Larry created this indirect method call syntax in the first place. It was to free the parser and core from defining new keywords, such as 'new' or 'delete', and to let the user create them at run-time.
A parser and a vm need to be extensible. And the resulting language still needs to
look natural.</p>

<p>The same logic applies to our not-yet-used type system. If you declare a lexical variable with a type between my and the name, perl will look up dynamically whether the type is an existing
class. Perl is a dynamic language, in which you can extend classes and types.</p>

<p>People who advocate using <code>Foo::Bar-&gt;new(@args)</code> over <code>new Foo::Bar @args</code> make it clear what they want, but they lose the people who look at a language from a broader view and do not care whether such a method is a method, a builtin keyword, or a function call. They look at the language, and <code>Foo::Bar-&gt;new(@args)</code> looks awful and backwards.</p>

<p>Without indirect method syntax you lose the ability to write code in a natural and understandable way.
There are corner cases in which the parser throws errors when a class or sub is not known, like in mst's example:
<code>Can't use string ("foo") as a subroutine ref while "strict refs" in use at Two.pm line 7</code>, because the parser is dynamic and does not know.
The addition of no indirect adds the error message:
<code>Indirect call of method "Two::two" on a block at One.pm line 8.</code></p>

<p>Using no indirect is a great way to understand warnings, such as use warnings or use diagnostics.
But arguing that the parser is wrong and this syntax should be deprecated is harmful.
People should learn a little bit about language history first, before they start destroying the parts they do not understand.</p>

<p>schwern got it right by arguing pro <code>use autobox</code> which extends the notion of indirect method calls by checking the type of each object, and then you are able to easily overload and add methods.</p>

<p>In my upcoming functional perl p2 prototype even most keywords are methods, such as if, elsif, when, while. There's no need for the parser to know a keyword if it knows the type and structure of the context. 'if' is a method of an expression, and the argument is the next block. print and all other perl keywords are not keywords anymore; they are methods implemented for various types. And they can easily be extended to handle more user types, such as bignum or complex support, PDL, FFI, ...</p>

<p>If you forbid indirect method syntax you cannot extend the language. </p>

<p>You increase coding safety by using unnatural but precise code. 
But then you can also argue for LISP or Forth or ML, which use a simple parser without too much syntax defined via keywords. With these languages it is at least possible to use macros to extend the language. Not so with perl. </p>
]]>
        

    </content>
</entry>

<entry>
    <title>Patch known perlcore ptr problems</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/11/patch-known-perlcore-ptr-problems.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.4047</id>

    <published>2012-11-15T16:57:13Z</published>
    <updated>2012-12-11T18:06:18Z</updated>

    <summary>App::perlall 0.27, a better perlbrew at CPAN for multiple global perls, now patches some of the known security problems with buffer-overflows and use-after-free errors for the perl production releases. E.g. perlall build 5.14.2-nt builds a patched non-threaded perl, with proper...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p><strong>App::perlall</strong> 0.27, a better perlbrew at <a href="http://search.cpan.org/dist/App-perlall">CPAN</a> for multiple global perls, now patches some of the known security problems with buffer-overflows and use-after-free errors for the perl production releases.</p>

<p>E.g. <code>perlall build 5.14.2-nt</code> builds a patched non-threaded perl, with proper entries in <code>patchlevel.h</code>. A proper 5.14.3d-nt-asan is not there yet.</p>

<p>I currently patch only 4 known errors for non-threaded perls from 5.10 to 5.16. The latest "security fix" 5.14.3, blead and threaded perls are in a worse shape. I will add more fixes to <strong>App::perlall</strong> for these perls later. The amount of work is overwhelming.
There are at least 2 more buffer-overflows and use-after-free errors which need to be backported.</p>

<p>The details are in a Devel::PatchPerl plugin called <a href="https://github.com/rurban/App-perlall/blob/master/lib/Devel/PatchPerl/Plugin/Asan.pm">Devel::PatchPerl::Plugin::Asan</a></p>

<p>Note that <a href="http://search.cpan.org/dist/Devel-PatchPerl">Devel::PatchPerl</a> is the modularisation of <a href="http://search.cpan.org/dist/Devel-PPPort/">Devel::PPPort</a>'s buildperl, which only patches perls to make them compile. The problem is that a build with <code>clang -faddress-sanitizer</code> does not get through when it detects overflows or use-after-free; it SEGVs. Which is good.</p>

<p>I find it rather troublesome that so-called maintenance perl security releases do not fix those errors (<em>they are typically ignored</em>, the special word is <em>warnocked</em>), and that some releases even add more security problems than they fix. E.g. 5.14.3-nt has three more such problems than 5.14.2-nt, which has only 3 known problems.</p>

<p>I also find it rather troublesome that the perl5 porters still do not use <code>clang -faddress-sanitizer</code> (now renamed to <code>-fsanitize=address</code>, as there is also a new <code>-fsanitize=thread</code> which we use for parrot) or at least <strong>valgrind</strong> or <strong>gcc mudflap</strong> to check their release candidates against reported pointer errors.</p>

<ul>
<li>5.10.0 - 5.14.3 RT#111586 sdbm.c off-by-one access to global .dir
acdbe25bd91bf897e0cf373b9</li>
<li>5.12.0 - 5.16.0 RT#72700 List::Util boot Fix off-by-two on string literal length</li>
<li>5.15.[4-9], 5.17.[0-6] RT#115702 overlapping memcpy in to_utf8_case</li>
<li>5.6.0 - 5.16.0 RT#111594 Socket::unpack_sockaddr_un heap-buffer-overflow</li>
</ul>

<p>Not yet patched:</p>

<ul>
<li>[RT #115992] heap-use-after-free in t/op/local.t
This is no regression, it came with an updated test in 5.14.3, triggered by:
BEGIN { eval '1' }
local $[;
Very low security impact</li>
<li>[RT #115994] regkind overflow. 
This is no regression, it came with an updated test in 5.14.3
Very low security impact</li>
</ul>

<p>For the record:</p>

<p><code>$ perlall build 5.14.3d-nt-asan</code></p>

<pre><code>Failed 23 tests out of 1965, 98.83% okay.
    ../cpan/Archive-Tar/t/02_methods.t
    ../dist/IO/t/io_file.t
    ../ext/re/t/reflags.t
    op/die.t
    op/local.t
    op/split.t
    op/turkish.t
    re/charset.t
    re/fold_grind.t
    re/pat_advanced.t
    re/pat_re_eval.t
    re/reg_fold.t
    re/regexp.t
    re/regexp_noamp.t
    re/regexp_notrie.t
    re/regexp_qr.t
    re/regexp_qr_embed.t
    re/regexp_trielist.t
    re/subst.t
    re/substT.t
    re/subst_wamp.t
    uni/fold.t
    uni/latin2.t
</code></pre>

<p>This already has 3 patches applied; more are needed.</p>

<p><code>$ perlall build 5.16.2d-nt-asan</code></p>

<pre><code>Failed 18 tests out of 2189, 99.18% okay.
../ext/XS-APItest/t/newCONSTSUB.t
../ext/re/t/reflags.t
../lib/perl5db.t
op/die.t
op/split.t
porting/checkcase.t
re/charset.t
re/fold_grind.t
re/pat_advanced.t
re/pat_rt_report.t
re/reg_fold.t
re/regexp_qr_embed.t
re/subst.t
re/substT.t
re/subst_wamp.t
uni/latin2.t
uni/opcroak.t
uni/parser.t
</code></pre>

<p>After analysis of the patches, 5.14.3 is not more insecure than 5.14.2.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Optimizing compiler benchmarks (part 4)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-4.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3934</id>

    <published>2012-10-08T19:03:28Z</published>
    <updated>2012-10-16T21:20:35Z</updated>

    <summary>nbody - More optimizations In the first part I showed some problems and possibilities of the B::C compiler and B::CC optimizing compiler with a regexp example which was very hard to optimize. In the second part I got 2 times...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<h2>nbody - More optimizations</h2>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html">first part</a>
I showed some problems and possibilities of the B::C compiler and
B::CC optimizing compiler with a regexp example which was very hard to
optimize.</p>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html">second part</a>
I got 2 times faster run-times with the B::CC compiler with the
<a href="http://shootout.alioth.debian.org/u32/performance.php?test=nbody">nbody</a> benchmark, which does a lot of arithmetic.</p>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-3.html">third part</a>
I got 4.5 times faster run-times with perl-level AELEMFAST optimizations, and discussed optimising array accesses via no autovivification or types.</p>

<p>Optimising array accesses showed the need for autovivification detection in B::CC and better stack handling for more ops and datatypes, esp. aelem and helem. </p>

<p>But first let's look at some easier goals. If we look at
the generated C source for a simple arithmetic function, like
<code>pp_sub_offset_momentum</code>, we immediately spot more possibilities.</p>

<pre><code>static
CCPP(pp_sub_offset_momentum)
{
    SV *sv, *src, *dst, *left, *right;
    NV rnv0, lnv0, d1_px, d2_py, d3_pz, d4_mass, d7_tmp, d10_tmp, d13_tmp, d15_tmp, d17_tmp, d19_tmp, d21_tmp, d23_tmp, d25_tmp, d27_tmp, d29_tmp, d31_tmp, d33_tmp, d35_tmp, d37_tmp, d40_tmp, d42_tmp, d44_tmp;
    PERL_CONTEXT *cx;
    MAGIC *mg;
    I32 oldsave, gimme;
    dSP;
  lab_2a41220:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* only needed once */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[1]); /* not needed at all */
    d1_px = 0.00;
  lab_2a41370:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[2]); /* not needed at all */
    d2_py = 0.00;
  lab_2a50a00:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[3]); /* not needed at all */
    d3_pz = 0.00;
  lab_2a50b30:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[4]); /* not needed at all */
  lab_2a50cc0:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    PUSHs(AvARRAY(MUTABLE_AV(PL_curpad[5]))[0]);    /* no autovivification */
    sv = POPs;
    MAYBE_TAINT_SASSIGN_SRC(sv);    /* not needed */
    SvSetMagicSV(PL_curpad[4], sv); /* i.e. PL_curpad[4] = sv; */
    ...
</code></pre>

<p>We can study the expanded macros with:</p>

<pre><code>cc_harness -DOPT -E -O2 -onbody.perl-2.perl-1.i nbody.perl-2.perl.c
</code></pre>

<p><code>TAINT_NOT</code> does <code>(PL_tainted = (0))</code>. It is needed only once, because nobody
changes <code>PL_tainted</code>. We can also ignore taint checks generally by setting <code>-fomit_taint</code>.</p>

<pre><code>perl -MO=Concise,offset_momentum nbody.perl-2a.perl

main::offset_momentum:
42 &lt;1&gt; leavesub[1 ref] K/REFC,1 -&gt;(end)
-     &lt;@&gt; lineseq KP -&gt;42
1        &lt;;&gt; nextstate(main 141 (eval 5):4) v -&gt;2
4        &lt;2&gt; sassign vKS/2 -&gt;5
2           &lt;$&gt; const(NV 0) s -&gt;3
3           &lt;0&gt; padsv[$px:141,145] sRM*/LVINTRO -&gt;4
...
</code></pre>

<p><code>sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;</code> is the 2nd part of the inlined code for 
<code>nextstate</code> and resets the stack pointer. As we keep track of the stack by ourselves we can
omit most of these resets in nextstate.</p>

<p><code>FREETMPS</code> is also part of <code>nextstate</code>. With -O1 it is called only after each basic
block, and with -O2 the temps are freed only after each
loop.  Whether FREETMPS is needed at all, i.e. whether locals are used in the
function, is not checked yet.</p>

<p><code>SAVECLEARSV(PL_curpad[1-4])</code> is part of <code>padsv /LVINTRO</code>, but here unneeded, since
it is in the context of sassign. So the value of the lexical does not need to be cleared
before it is set. And btw. the setter of the lexical is already optimized to a temporary.</p>

<p><code>MAYBE_TAINT_SASSIGN_SRC(sv)</code> is part of <code>sassign</code> and can be omitted with <code>-fomit_taint</code>,
and since we are at <code>TAINT_NOT</code> we can leave it out.</p>

<p><code>SvSetMagicSV(PL_curpad[4], sv)</code> is also part of the optimized <code>sassign</code> op, just not
yet optimized enough, since sv cannot have any magic. A type declaration for the <code>padsv</code>
would have used the faster equivalent <code>SvNV(PL_curpad[4]) = SvNV(sv);</code> put on the stack.</p>

<p>We can easily test this by NOP'ing these code sections and measuring the cost.</p>

<p>With <strong>4m53.073s</strong>, without <strong>4m23.265s</strong>. 30 seconds or ~10% faster. This is now in the typical
range of p5p micro-optimizations and not considered high-priority for now.</p>

<p>Let's rather check out more stack optimizations.</p>

<p>I added a new <a href="https://github.com/rurban/perl-compiler/commit/edda0c5ca8cd8fd072e425977dd3a1f80d34857c">B::Stackobj::Aelem</a> object to B::Stackobj to track aelemfast accesses
to array indices, and do the PUSH/POP optimizations on them.</p>

<p>The generated code now looks like:</p>

<pre><code>  lab_116f270:
    TAINT_NOT;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    FREETMPS;
    rnv0 = d9_mag; lnv0 = SvNV(AvARRAY((AV*)PL_curpad[25])[1]); /* multiply */
    d3_mm2 = lnv0 * rnv0;
  lab_116be90:
    TAINT_NOT;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    FREETMPS;
    d5_dx = SvNV(PL_curpad[5]);
    rnv0 = d3_mm2; lnv0 = d5_dx;    /* multiply */
    d29_tmp = lnv0 * rnv0;
    SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) = SvNVX(AvARRAY((AV*)PL_curpad[28])[0]) - d29_tmp;
</code></pre>

<p>Lvalue assignments need SvNVX, right-values can keep SvNV.
The op that writes <code>AvARRAY((AV*)PL_curpad[28])[0]</code> has the OPf_MOD flag since its first arg is modified.
nextstate with TAINT, FREETMPS and sp reset is still not optimized.</p>

<p>Performance went from <strong>4m53.073s</strong> to <strong>3m58.249s</strong>, 55s or 18.7% faster. Much better than with the nextstate optimizations. 30s less on top of this would be <strong>3m30s</strong>, still slower than Erlang, Racket or C#. And my goal was 2m30s.</p>
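
<p>The percentages above can be re-derived with a few lines of Perl. This is only a sketch for checking the arithmetic; the helper name <code>secs</code> is made up for this example and is not part of B::CC or perlcc:</p>

<pre><code>use strict;
use warnings;

# Hypothetical helper: convert a `time`-style string such as "4m53.073s" to seconds.
sub secs {
    my ($m, $s) = $_[0] =~ /^(\d+)m([\d.]+)s$/ or die "bad time: $_[0]";
    return 60 * $m + $s;
}

my $base      = secs('4m53.073s');   # before the optimizations
my $nextstate = secs('4m23.265s');   # nextstate parts NOP'ed
my $stack     = secs('3m58.249s');   # with the aelemfast stack optimization

printf "nextstate: %.1fs saved, %.1f%%\n",
    $base - $nextstate, 100 * ($base - $nextstate) / $base;
printf "stack:     %.1fs saved, %.1f%%\n",
    $base - $stack, 100 * ($base - $stack) / $base;
# Assumes both savings simply add up, which is not measured here.
printf "combined estimate: %.1fs\n", $stack - ($base - $nextstate);
</code></pre>

<p>This prints about 29.8s (10.2%) for the nextstate NOPs, 54.8s (18.7%) for the stack optimization, and a combined estimate of 208.4s, i.e. roughly 3m28s.</p>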

<p>But there's still a lot to optimize (loop unrolling, aelem, helem, ...) and adding the <a href="https://github.com/rurban/perl-compiler/commit/cc90753d69000453856f4746fd885e058c30ff4b">no autovivification check</a> was also costly. 
Several dependent packages were added to the generated code, like autovivification, Tie::Hash::NamedCapture, mro,
Fcntl, IO, Exporter, Cwd, File::Spec, Config, FileHandle, IO::Handle,
IO::Seekable, IO::File, Symbol, Exporter::Heavy, ...
But you see this cost neither in the binary size nor in the run-time.</p>

<p>I also tested the <a href="http://shootout.alioth.debian.org/u32/benchmark.php?test=fannkuchredux&amp;lang=all">fannkuchredux</a> benchmark, which was created for 
a bad <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.5124">LISP compiler</a> in 1994, also with array accessors.</p>

<p>Uncompiled with N=10 I got 16.093s, compiled 9.1222s, almost 2x
faster (1.76x).  And this code has the same aelem problem as nbody, so
a loop unrolling to aelemfast and better direct accessors with
no-autovivification would lead to a ~4x faster run-time.</p>

<h2>nextstate optimisations</h2>

<p>nextstate and its COP brother dbstate are mainly used to store the line
number of the current op for debugging. 
I already wrote an <a href="https://github.com/rurban/perl/commits/oplines">oplines patch</a>
to move the line info to all OPs, which removes the need for about
90% of the nextstate ops and would overcome the problem we are facing here:</p>

<pre><code>    PL_op = &amp;curcop_list[0];
    TAINT_NOT; /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* rarely needed */
    FREETMPS; /* rarely needed, only with TMPs */
</code></pre>

<p>oplines is not yet usable: it reduces the number of nextstate ops, 
but I haven't written the matching change to warnings and error handling. 
warn and die would have to search for the current cop to be able to
display the file name together with the line number.</p>

<p>A different strategy would be to create simpler state COPs, without TAINT check, 
without stack reset and without FREETMPS.
Like <code>state, state_t, state_s, state_f, state_ts, state_sf, state_tsf == nextstate</code>.</p>

<p><em>TBC...</em></p>
]]>
        

    </content>
</entry>

<entry>
    <title>Optimizing compiler benchmarks (part 3)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-3.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3926</id>

    <published>2012-10-06T04:24:02Z</published>
    <updated>2012-10-17T19:51:40Z</updated>

    <summary>nbody - Unrolling AELEM loops to AELEMFAST In the first part I showed some problems and possibilities of the B::C compiler and B::CC optimizing compiler with a regexp example which was very bad to optimize. In the second part I...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="benchmark" label="benchmark" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<h2>nbody - Unrolling AELEM loops to AELEMFAST</h2>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html">first
part</a>
I showed some problems and possibilities of the B::C compiler and
B::CC optimizing compiler with a regexp example which was very hard to
optimize.</p>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html">second
part</a>
I got 2 times faster run-times with the B::CC compiler with the
<a href="http://shootout.alioth.debian.org/u32/performance.php?test=nbody">nbody</a> benchmark, which does a lot of arithmetic.</p>

<p>Two open problems were detected: slow function calls, and slow array accesses.</p>

<p>At first I inlined the function which is called the most, <code>sub advance</code>,
which is called N times, N being 5,000, 50,000 or 50,000,000.</p>

<pre><code>for (1..$n) {
    advance(0.01);
}
</code></pre>

<p>The runtime with N=50,000,000 went from 22m13.754s down to 21m48.015s,
25s less. This is not what I wanted.
php and jruby are at 12 min and 9m. So it is not slow function calls,
it is slow array access.
Inspecting the opcodes shows that a lot of AELEM ops are used, for
reading and writing arrays.</p>

<p>AELEM checks for lvalue invocation and several more flags at run-time, even though
these are already known at compile-time. A fast version, AELEMFAST, exists already,
but it only operates on literal constant indices,
known at compile-time. The index is then stored at compile-time in the
op->private flag.</p>

<p>So instead of</p>

<pre><code>for (my $j = $i + 1; $j &lt; $last + 1; $j++) {
  # inner-loop $j..4
  $dx = $xs[$i] - $xs[$j];
  $dy = $ys[$i] - $ys[$j];
  $dz = $zs[$i] - $zs[$j];
  ...
</code></pre>

<p>one could generate a macro-like string in which the array indices are already evaluated, and generate
the final function from this string.</p>

<p>Array accesses: <code>$a[const]</code> are optimized to AELEMFAST, <code>$a[$lexical]</code> not.
So unroll the loop in macro-like fashion.</p>

<pre><code>$energy = '
sub energy
{
  my $e = 0.0;
  my ($dx, $dy, $dz, $distance);';
  for my $i (0 .. $last) {
$energy .= "
# outer-loop $i..4
    \$e += 0.5 * \$mass[$i] *
          (\$vxs[$i] * \$vxs[$i] + \$vys[$i] * \$vys[$i] + \$vzs[$i] * \$vzs[$i]);
";
  for (my $j = $i + 1; $j &lt; $last + 1; $j++) {
$energy .= "
    # inner-loop $j..4
    \$dx = \$xs[$i] - \$xs[$j];
    \$dy = \$ys[$i] - \$ys[$j];
    \$dz = \$zs[$i] - \$zs[$j];
    \$distance = sqrt(\$dx * \$dx + \$dy * \$dy + \$dz * \$dz);
    \$e -= (\$mass[$i] * \$mass[$j]) / \$distance;";
    }
  }
$energy .= '
  return $e;
}';
eval $energy; die if $@;
</code></pre>

<p>Every <code>$i</code> and <code>$j</code> got expanded into a literal, 0 .. 4.</p>

<p>I did this loop unrolling for the three functions, and the results
were impressive. It is a nice little macro trick which you could use
for normal uncompiled perl code also.  With compiled code the
loop-unrolling should happen automatically.</p>
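
<p>A minimal, self-contained version of the same trick can be tried on any perl. It is only a sketch: the sum-of-squares functions and the array contents are made up for illustration and are not taken from the nbody code:</p>

<pre><code>use strict;
use warnings;

our @xs = (1.5, 2.5, 3.5, 4.5, 5.5);

# Loop version: every $xs[$i] uses a run-time index, i.e. an AELEM op.
sub sum_squares_loop {
    my $sum = 0.0;
    for my $i (0 .. $#xs) {
        $sum += $xs[$i] * $xs[$i];
    }
    return $sum;
}

# Unrolled version: build the source text with literal indices,
# then compile it once with a string eval.
my $src = "sub sum_squares_unrolled {\n  my \$sum = 0.0;\n";
for my $i (0 .. $#xs) {
    $src .= "  \$sum += \$xs[$i] * \$xs[$i];\n";
}
$src .= "  return \$sum;\n}\n";
eval $src; die $@ if $@;

print $src;   # shows the generated code with the indices 0 .. 4 filled in
printf "loop=%.2f unrolled=%.2f\n", sum_squares_loop(), sum_squares_unrolled();
</code></pre>

<p>Both functions return 71.25. Only the generated one contains constant indices, which perl can compile to the fast array access op.</p>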

<p>Full code here: <a href="https://github.com/rurban/shootout/commit/62b216756320e8c224eef2c933326924ab73c18a">nbody.perl-2.perl</a></p>

<p>Original:</p>

<pre><code>$ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl 50000
script/perlcc: c time: 0.380353
script/perlcc: cc time: 0.967525
-0.169075164
-0.169078071
script/perlcc: r time: 2.214327
</code></pre>

<p>Unrolled:</p>

<pre><code>$ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl-2.perl 50000
script/perlcc: c time: 0.448817
script/perlcc: cc time: 2.167499
-0.169075164
-0.169078071
script/perlcc: r time: 1.341283
</code></pre>

<p>Another <strong>1.65x faster!</strong></p>

<p>For comparison the same effect uncompiled:</p>

<pre><code>$ time perl ../shootout/bench/nbody/nbody.perl 50000
-0.169075164
-0.169078071

real    0m3.650s
user    0m3.644s
sys 0m0.000s
</code></pre>

<p>Unrolled:</p>

<pre><code>$ time perl ../shootout/bench/nbody/nbody.perl-2.perl 50000
-0.169075164
-0.169078071

real    0m2.399s
user    0m2.388s
sys 0m0.004s
</code></pre>

<p>So we went from <strong>3.6s</strong> down to <strong>2.4s</strong> and compiled to <strong>1.3s</strong>.</p>

<p>With N=50,000,000 we got <strong>14m12.653s</strong> uncompiled and <strong>7m11.3597s</strong>
compiled. Close to jruby, even if the array accesses still go
through the <code>av_fetch</code> function, magic is checked and undefined indices
are autovivified.</p>

<h2>Generalization</h2>

<p>The above macro code looks pretty unreadable, similar to lisp
macros, with its mix of quoted and unquoted variables.  The compiler
needs to detect unrollable loop code which will lead to more
constants and AELEMFAST ops. And we had better define a helper function
for easier generation of such unrolled loops.</p>

<pre><code># unquote local vars
sub qv {
  my ($s, $env) = @_;
  # expand our local loop vars
  $s =~ s/(\$\w+?)\b/exists($env-&gt;{$1})?$env-&gt;{$1}:$1/sge;
  $s
}

$energy = '
sub energy
{
  my $e = 0.0;
  my ($dx, $dy, $dz, $distance);';
  for my $i (0 .. $last) {
    my $env = {'$i'=&gt;$i,'$last'=&gt;$last};
    $energy .= qv('
    # outer-loop $i..4
    $e += 0.5 * $mass[$i] *
          ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);', $env);
    for (my $j = $i + 1; $j &lt; $last + 1; $j++) {
      $env-&gt;{'$j'} = $j;
      $energy .= qv('
      # inner-loop $j..4
      $dx = $xs[$i] - $xs[$j];
      $dy = $ys[$i] - $ys[$j];
      $dz = $zs[$i] - $zs[$j];
      $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
      $e -= ($mass[$i] * $mass[$j]) / $distance;', $env);
    }
  }
  $energy .= '
  return $e;
}';
eval $energy; die if $@;
</code></pre>

<p>This now looks much better, and inside a BEGIN block it adds only a negligible
run-time penalty.
Full code here: <a href="https://github.com/rurban/shootout/commit/c35bb85ed84941157eb01b7ca844d3b4472e0df3">nbody.perl-2a.perl</a></p>
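
<p>To see what such a helper does to a single line, here is a small stand-alone sketch. <code>expand_vars</code> is a hypothetical variant of <code>qv()</code> that takes a plain hash instead of a hash reference; it is not part of the linked code:</p>

<pre><code>use strict;
use warnings;

# Hypothetical variant of qv(): replace each variable named in %env
# by its value, and leave all other variables untouched.
sub expand_vars {
    my ($s, %env) = @_;
    $s =~ s/(\$\w+?)\b/exists($env{$1}) ? $env{$1} : $1/ge;
    return $s;
}

print expand_vars('$dx = $xs[$i] - $xs[$j];', '$i', 0, '$j', 1), "\n";
</code></pre>

<p>This prints <code>$dx = $xs[0] - $xs[1];</code>. The loop variables are replaced by literals, while <code>$dx</code> and <code>$xs</code> are kept because they are not in the environment.</p>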

<p>I also tried a generic <code>unroll_loop()</code> function, but it was a bit too
unstable finding the end of the loop blocks on the source level, and
<code>qv()</code> looked good enough. The compiler can use the optree to find the
optimization.</p>

<h2>Types and autovivification</h2>

<p>A naive optimization would check the index ranges beforehand, and access
the array values directly. Something the type optimizer for arrays would
do.</p>

<pre><code>my (num @xs[4],  num @ys[4],  num @zs[4]);
my (num @vxs[4], num @vys[4], num @vzs[4]);
my num @mass[4];
</code></pre>

<p>And instead of <code>$xs[0] * $xs[1]</code> which compiles to
AELEMFASTs, currently inlined by B::CC to:</p>

<pre><code>{ AV* av = MUTABLE_AV(PL_curpad[6]);
  SV** const svp = av_fetch(av, 0, 0);
  SV *sv = (svp ? *svp : &amp;PL_sv_undef);
  if (SvRMAGICAL(av) &amp;&amp; SvGMAGICAL(sv)) mg_get(sv);
  PUSHs(sv);
}
{ AV* av = MUTABLE_AV(PL_curpad[6]);
  SV** const svp = av_fetch(av, 1, 0);
  SV *sv = (svp ? *svp : &amp;PL_sv_undef);
  if (SvRMAGICAL(av) &amp;&amp; SvGMAGICAL(sv)) mg_get(sv);
  PUSHs(sv);
}
rnv0 = POPn; lnv0 = POPn;       /* multiply */
d30_tmp = lnv0 * rnv0;
</code></pre>

<p>It should compile to:</p>

<pre><code>d30_tmp = (double)AvARRAY(PL_curpad[6])[0] *
          (double)AvARRAY(PL_curpad[6])[1];
</code></pre>

<p>With the size declaration you can omit the <code>av_fetch()</code> call and undef
check ("autovivification"), with the type <code>num</code> you do not need to get
to the <code>SvNV</code> of the array element, the value is stored directly, and
the type also guarantees that there is no magic to be checked.  So
<code>AvARRAY(PL_curpad[6])[0]</code> would return a double.</p>

<p>And the stack handling (PUSH, PUSH, POP, POP) can also be optimized
away, since the ops are inlined already.  That would get us close to
an optimizing compiler as with Haskell, Lua, PyPy or LISP. Not close
to Go or Java, as their languages are stricter.</p>

<p>I tried a simple B::CC AELEMFAST optimization together with "no autovivification"
which does not yet eliminate superfluous PUSH/POP pairs but could be applied
for typed arrays and leads to another 2x win.</p>

<p>2.80s down to 1.67s on a slower PC with N=50,000.</p>

<p>Compiled to <em>(perlcc /2a)</em>:</p>

<pre><code>PUSHs(AvARRAY(PL_curpad[6])[0]));
PUSHs(AvARRAY(PL_curpad[6])[1]));
rnv0 = POPn; lnv0 = POPn;       /* multiply */
d30_tmp = rnv0 * lnv0;
</code></pre>

<p>Without superfluous PUSH/POP pairs I suspect another 2x win. But this
is not implemented yet. With typed arrays maybe another 50% win, and we don't
need the no autovivification overhead.</p>

<p>It should look like <em>(perlcc /2b)</em>:</p>

<pre><code>rnv0 = SvNV(AvARRAY(PL_curpad[6])[0]);
lnv0 = SvNV(AvARRAY(PL_curpad[6])[1]);
d30_tmp = rnv0 * lnv0;          /* multiply */
</code></pre>

<p>I'm currently implementing the check for the 'no autovivification' pragma and
the stack optimizations.</p>

<h2>Summary</h2>

<p><a href="http://shootout.alioth.debian.org/u64q/performance.php?test=nbody">u64q nbody</a></p>

<p>Original numbers with N=50,000,000:</p>

<pre><code>* Fortran       14.09s
* C             20.72s
* Go            32.11s
* SBCL          42.75s
* Javascript V8 44.78s - 82.49s
* JRuby       8m
* PHP        11m
* Python 3   16m
* Perl       23m
* Ruby 1.9   26m
</code></pre>

<p>My numbers with N=50,000,000:</p>

<pre><code>* Perl       22m14s
* Perl /1    21m48s         (inline sub advance, no ENTERSUB/LEAVESUB)
* perlcc      9m52s
* Perl /2    14m13s          (unrolled loop + AELEM =&gt; AELEMFAST)
* perlcc /2   7m11s
* perlcc /2a  4m52s           (no autovivification, 4.5x faster)
* perlcc /2b  ? (~2m30)      (no autovivification + stack opt)
* perlcc /2c  ? (~1m25s)     (typed arrays + stack opt)
</code></pre>
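
<p>The speed-up factors relative to plain Perl can be recomputed from the measured rows of this table. The following is only a sketch; the helper <code>to_secs</code> is made up for this example, and the estimated rows (/2b, /2c) are left out:</p>

<pre><code>use strict;
use warnings;

# Hypothetical helper: "22m14s" to seconds.
sub to_secs {
    my ($m, $s) = $_[0] =~ /^(\d+)m(\d+)s$/ or die "bad time: $_[0]";
    return 60 * $m + $s;
}

# Label/time pairs copied from the table above; the first row is the baseline.
my @runs = (
    'Perl',       '22m14s',
    'Perl /1',    '21m48s',
    'perlcc',     '9m52s',
    'Perl /2',    '14m13s',
    'perlcc /2',  '7m11s',
    'perlcc /2a', '4m52s',
);

my $base;
while (my ($label, $time) = splice(@runs, 0, 2)) {
    my $secs = to_secs($time);
    $base = $secs unless defined $base;
    printf "%-11s %5ds  %.2fx\n", $label, $secs, $base / $secs;
}
</code></pre>

<p>For the last row this gives 4.57x, which matches the "4.5x faster" noted for perlcc /2a.</p>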

<p>Continued at <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-4.html">part 4</a> with stack and nextstate optimisations. But I only reached 3m30s, not 2m30s so far.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Optimizing compiler benchmarks (part 2)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3922</id>

    <published>2012-10-04T13:55:40Z</published>
    <updated>2012-10-10T14:46:36Z</updated>

    <summary>nbody - unboxed inlined arithmetic 2x faster In the first part I showed some problems and possibilities of the B::C compiler and B::CC optimizing compiler with an example which was very bad to optimize, and promised for the next day...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="benchmark" label="benchmark" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<h1>nbody - unboxed inlined arithmetic 2x faster</h1>

<p>In the <a href="http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html">first part</a> I showed some problems and possibilities of the B::C compiler and B::CC optimizing compiler with an example which was very bad to optimize, and promised for the next day an improvement with "stack smashing", avoiding copy overhead between the compiler stacks and global perl data.</p>

<p>The next days I went to Austin to meet with the <a href="http://perl11.org/">perl11.org</a> group, which has as one of the goals an optimizing compiler for perl5, and to replace all three main parts of perl: the parser, the compiler/optimizer and the vm (the runtime) at will. You can do most of it already, esp. replace the runloop, but the 3 parts are too intermingled and undocumented.</p>

<p>So I discussed the "stack smashing" problem with Will and my idea on the solution.</p>

<h2>1. The "stack smashing" problem</h2>

<p>B::CC keeps two internal stacks to be able to optimize arithmetic and boolean operations on numbers, int IV and double NV.</p>

<p>The first stack, called "stack", keeps the perl operand stack, the arguments for each push/pop runtime pp_* function.
The perl AST, called optree, is a bit dysfunctional, as it is not able to generate optimized variants based on the operand types. There are integer-optimized versions, used with "use integer", e.g. an i_add variant for the add operator, which are used if both operands are known to be integers or integer constants at compile-time. There are no variants for strict NV, and most importantly, there are no variants for degenerated arguments, arguments with magic. Because you can add magic at run-time, which the compiler (op.c) does not know about (<em>which is super lame</em>), all argument types need to be checked at run-time, and you always have to take the slow general path.</p>
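
<p>The effect of the integer variants is visible from plain Perl. This small sketch, with made-up variable names, uses division instead of addition because the difference shows up in the result:</p>

<pre><code>use strict;
use warnings;

my ($x, $y) = (10, 3);

my $generic = $x / $y;                      # generic divide op, NV result
my $integer = do { use integer; $x / $y };  # integer variant, IV result

printf "generic=%.4f integer=%d\n", $generic, $integer;
</code></pre>

<p>This prints <code>generic=3.3333 integer=3</code>. The pragma is lexically scoped, so only the op inside the <code>do</code> block is compiled to the integer variant.</p>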

<p>This B::CC stack, implemented in B::Stackobj keeps track of the types during the lifetime and can optimize and pessimize the used type of each stack variable. It can esp. optimize on bool, int and double, needs to exchange values on stringification and pessimizes on magic, esp. tie (a variable) and overload (an op).
There is no nice picture in <a href="http://search.cpan.org/dist/illguts/">illguts</a> describing operand stacks.</p>

<p>The second stack holds lexicals. The same is done for all perl lexical variables: the B::Stackobj::Padsv lexical stack mimics every function's PADLIST. Contrary to the stack, the access to this list is already optimized by the compiler (op.c) at compile-time in the PL_curpad array, since it holds lexical variables which are fully known at compile-time. So each op knows the exact index into the padlist, stored in the fields "targ" or "padix". Pad types are dynamic as on the stack, but there is an additional list, the comppadnames, which holds optional type information, essentially a pointer to the package name for each typed variable. It is just not used yet.
See illguts for a <a href="http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index.html#stack">picture</a>.</p>

<p>B::Stackobj::Padsv can use the perl type information for the packages int and double, but not yet for number and bool or CTypes nor Moose types. So <code>my int $i</code> already specifies an optimized integer, as an IV during the scope of <code>use integer</code>.</p>

<p>Specifying types via perl attributes, like <code>my $i :int;</code> would be nice also, but is technically impossible, since there is no compile-time method for attributes to be checked; only at run-time.</p>

<p><code>B::CC</code> keeps a list of good ops (pp_ functions), where the type or at least the number of arguments or return types is known at compile time. The same information is also available in the <a href="http://search.cpan.org/dist/Opcodes/">Opcodes</a> module on CPAN.
It defines op lists like no_stack, skip_stack, skip_lexicals, skip_invalidate and need_curcop, and lists for various other predefined op types needed for other optimizers or the Jit module. Does it branch, i.e. will it always return op->next or not?
Does it need to call PERL_ASYNC_CHECK?</p>

<p>On all unknown ops or ops which need to access lexicals, the current internal B::Stackobj::Padsv lexical stack values need to be refreshed, i.e. written back from the internal compiler stack to the actual values on the heap, which is <code>PL_curpad[ix]</code> (sub write_back_lexicals). The same must be done for all stack variables which need to be accessed by the next op (sub write_back_stack), but not for ops which do not access stack variables.</p>

<p>So there is a lot of theoretical copying - "stack smashing" - going on.
But B::CC cleverly keeps track of the stack areas which need to be written back, so in practice only the really needed values are handled.
In practice only numeric and boolean optimizations operate on private C variables on the C stack, rather than on referenced heap values, either on the perl stack or in the curpad. Simple sort callbacks also.
So only for unboxed numbers do we need to copy the values back and forth, before and after, as B::CC inlines most of these ops.</p>
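
<p>The idea of writing back only what is needed can be modelled in a few lines. This is a toy model under my own assumptions, not B::CC's implementation: <code>@curpad</code> stands in for <code>PL_curpad</code>, and the names <code>set_local</code> and <code>flush_dirty</code> are made up:</p>

<pre><code>use strict;
use warnings;

my @curpad = (0, 0, 0);   # stands in for PL_curpad, the real heap values
my %local;                # unboxed working copies, keyed by pad index
my %dirty;                # pad indices whose working copy is newer than @curpad
my $writes = 0;           # counts the actual write-backs

sub set_local {
    my ($ix, $val) = @_;
    $local{$ix} = $val;
    $dirty{$ix} = 1;
}

# Copy only the modified working copies back to the pad.
sub flush_dirty {
    for my $ix (sort keys %dirty) {
        $curpad[$ix] = $local{$ix};
        $writes++;
    }
    %dirty = ();
}

# Ten arithmetic steps touch only the working copy ...
set_local(1, ($local{1} || 0) + 0.5) for 1 .. 10;
# ... and one write-back happens before an op that needs the real pad.
flush_dirty();
print "curpad[1]=$curpad[1] writes=$writes\n";
</code></pre>

<p>This prints <code>curpad[1]=5 writes=1</code>: ten updates, but only one copy back to the heap value.</p>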

<h2>2. Benchmarks</h2>

<p>I'll take a benchmark in which Perl is very slow compared to other scripting
languages, and which does a lot of arithmetic, because I expect the
B::CC type optimizer to kick in, unboxing all the numbers and inlining
most of the arithmetic.</p>

<p><a href="http://shootout.alioth.debian.org/u32/performance.php?test=nbody">nbody</a>
performs a simple N-body simulation of the Jovian planets.
Perl is currently by far the slowest scripting language for nbody,
26 min compared to 9-18 min for ruby, php or python with n=50,000,000.</p>

<p>perl = perl5.14.2-nt (non-threaded, -Os -msse4.2 -march=corei7)</p>

<pre><code>$ time perl ../shootout/bench/nbody/nbody.perl 50000

-0.169075164
-0.169057007

 real   0m1.305s
 user   0m1.300s
 sys    0m0.000s
</code></pre>

<p>Compiled:</p>

<pre><code>$ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp,-DOscpSqlm,-v \
              ../shootout/bench/nbody/nbody.perl 50000

script/perlcc: c time: 0.171225
script/perlcc: cc time: 0.984996
-0.169075214
-0.169078108
script/perlcc: r time: 0.600024
</code></pre>

<p>So we get a <strong>2x faster run-time</strong>, with slightly different results and a lot of interesting command line options.</p>

<p><strong>--time</strong> prints the B::CC time as 'c time', the gcc and ld time as 'cc time', and the run-time as 'r time'. 0.625202s vs. 1.305s in pure perl. Even gcc plus ld with -OS is faster than perl. And B::CC's optimizer is also real fast here in this simple example.</p>

<p><strong>-r</strong> runs the compiled program with the rest of the perlcc arguments</p>

<p><strong>-O</strong> compiles with B::CC, not with the non-optimizing B::C</p>

<p><strong>-S</strong> keeps the C source, to be able to inspect the generated optimized C code.</p>

<p><strong>-O1</strong> adds some minor B::CC optimizations. -O2 is pretty unstable yet, and B::CC proper (-O0) already adds all B::C -O3 optimizations.</p>

<p><strong>--Wb</strong> defines further B::CC options</p>

<p><strong>-fno-destruct</strong> is a B::C option to skip optree destruction at the very end. It does thread, IO and object destruction in proper order, and of course does object destruction during run-time, but we do not care about memory leaks with normal executables. Process termination does it better and faster than perl. Even daemons are safe to be compiled with -fno-destruct, just not shared libraries.</p>

<p>-U defines packages to be <strong>unused</strong>. warnings, B, Carp are notorious compiler packages, which are innocently being pulled in, even if you do not use or call them.</p>

<p>B is used by the compiler itself, and since the B maintainer does a terrible job helping the B compiler modules, we have to manually force B to get out of our way. warnings and Carp are also magically pulled in by some dependent core modules and cause a lot of startup and memory overhead. These 3 packages are easily skipped with simple programs or benchmarks; in the real world you have to live with multi-megabyte compiled programs. This reflects the reality of the memory perl uses during run-time.</p>

<p>E.g. without -U the numbers are:</p>

<pre><code>cc pp_main
 cc pp_sub_offset_momentum
 cc pp_sub_energy
 cc pp_sub_advance
 Prescan 1 packages for unused subs in main::
 Saving unused subs in main::
 old unused: 1, new: 1
 no %SIG in BEGIN block
 save context:
 cc pp_sub_warnings__register_categories
 cc pp_sub_warnings___mkMask
 Total number of OPs processed: 193
 NULLOP count: 0
 bootstrapping DynaLoader added to xs_init
 no dl_init for B, not marked
 my_perl_destruct (-fcog)
script/perlcc: c time: 0.192175
script/perlcc: cc time: 1.3049
-0.169075214
-0.169078108
script/perlcc: r time: 0.642252
</code></pre>

<p><strong>-DOscpSqlm</strong> are some debugging options, which add interesting information into the generated C code. B::CC adds debugging output as comments into the C code, to be able to inspect the optimizer result, B::C prints debugging output to STDOUT.</p>

<p>Let's have a look into the <a href="http://shootout.alioth.debian.org/u32/program.php?test=nbody&amp;lang=perl&amp;id=1">code</a>.</p>

<p><code>cat ../shootout/bench/nbody/nbody.perl</code></p>

<pre><code># The Computer Language Shootout
# http://shootout.alioth.debian.org/
#
# contributed by Christoph Bauer
# converted into Perl by Márton Papp
# fixed and cleaned up by Danny Sauer
# optimized by Jesse Millikan

use constant PI            =&gt; 3.141592653589793;
use constant SOLAR_MASS    =&gt; (4 * PI * PI);
use constant DAYS_PER_YEAR =&gt; 365.24;

#  Globals for arrays... Oh well.
#  Almost every iteration is a range, so I keep the last index rather than a count.
my (@xs, @ys, @zs, @vxs, @vys, @vzs, @mass, $last);

sub advance($)
{
  my ($dt) = @_;
  my ($mm, $mm2, $j, $dx, $dy, $dz, $distance, $mag);

  #  This is faster in the outer loop...
  for (0..$last) {
  #  But not in the inner loop. Strange.
    for ($j = $_ + 1; $j &lt; $last + 1; $j++) {
      $dx = $xs[$_] - $xs[$j];
      $dy = $ys[$_] - $ys[$j];
      $dz = $zs[$_] - $zs[$j];
      $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
      $mag = $dt / ($distance * $distance * $distance);
      $mm = $mass[$_] * $mag;
      $mm2 = $mass[$j] * $mag;
      $vxs[$_] -= $dx * $mm2;
      $vxs[$j] += $dx * $mm;
      $vys[$_] -= $dy * $mm2;
      $vys[$j] += $dy * $mm;
      $vzs[$_] -= $dz * $mm2;
      $vzs[$j] += $dz * $mm;
    }

    # We're done with planet $_ at this point
    # This could be done in a seperate loop, but it's slower
    $xs[$_] += $dt * $vxs[$_];
    $ys[$_] += $dt * $vys[$_];
    $zs[$_] += $dt * $vzs[$_];
  }
}

sub energy
{
  my ($e, $i, $dx, $dy, $dz, $distance);

  $e = 0.0;
  for $i (0..$last) {
    $e += 0.5 * $mass[$i] *
          ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);
    for ($i + 1..$last) {
      $dx = $xs[$i] - $xs[$_];
      $dy = $ys[$i] - $ys[$_];
      $dz = $zs[$i] - $zs[$_];
      $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
      $e -= ($mass[$i] * $mass[$_]) / $distance;
    }
  }
  return $e;
}

sub offset_momentum
{
  my ($px, $py, $pz) = (0.0, 0.0, 0.0);

  for (0..$last) {
    $px += $vxs[$_] * $mass[$_];
    $py += $vys[$_] * $mass[$_];
    $pz += $vzs[$_] * $mass[$_];
  }
  $vxs[0] = - $px / SOLAR_MASS;
  $vys[0] = - $py / SOLAR_MASS;
  $vzs[0] = - $pz / SOLAR_MASS;
}

# @ns = ( sun, jupiter, saturn, uranus, neptune )
@xs = (0, 4.84143144246472090e+00, 8.34336671824457987e+00, 1.28943695621391310e+01, 1.53796971148509165e+01);
@ys = (0, -1.16032004402742839e+00, 4.12479856412430479e+00, -1.51111514016986312e+01, -2.59193146099879641e+01);
@zs = (0, -1.03622044471123109e-01, -4.03523417114321381e-01, -2.23307578892655734e-01, 1.79258772950371181e-01);
@vxs = map {$_ * DAYS_PER_YEAR}
  (0, 1.66007664274403694e-03, -2.76742510726862411e-03, 2.96460137564761618e-03, 2.68067772490389322e-03);
@vys = map {$_ * DAYS_PER_YEAR}
  (0, 7.69901118419740425e-03, 4.99852801234917238e-03, 2.37847173959480950e-03, 1.62824170038242295e-03);
@vzs = map {$_ * DAYS_PER_YEAR}
  (0, -6.90460016972063023e-05, 2.30417297573763929e-05, -2.96589568540237556e-05, -9.51592254519715870e-05);
@mass = map {$_ * SOLAR_MASS}
  (1, 9.54791938424326609e-04, 2.85885980666130812e-04, 4.36624404335156298e-05, 5.15138902046611451e-05);

$last = @xs - 1;

offset_momentum();
printf ("%.9f\n", energy());

my $n = $ARGV[0];

# This does not, in fact, consume N*4 bytes of memory
for (1..$n){
  advance(0.01);
}

printf ("%.9f\n", energy());
</code></pre>

<p>A lot of arithmetic and only three functions: advance is called 50,000 times, offset_momentum once and energy twice (before and after the loop).</p>

<p>The generated C code for some inlined arithmetic looks like:</p>

<p><code>$ grep -A50 pp_sub_energy nbody.perl.c</code></p>

<pre><code>static
CCPP(pp_sub_energy)
{
    double rnv0, lnv0, d1_e, d2_i, d3_dx, d4_dy, d5_dz, d6_distance, d11_tmp, d13_tmp,
           d15_tmp, d16_tmp, d18_tmp, d19_tmp, d20_tmp, d22_tmp, d31_tmp, d32_tmp, d33_tmp,
           d34_tmp, d35_tmp, d37_tmp, d38_tmp;
    SV *sv, *src, *dst, *left, *right;
    PERL_CONTEXT *cx;
    MAGIC *mg;
    I32 oldsave, gimme;
    dSP;
    /* init_pp: pp_sub_energy */
    /* load_pad: 39 names, 39 values */
    /* PL_curpad[1] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[1] iv=i1_e nv=d1_e */
    /* PL_curpad[2] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[2] iv=i2_i nv=d2_i */
    /* PL_curpad[3] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[3] iv=i3_dx nv=d3_dx */
    /* PL_curpad[4] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[4] iv=i4_dy nv=d4_dy */
    /* PL_curpad[5] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[5] iv=i5_dz nv=d5_dz */
    /* PL_curpad[6] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[6] iv=i6_distance nv=d6_distance */
    /* PL_curpad[7] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[7] iv=i7_last nv=d7_last */
    /* PL_curpad[8] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[8] iv=i8_tmp nv=d8_tmp */
    /* PL_curpad[9] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[9] iv=i9_tmp nv=d9_tmp */
    /* PL_curpad[10] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[10] iv=i10_tmp nv=d10_tmp */
    /* PL_curpad[11] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[11] iv=i11_tmp nv=d11_tmp */
    /* PL_curpad[12] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[12] iv=i12_tmp nv=d12_tmp */
    /* PL_curpad[13] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[13] iv=i13_tmp nv=d13_tmp */
    /* PL_curpad[14] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[14] iv=i14_tmp nv=d14_tmp */
    /* PL_curpad[15] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[15] iv=i15_tmp nv=d15_tmp */
    /* PL_curpad[16] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[16] iv=i16_tmp nv=d16_tmp */
    /* PL_curpad[17] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[17] iv=i17_tmp nv=d17_tmp */
    /* PL_curpad[18] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[18] iv=i18_tmp nv=d18_tmp */
    /* PL_curpad[19] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[19] iv=i19_tmp nv=d19_tmp */
    /* PL_curpad[20] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[20] iv=i20_tmp nv=d20_tmp */
    /* PL_curpad[21] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[21] iv=i21_tmp nv=d21_tmp */
    /* PL_curpad[22] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[22] iv=i22_tmp nv=d22_tmp */
    /* PL_curpad[23] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[23] iv=i23_tmp nv=d23_tmp */
    /* PL_curpad[24] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[24] iv=i24_tmp nv=d24_tmp */
    /* PL_curpad[25] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[25] iv=i25_tmp nv=d25_tmp */
    /* PL_curpad[26] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[26] iv=i26_tmp nv=d26_tmp */
    /* PL_curpad[27] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[27] iv=i27_tmp nv=d27_tmp */
    /* PL_curpad[28] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[28] iv=i28_tmp nv=d28_tmp */
    /* PL_curpad[29] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[29] iv=i29_tmp nv=d29_tmp */
    /* PL_curpad[30] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[30] iv=i30_tmp nv=d30_tmp */
    /* PL_curpad[31] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[31] iv=i31_tmp nv=d31_tmp */
    /* PL_curpad[32] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[32] iv=i32_tmp nv=d32_tmp */
    /* PL_curpad[33] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[33] iv=i33_tmp nv=d33_tmp */
    /* PL_curpad[34] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[34] iv=i34_tmp nv=d34_tmp */
    /* PL_curpad[35] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[35] iv=i35_tmp nv=d35_tmp */
    /* PL_curpad[36] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[36] iv=i36_tmp nv=d36_tmp */
    /* PL_curpad[37] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[37] iv=i37_tmp nv=d37_tmp */
    /* PL_curpad[38] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[38] iv=i38_tmp nv=d38_tmp */
    /* PL_curpad[39] = Padsv type=T_UNKNOWN flags=VALID_SV|REGISTER|TEMPORARY sv=PL_curpad[39] iv=i39_tmp nv=d39_tmp */
  lab_1fd4ba0:  /* nextstate */
    /* stack =  */
    /* COP (0x1fd4ba0) nextstate [0] */
    /* ../shootout/bench/nbody/nbody.perl:51 */
    TAINT_NOT;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    FREETMPS;
    /* write_back_stack() 0 called from B::CC::compile_bblock */
  lab_1fd4a10:  /* pushmark */
    /* stack =  */
    /* OP (0x1fd4a10) pushmark [0] */
    /* write_back_stack() 0 called from B::CC::pp_pushmark */
    PUSHMARK(sp);
    /* stack =  */
    /* OP (0x1fd4960) padsv [1] */
    SAVECLEARSV(PL_curpad[1]);
    /* stack = PL_curpad[1] */
    /* OP (0x1fd49c0) padsv [2] */
    SAVECLEARSV(PL_curpad[2]);
    /* stack = PL_curpad[1] PL_curpad[2] */
    /* OP (0x1fd4a40) padsv [3] */
    SAVECLEARSV(PL_curpad[3]);
    /* stack = PL_curpad[1] PL_curpad[2] PL_curpad[3] */
    /* OP (0x1fd4a90) padsv [4] */
    SAVECLEARSV(PL_curpad[4]);
    /* stack = PL_curpad[1] PL_curpad[2] PL_curpad[3] PL_curpad[4] */
    /* OP (0x1fd4990) padsv [5] */
    SAVECLEARSV(PL_curpad[5]);
    /* stack = PL_curpad[1] PL_curpad[2] PL_curpad[3] PL_curpad[4] PL_curpad[5] */
    /* OP (0x1fd4930) padsv [6] */
    SAVECLEARSV(PL_curpad[6]);
    /* stack = PL_curpad[1] PL_curpad[2] PL_curpad[3] PL_curpad[4] PL_curpad[5] PL_curpad[6] */
    /* LISTOP (0x1e99820) list [0] */
    /* list */
    /* write_back_stack() 6 called from B::CC::pp_list */
    EXTEND(sp, 6);
    PUSHs((SV*)PL_curpad[1]);
    PUSHs((SV*)PL_curpad[2]);
    PUSHs((SV*)PL_curpad[3]);
    PUSHs((SV*)PL_curpad[4]);
    PUSHs((SV*)PL_curpad[5]);
    PUSHs((SV*)PL_curpad[6]);
    PP_LIST(1);
    /* write_back_stack() 0 called from B::CC::compile_bblock */
    ...
</code></pre>

<p>Nothing interesting yet, but you get the idea: the compiler keeps track of all
the used lexicals and stack variables, and was able to optimize the types of most of the
numeric lexicals.</p>

<pre><code>sub energy
{
  my ($e, $i, $dx, $dy, $dz, $distance);
...
</code></pre>

<p>E.g.
<code>PL_curpad[1] = Padsv type=T_UNKNOWN flags=VALID_SV sv=PL_curpad[1] iv=i1_e nv=d1_e</code></p>

<p><code>PL_curpad[1]</code> is the first lexical, <code>$e</code> in the perl code. Its IV value is named <code>i1_e</code>, its NV value <code>d1_e</code>.</p>

<p><code>type=T_UNKNOWN</code> means that there was no strict type information inferred. <code>T_DOUBLE</code> would have been
better as <code>$e</code> is only used as NV and returns the resulting energy. A declaration of <code>my double $e;</code>
would have done that.</p>

<p><code>flags=VALID_SV</code> is also not optimal, <code>|REGISTER|TEMPORARY</code> would be better. <code>iv=i1_e nv=d1_e</code>
are the two possible C variables for this dual var during its lifetime in this function. But only the
NV <code>d1_e</code> is used. The IV part <code>i1_e</code> is never used and not declared.</p>

<p>Let's continue to some interesting parts:</p>

<pre><code>lab_1fffd30:    /* nextstate */
/* ../shootout/bench/nbody/nbody.perl:61 */
TAINT_NOT;
sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
FREETMPS;
/* stack =  */
/* OP (0x1fd5260) padsv [3] */
/* stack = PL_curpad[3] */
/* OP (0x1fd5290) padsv [3] */
/* stack = PL_curpad[3] PL_curpad[3] */
/* BINOP (0x1fd51c0) multiply [31] */
d3_dx = SvNV(PL_curpad[3]);
rnv0 = d3_dx; lnv0 = d3_dx; /* multiply */
d31_tmp = lnv0 * rnv0;
/* stack = d31_tmp */
/* OP (0x1fffaf0) padsv [4] */
/* stack = d31_tmp PL_curpad[4] */
/* OP (0x1fffb20) padsv [4] */
/* stack = d31_tmp PL_curpad[4] PL_curpad[4] */
/* BINOP (0x1fffb50) multiply [32] */
d4_dy = SvNV(PL_curpad[4]);
rnv0 = d4_dy; lnv0 = d4_dy; /* multiply */
d32_tmp = lnv0 * rnv0;
/* stack = d31_tmp d32_tmp */
/* BINOP (0x1fffb90) add [33] */
rnv0 = d32_tmp; lnv0 = d31_tmp; /* add */
d33_tmp = lnv0 + rnv0;
/* stack = d33_tmp */
/* OP (0x1fffbd0) padsv [5] */
/* stack = d33_tmp d5_dz */
/* OP (0x1fffc00) padsv [5] */
/* stack = d33_tmp d5_dz d5_dz */
/* BINOP (0x1fffc30) multiply [34] */
rnv0 = d5_dz; lnv0 = d5_dz; /* multiply */
d34_tmp = lnv0 * rnv0;
/* stack = d33_tmp d34_tmp */
/* BINOP (0x1fffc70) add [35] */
rnv0 = d34_tmp; lnv0 = d33_tmp; /* add */
d35_tmp = lnv0 + rnv0;
/* stack = d35_tmp */
/* UNOP (0x1fffcb0) sqrt [6] */
/* write_back_lexicals(0) called from B::CC::default_pp */
sv_setnv(PL_curpad[5], d5_dz);
sv_setnv(PL_curpad[31], d31_tmp);
sv_setnv(PL_curpad[32], d32_tmp);
sv_setnv(PL_curpad[33], d33_tmp);
sv_setnv(PL_curpad[34], d34_tmp);
sv_setnv(PL_curpad[35], d35_tmp);
/* write_back_stack() 1 called from B::CC::default_pp */
EXTEND(sp, 1);
PUSHs((SV*)PL_curpad[35]);
PL_op = (OP*)&amp;unop_list[31];
DOOP(PL_ppaddr[OP_SQRT]);
/* invalidate_lexicals(0) called from B::CC::default_pp */
/* stack =  */
</code></pre>

<p>This is part of:</p>

<pre><code>  $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
</code></pre>

<p>We see the <code>OP_SQRT</code> as the last part, not inlined, and all the simple
<code>+</code> and <code>*</code> being unboxed and inlined via temporary variables.
What I called stack smashing is <code>write_back_lexicals</code> writing back
the NV values of <code>PL_curpad[5]</code> and <code>PL_curpad[31-35]</code>,
and <code>write_back_stack()</code> pushing <code>PL_curpad[35]</code> as the argument for SQRT.</p>

<p>My idea was to calculate directly on the <code>SvNVX(PL_curpad[*])</code> values,
but on second thought I believe copying the values to temporaries,
basically local stack locations or even registers, is faster
than dereferencing pointers to them. Initialising them and writing them back
seems to be okay and not excessive.</p>

<p>So the arithmetic optimizations are already pretty good. sqrt could be
inlined as well, since perl has no bignum promotion. The big remaining
problems are consting, function calls, method calls and stabilizing
B::CC.</p>

<p>To compare real numbers, <em>50_000_000</em> is the argument used at alioth, leading to 26m.
My PC is a bit faster, needing 22m13s.</p>

<pre><code>$ time perl5.14.2-nt ../shootout/bench/nbody/nbody.perl 50000
-0.169075164
-0.169096567

real    **0m13.132s**
user    0m13.109s
sys         0m0.000s
</code></pre>

<p>Compiled:</p>

<pre><code>perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp \
       ../shootout/bench/nbody/nbody.perl 50000

script/perlcc: c time: 0.158228
script/perlcc: cc time: 0.98483
-0.169075214
-0.169096616
script/perlcc: r time: **5.992293**
</code></pre>

<p>Comparable times with N=50,000,000:</p>

<pre><code>perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp \
       ../shootout/bench/nbody/nbody.perl 50000000

script/perlcc: c time: 0.19311
script/perlcc: cc time: 0.962425
-0.169075214
-0.169096616
script/perlcc: r time: **591.965999**
</code></pre>

<p>Reference value, uncompiled:</p>

<pre><code>time perl5.14.2-nt ../shootout/bench/nbody/nbody.perl 50000000
-0.169075164
-0.169059907

real        22m13.754s
user        22m8.155s
sys         0m1.156s
</code></pre>

<p>591.965999s = <strong>9m51.966s</strong> vs 22m13.754s</p>
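
<p>As a quick sanity check of that conversion and of the resulting speed-up, here is a small sketch. The variable names are mine, the timings are the two N=50,000,000 measurements from above:</p>

<pre><code>use strict;
use warnings;

# wall-clock times of the two N=50,000,000 runs
my $compiled   = 591.965999;         # perlcc "r time" in seconds
my $uncompiled = 22 * 60 + 13.754;   # 22m13.754s in seconds

# convert the compiled run-time back to minutes and seconds
my $min = int($compiled / 60);
my $sec = $compiled - 60 * $min;

printf("compiled: %dm%.3fs\n", $min, $sec);
printf("speed-up: %.2fx\n", $uncompiled / $compiled);
</code></pre>

<p>This prints 9m51.966s and a speed-up of 2.25x for the compiled version.</p>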

<p>So we overtook python 3 (18m), php (12m) and ruby 1.9 (23m), but not jruby (9m).
jruby uses 585,948KB of memory though, and at least php has a better algorithm.</p>

<p>Function calls and more optimisations are inspected in part 3, hopefully with
the <a href="http://shootout.alioth.debian.org/u32/performance.php?test=binarytrees">binarytrees</a>
benchmark.  I will also analyse the calls to <code>sub advance</code> in the loop
here, as <code>sub advance</code> can be easily optimized automatically. This
function does not throw exceptions, has a prototype defining one
argument, has no return value, its result is ignored by the caller, and it does not
define any locals. It can even be inlined automatically.</p>

<pre><code>for (1..$n) {
    advance(0.01);
}
</code></pre>

<p>The uncompiled version with <code>sub advance</code> inlined manually needs 21m48.015s, 25s less.
Compiled and inlined manually it needs 612.395542s (10m12s), a bit slower than not inlined.
So the biggest performance hit is the unoptimized slow AELEM op, which accesses
array elements. With an immediate AELEM, similar to the AELEMFAST op which is already inlined,
the run-time should be 8-10x faster. I'm going for LVAL optimizations
in AELEM.
Typed arrays would also help a lot here.</p>

<h2>Increasing precision</h2>

<p>The casual reader might have noticed that the compiler result would
not pass the shootout precision tests as it produced an invalid
result.</p>

<p>Wanted: +-1e-8 with arg 1000</p>

<pre><code>-0.169075164
-0.169087605
</code></pre>

<p>Have: with arg 1000</p>

<pre><code>-0.169075214
-0.169087656
</code></pre>

<p>That's not even close, it's only 6 digits of precision. The perl aficionado
might remember the Config settings <code>nvgformat</code>, the format used to print an NV, and
<code>d_longdbl</code>, which tells whether <code>long double</code> is available.</p>

<p><code>long double</code> is however not used as NV, and worse, <code>%g</code> is used as
printf format for NV, not <code>%.16g</code> as it should be to keep double
precision. <code>%g</code> just gives a pretty result for the casual reader, but is not
suitable to keep precision, e.g. for Data::Dumper, YAML, JSON,
or the B::C or perlito compilers.</p>
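
<p>The difference is easy to see with one of the nbody constants. This is only a minimal sketch with plain <code>sprintf</code>, not the actual serialization code of B::C; the variable names are made up:</p>

<pre><code>use strict;
use warnings;

my $x = 4.84143144246472090e+00;    # Jupiter's x position from nbody.perl

my $short = sprintf("%g",    $x);   # default %g keeps only 6 significant digits
my $long  = sprintf("%.16g", $x);   # 16 significant digits

# error introduced by writing the number out and reading it back
printf("%%g    gives %s, error %.3g\n", $short, abs($short - $x));
printf("%%.16g gives %s, error %.3g\n", $long,  abs($long  - $x));
</code></pre>

<p>With <code>%g</code> the constant is stored as 4.84143, an error in the order of 1e-6, which is far above the 1e-8 tolerance of the shootout test.</p>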

<p>So I changed the NV format to %.16g with commit <a href="https://github.com/rurban/perl-compiler/commit/3fc61aa69af24d438a2983a15996362207443f43">3fc61aa</a>
and the precision went up, passing the nbody shootout test with argument 1000.</p>

<p>New result with arg 1000</p>

<pre><code>-0.169075164
-0.169087605
</code></pre>

<p>Exactly the same. Also for other n cmdline arguments.</p>

<p>See <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-3.html">part 3</a> which finds more optimizations, being 2x faster on top of this.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>My perl5 TODO list</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/09/my-perl5-todo-list.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3878</id>

    <published>2012-09-24T19:42:25Z</published>
    <updated>2012-10-15T20:42:16Z</updated>

    <summary>Below is a formal list of possible optimizations, which most would agree on. We had these discussions in 2001 with Damian, where perl6 and perl5i took off. I&apos;d like to work on these for perl5 core and need decisions. Most...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="optimizations" label="optimizations" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Below is a formal list of possible optimizations, which most would agree on.
We had these discussions in 2001 with Damian, where perl6 and perl5i took off.
I'd like to work on these for perl5 core and need decisions. Most p5p 
hackers seem to be informed about the general possibilities and directions, 
but not all. We'd need this to improve general perl5 performance, and also 
help static compilation.[1]</p>

<p>We had this before, so I'd like to keep it formal. So each
proposal gets a perl6-like name, and replies should change the subject
to that name. I choose PDD for "Perl Design Draft".</p>

<p>Beforehand: "compiler" means op.c not B::C. compile-time and run-time
should be obvious.</p>

<h1>PDD01 const / readonly lexicals</h1>

<p>The CONST op currently is a SVOP, holding a global gvsv.
A CONST op might hold lexicals also, a PADOP type.
The more constants the compiler knows at compile-time the better
it can optimize.
The following datatypes need to be represented as const: </p>

<ul>
<li><p>PADSV (lexicals and esp. function arguments)</p></li>
<li><p>"PDD02 final classes - const @ISA"</p></li>
<li><p>"PDD03 immutable classes - const %class::"</p></li>
</ul>

<p>Esp. readonly function arguments need to be parsed into lexical consts, 
but "my const $i" or "my $i:ro" also. I have no opinion on "my $i is ro", 
but it would be the best choice.
See "PDD05 Function and method signatures"</p>
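
<p>What the compiler already does with the constants it knows about can be observed with the core B::Deparse module. A small sketch, <code>PI</code> and <code>$pi</code> are just illustrative names: the <code>use constant</code> value is folded at compile-time, the lexical holding the same value is not.</p>

<pre><code>use strict;
use warnings;
use B::Deparse;

use constant PI =&gt; 3;      # a real compile-time constant
my $pi = 3;                # same value, but a mutable lexical

my $deparse = B::Deparse-&gt;new;

# PI * 2 is folded into the literal 6 while compiling
my $folded   = $deparse-&gt;coderef2text(sub { return PI * 2 });

# $pi * 2 stays a run-time multiplication
my $unfolded = $deparse-&gt;coderef2text(sub { return $pi * 2 });

print "constant: $folded\n";
print "lexical:  $unfolded\n";
</code></pre>

<p>A const lexical as proposed here would allow the second sub to be folded in the same way.</p>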

<h2>Datatypes:</h2>

<p>SVf_READONLY already is good enough to hold this information in the data.
But the compiler does not want to optimize on datatypes, the information
needs to be represented as an OP. Just for the special cases @ISA and stashes
it is not needed.</p>

<p>So either add a mixin svop+padop type for CONST discriminated by OPpCONST_PAD 1,
add a CONST flag to PADSV,</p>

<p>or add a new CONSTPAD op, replacing PADSV/const which needs to be added 
into all current CONST checks in the compiler.</p>

<h2>CONST with OPpCONST_PAD flag:</h2>

<p>Pro: Easier and faster for the compiler.</p>

<p>Contra: The logic for the new OP type which is a union of SVOP and PADOP
  needs to be added for all accessors. B and its libraries, but also XS walkers.</p>

<h2>PADSV with OPpPAD_CONST flag</h2>

<p>Pro: Does not break libraries</p>

<p>Contra: CONST checks need to check PADSV's also.</p>

<h2>CONSTPAD:</h2>

<p>Pro: Does not break libraries</p>

<p>Contra: CONST checks need to check CONSTPAD also.</p>

<p>Personally I lean against CONSTPAD.</p>

<h2>Keywords: (how to parse)</h2>

<p>The following variants are being considered:
lexicals and globals:</p>

<pre><code>my const $i; my const ($i, $j) = (0, 1);   (as const keyword upfront)
my $i :ro;
my $i is ro;
</code></pre>

<p>See "PDD05 Function and method signatures"</p>

<pre><code>sub call (const $i) {}
sub call ($i:ro) {}
sub call ($i is ro) {}
</code></pre>

<p>See "PDD02 final classes - const @ISA"</p>

<pre><code>const our @ISA = ('MyBase');
our @ISA :ro = ('MyBase');
our @ISA is ro = ('MyBase');
class MyClass is final {
  our @ISA = ('MyBase');
}
class MyClass (extends =&gt; ('MyBase'), is_final =&gt; 1) {}
</code></pre>

<p>See "PDD03 immutable classes - const %class::"</p>

<pre><code>const package MyClass { } and const package MyClass;
const %MyClass::;
class MyClass is immutable {}
class MyClass (is_immutable =&gt; 1) {}
</code></pre>

<p>No keyword. immutable should be the new default for the class keyword, 
old-style packages stay mutable.</p>

<p>Keyword discussion:</p>

<p>The type qualifier const, which creates a CONST/CONSTPAD op and sets the
SVf_READONLY flag, can be represented as a new keyword "const",
which looks most natural, but is hardest to parse. Larry opposed it
initially, because it looked too C++ish.  But nowadays it looks best.</p>

<p>The attribute would be easiest to parse, as a MYTERM already parses
and handles attributes. The MYTERM type just needs to be extended for
signatures. It also looks natural.</p>

<p>The perl6-like type trait is harder to parse, and a bit unnatural
for lexicals.</p>

<p>The Moose style hash attributes only work for classes, not for
lexicals and sigs.</p>

<h1>PDD02 final classes - const @ISA</h1>

<p>A const @ISA is commonly known as the "final" keyword. The class is not extendable, 
so the compiler can do compile-time method resolution, i.e. convert a method call
to a function call.</p>

<p>Pro: Compile-time method resolution</p>

<p>If the compiler knows at compile-time for each method that all isa's
until the method is found are const, and that those classes are also immutable (const),
the method call can be converted to a function call.
That would be a huge performance win, esp. for classes which favor methods
over hash accessors. </p>

<p>Note that the accessor typo problem could also be solved with const 
hashes for the object representation, but nobody is using that yet. 
A const class (const %classname::) does not solve it, as it is independent of the 
underlying object representation, which is usually a blessed hash.</p>

<p>Function calls are slow, and method calls are at least 10% slower still. 
(10% for immediately found methods, for a deeper search the run-time costs are higher.)</p>
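
<p>At the perl level the effect of compile-time method resolution can be imitated by hand. A sketch with made-up classes (<code>MyBase</code>, <code>Other</code>, <code>MyClass</code> and <code>area</code> are hypothetical): resolving the method once with <code>can</code> turns the method call into a plain function call, and the last lines show why this is only safe with a const @ISA.</p>

<pre><code>use strict;
use warnings;

package MyBase  { sub new { return bless {}, shift } sub area { return 42 } }
package Other   { sub area { return 7 } }
package MyClass { our @ISA = ('MyBase'); }

package main;

my $obj = MyClass-&gt;new;

# run-time method resolution: searched via @MyClass::ISA
my $dynamic = $obj-&gt;area;

# resolved once by hand: from here on it is a plain function call
my $code   = MyClass-&gt;can('area');
my $static = $code-&gt;($obj);

# without a const @ISA anybody may still change the inheritance
@MyClass::ISA = ('Other');
my $after = $obj-&gt;area;       # now finds Other::area
my $stale = $code-&gt;($obj);    # still calls MyBase::area

print "$dynamic $static $after $stale\n";
</code></pre>

<p>This prints <code>42 42 7 42</code>: the pre-resolved call no longer agrees with the dynamic one after @ISA was changed, which is exactly the case a final class rules out.</p>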

<p>Contra:</p>

<p>I hope the "final" problem is known from java. Since the compiler needs to
know in advance the inheritances it is not possible to extend and override
methods of final classes. One cannot extend java strings.
Thanks to Michael Schwern for the discussion.</p>

<p>Solutions:</p>

<ol>
<li><p>(Reini): Define the following convention. No additional keywords needed.
Libraries may use final, but finalization is deferred until the application
is processed, and all libraries (use statements) are already loaded. So mocking
is still possible, but the default is to use compile-time method resolution.
Schwern sees a problem in that scheme which I haven't understood yet.</p></li>
<li><p>(Larry): Libraries may use final, but the application with a
<code>#pragma final</code> has the final word.</p></li>
</ol>

<p>See also <a href="https://github.com/rurban/perl/blob/1fbee7377bf50bfdcdfbc1e3f8bbddb608015262/pddtypes.pod">pddtypes.pod</a></p>

<h1>PDD03 immutable classes - const %class::</h1>

<p>Classes should default to immutable, packages keep the dynamic behaviour
unless a package is declared as const. (Damian)</p>

<p>Some might know from Moose that making classes immutable makes them 20x faster, 
even if not all possible optimizations are done yet. </p>

<h1>PDD04 Types</h1>

<p>They are already parsed for lexicals, just not for named arguments.
The 3 coretypes int, num, string need to be reserved. p5-mop will
probably define more. bool needs to be added probably also.</p>

<p>Type conventions in core are needed to</p>

<ol>
<li><p>talk to other languages, like json, perl6 or java, </p></li>
<li><p>to specify the wanted behavior for methods acting on types, 
such as smartmatch or multi-methods, or </p></li>
<li><p>for special performance purposes, e.g. int loop counters, int arithmetic,
smaller and faster typed arrays or hashes, or to enforce compile-time method resolution.</p></li>
</ol>

<p>See <a href="https://github.com/rurban/perl/blob/1fbee7377bf50bfdcdfbc1e3f8bbddb608015262/pddtypes.pod">pddtypes.pod</a>
and <a href="https://github.com/rurban/perl/blob/1fbee7377bf50bfdcdfbc1e3f8bbddb608015262/pod/perltypes.pod">perltypes.pod</a>
I had an old version at <a href="http://blogs.perl.org/users/rurban/2011/02/use-types.html">my blog</a> and 
at <a href="http://cpansearch.perl.org/src/RURBAN/types-0.05_04/doc/yapc_2011.pod">YAPC</a></p>

<p>An initial benefit would be natively typed arrays and hashes in core,
with const hashes even optimizable hashes (so called "perfect
hashes"). Further type checks and optimizers are left to modules.</p>

<h1>PDD04.1 CHECK_SCALAR_ATTRIBUTES</h1>

<p>Compile-time attribute hook for our three types to 
be able to use attributes for my declarations.</p>

<p>Note: Attributes still suffer from an over-engineered and broken
Attribute::Handlers implementation which evals the attribute value.</p>

<p><code>our $name:help(print the name);</code> will call eval "print the name";</p>

<p>Without fixing this, attributes will have no chance to be accepted.
The syntax is nice, and it is already parsed.</p>

<h1>PDD05 Function and method signatures</h1>

<p>The current prototype syntax explicitly allows named arguments.
There are several implementations already.</p>

<p>But there are several decisions required.</p>

<p>In order to optimize function and method calls, we need to define
type qualifiers, and eventually return types, even if they are not
used yet.</p>

<p>New syntax allows changing the semantics.</p>

<p>Let's follow perl6: </p>

<ul>
<li><p>is bind (default) vs is copy (old semantics)</p></li>
<li><p>is ro (default) vs is rw (old semantics)</p></li>
<li><p>allow passing types and attributes to functions.
Attributes allow user-defined hooks as now, just on function entries,
not on variable declarations.</p></li>
</ul>

<p>Optional arguments are defined by specifying defaults.</p>

<p>If we do not follow perl6 syntax with "is", we need attributes
to specify ":rw" and possibly "\$" to specify bind (by reference).</p>

<pre><code>e.g. sub myadder (\$i, $num = 1) { $i += $num }
or   sub myadder ($i:rw, $num = 1) { $i += $num }
</code></pre>

<p>bind ro is by far the fastest calling convention, 
optimizable and checkable by the compiler. copy is the safe way,
rw uses the old $_[n] semantics.</p>
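
<p>The copy and rw conventions can already be shown in today's perl5. The signature syntax proposed above is not implemented, so this sketch (the function names are made up) works on <code>@_</code> directly:</p>

<pre><code>use strict;
use warnings;

# copy semantics: the arguments are copied into fresh lexicals
sub add_copy {
    my ($i, $num) = @_;
    $i += $num;          # changes only the local copy
    return $i;
}

# rw semantics: $_[0] is an alias to the caller's variable
sub add_rw {
    $_[0] += $_[1];      # changes the caller's variable
    return;
}

my $n   = 1;
my $sum = add_copy($n, 2);   # $n is left untouched
my $n_after_copy = $n;
add_rw($n, 2);               # $n is modified in place

print "$sum $n_after_copy $n\n";
</code></pre>

<p>This prints <code>3 1 3</code>. A bind ro argument would be passed like the alias in <code>add_rw</code>, without the copy, but any write to it would be rejected by the compiler.</p>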

<p>I outlined my proposal in <a href="https://github.com/rurban/perl/blob/1fbee7377bf50bfdcdfbc1e3f8bbddb608015262/pddtypes.pod">pddtypes.pod</a></p>

<p>Q: Do function args and return values keep constness?</p>

<p>A: Only function args by ref. This is current behaviour and makes sense.</p>

<h1>PDD06 Function return types</h1>

<p>Any optimizer needs to stop if a function return type is not known.
We don't even know if any value is returned at all, so we have to check @_
at every LEAVESUB, though the parser knows the context information already.
By optionally declaring return types, a type checker and optimizer can kick in.
Esp. for coretypes like int, num, str or void or a const qualifier.</p>

<p>There exist old and wildly different syntaxes for return types,
but they are unused. Use the perl6 syntax, which is c-like.</p>

<p>Q: What about libraries declaring their return values constant? I cannot change
them then and have to copy them?</p>

<p>A: No. Return values so far are not const. Only if you declare a function to
return a const it will be so.</p>

<h1>PDD07 Compile-time entersub optimizations</h1>

<p>Calling a function via ENTERSUB and cleaning
up at LEAVESUB is by far the slowest part of perl.</p>

<p>We can check our functions for the following situations:
exceptions, jumps out, lexicals, locals, function calls,
recursive calls.</p>

<p>If none of these occur, the function can be inlined.</p>

<p>We also need to check for tail calls and arguments. (signatures)</p>

<p>If no exceptions or no locals occur the parts in ENTERSUB and LEAVESUB
which deal with that can be skipped.</p>

<p>We need to store the context and possible return type in ENTERSUB and
LEAVESUB to speed up @_ handling.  </p>

<p>We need to separate XS calls from ENTERSUB.</p>

<h1>PDD08 Compile-time op-arg optimizations</h1>

<p>Our current optree resolves op argument types (the compile-time op
flags and also the POP'ed flags) at run-time.  For the cases where the op
itself specifies the behavior or the argument type can be
inferred at compile-time (lvalue, context, magic, ...), an optimized op version should be
used.</p>

<p>Promote type pessimization to all affected ops, and use optimized ops
for non-pessimized. Similar to i_opt (integer constant folding) if all
operands are non-magic IVs.</p>

<p>The biggest blockers are function borders. Without named arguments
passed as bind (alias), each function must optimize from scratch and
loses all information.</p>

<h1>PDD09 Compile-time function inlining</h1>

<p>See "PDD07 Compile-time entersub optimizations". entersub (and
leavesub) needs to hold compiler information about the function, which
requires waiting for parsing all embedded functions.</p>

<p>Even functions with arguments can be inlined, for safe versions with 
arguments by copy, and destructive arguments by bind. They just need 
a scope block.</p>

<h1>PDD10 Compile-time method resolution</h1>

<p>We can easily change run-time method calls at compile-time to
function calls. What is left is a decision on "PDD02 final classes - const @ISA"
and "PDD03 immutable classes - const %class::"</p>

<p>Outlined here <a href="http://blogs.perl.org/users/rurban/2011/06/how-perl-calls-subs-and-methods.html">how-perl-calls-subs-and-methods</a>
and further refined at "Compile-time type optimizations" in <a href="https://github.com/rurban/perl/blob/1fbee7377bf50bfdcdfbc1e3f8bbddb608015262/pod/perltypes.pod">perltypes</a></p>

<h1>PDD11 Compile-time method inlining</h1>

<p>This just does method resolution (change to functions) and
then does function inlining.</p>

<h1>PDD12 Run-time method caching</h1>

<p>This is trivial as there are already isa change hooks.
METHOD_NAMED and METHOD just need to check a global method or
object cache.</p>

<h1>PDD13 Multi</h1>

<p>multi needs types. (As smartmatch needs types to work reliably.)</p>

<p>As for the syntax multi can be implemented traditionally where the
compiler generates the different methods per types automatically, or
the perl6 way, with a separate keyword.  I see no problem with the
first approach. This would need no new keyword.</p>

<h1>PDD14 MOP</h1>

<p>The current MOP discussion and opinion is mainly about the new class
and method keywords, but a MOP has nothing to do with that. Also not
with Moose or a new object system. A MOP allows the definition of new
behaviour for classes, methods, attributes, types, roles, inheritance
and so on.  How they are initialized, the layout, the behavior. A
definition of alternate object systems. It is mainly proposed to
overcome a Moose problem with anonymous packages, to separate classes
from stashes.</p>

<p>Introducing a MOP is good if the current object system is not good
enough.  The current object system is not good enough for Moose, and
should be improved. There need to be two separate discussions. One
about what improvements Moose needs from the traditional stash based
objects (global vs lexical namespaces - anon Packages), and the second
about the MOP itself.</p>

<p>I have no opinion on the mop. Just this: Why bother with a mop while
some basic language features are not yet decided upon? Moose does not
even use types properly yet. This smells of premature hooks. But 
pmichaud is highly convinced that a p5 mop is a good thing.</p>

<h1>PDD20 no vivify</h1>

<p>Something like autovivification needs to get added to improve the optree.
As shown in <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-3.html">optimizing-compiler-benchmarks-part-3</a> disabling vivification of arrays but also hashes will lead to compile-time optimizations and dramatic performance improvements, similar to const arrays or hashes, but even better.</p>

<h1>PDD21 no magic</h1>

<p>Similar to no vivify or const lexicals, a lexical no magic pragma can lead to compile-time optimizations and dramatic performance improvements.</p>

<h1>PDD22 slimmer nextstate</h1>

<p>Slimmer nextstate op variants can be selected at compile-time: variants which do not 
reset PL_taint or the stack pointer and do not call FREETMPS.</p>

<h1>PDD23 loop unrolling</h1>

<p>As shown in <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html">optimizing-compiler-benchmarks-part-2</a>
AELEMFAST is about 2 times faster than the generic AELEM, but it needs to know the index at compile-time. This is easy to do for loops.</p>

<p>Unroll loops with known size and lots of AELEM into AELEMFAST accesses 
automatically.</p>

<h1>PDD30 Alternative parser</h1>

<p>The worst part of perl is the parser. It is a hack, it is fast, but 
changing and esp. adding rules in a sane manner is hard, because the parser
deviates in too many ways from a lexer/tokenizer separation. For adding
new syntax you usually cannot just add the syntax rules to perly.y.</p>

<p>Second, generating a traditional AST which generates a better optree
(better optimizable, or able to emit jit or native code) is worthwhile.</p>

<h1>PDD31 Alternative vm</h1>

<p>Our VM is a stack machine, which handles the stack on the heap.
There are no typed alternatives. </p>

<p>There are integer optimized ops, but they are rarely used. "use
integer" and "my int" can overcome this, but overflow behaviour needs
to be defined: either slow promotion to number or fast integer wrap,
unsigned or signed. With "my int" this behaviour can be changed.</p>

<p>The VM is simple and easy to XS, but has major problems.
An alternative VM could be based on parrot or vmkit or simply
reuse the existing ops, with a different compiler and different
stack handling.</p>

<p>A c-stack based compiler could arrange the optree as a natively
compiled or jit'ed C program. Before each op call the op arguments
(0-2 SV pointers) are put on the stack, lexicals also as in native
closures, and functions are called natively via cdecl or stdcall, 
depending on if we need varargs.</p>

<p>By using LLVM even a register based (fastcall) layout can be arranged.</p>

<h1>PDD32 Jit</h1>

<p>A jit could solve the run-time decisions for dynamic cases, which are not
solvable at compile-time. But the vm should be JIT friendly. The current
VM is quite jit-friendly, but the ops themselves are too dynamic. There need
to be pre-compiled optimized alternatives for certain ops with known 
argument types.</p>

<p>To be practical I'm thinking of adding labels with a naming scheme to most ops,
which a JIT or LLVM could hook into.</p>

<p>Just some random examples from pp.c, to give you an idea.</p>

<pre><code>PP(pp_pos)
{
    dVAR; dSP; dPOPss;

    if (PL_op-&gt;op_flags &amp; OPf_MOD || LVRET) {
      pp_pos_mod:
      SV * const ret = sv_2mortal(newSV_type(SVt_PVLV));
      sv_magic(ret, NULL, PERL_MAGIC_pos, NULL, 0);
      LvTYPE(ret) = '.';
      LvTARG(ret) = SvREFCNT_inc_simple(sv);
      PUSHs(ret);    /* no SvSETMAGIC */
      RETURN;
    }
    else {
      if (SvTYPE(sv) &gt;= SVt_PVMG &amp;&amp; SvMAGIC(sv)) {
        pp_pos_mg:
        const MAGIC * const mg = mg_find(sv, PERL_MAGIC_regex_global);
        if (mg &amp;&amp; mg-&gt;mg_len &gt;= 0) {
          dTARGET;
          I32 i = mg-&gt;mg_len;
          if (DO_UTF8(sv))
            sv_pos_b2u(sv, &amp;i);
          PUSHi(i);
          RETURN;
        }
      }
      RETPUSHUNDEF;
    }
}

PP(pp_refgen)
{
    dVAR; dSP; dMARK;
    if (GIMME != G_ARRAY) {
      pp_refgen_gimme_not_array:
      if (++MARK &lt;= SP)
        *MARK = *SP;
      else
        *MARK = &amp;PL_sv_undef;
      *MARK = refto(*MARK);
      SP = MARK;
      RETURN;
    }
    pp_refgen_gimme_array:
    EXTEND_MORTAL(SP - MARK);
    while (++MARK &lt;= SP)
      *MARK = refto(*MARK);
    RETURN;
}
</code></pre>

<p>Footnotes:</p>

<ol>
<li><p>"Ertl and Gregg analyze the performance of the following interpreters:
Gforth, OCaml, Scheme48, Yap, Perl, Xlisp. While Gforth, OCaml,
Scheme48 and Yap are categorized as efficient interpreters, Perl and
Xlisp benchmarks are used for comparison purposes as inefficient
interpreters.</p>

<p>While efficient interpreters perform with a slowdown by a factor of 10
when compared to an optimizing native code compiler, inefficient
interpreters have a slowdown by a factor of 1000."</p></li>
</ol>

<p>M. Anton Ertl and David Gregg. The structure and performance of
efficient interpreters. Journal of Instruction-Level Parallelism,
5:1-25, November 2003. Cited on pages 6 and 7.
<a href="https://students.ics.uci.edu/~sbruntha/cgi-bin/download.py?key=thesis">https://students.ics.uci.edu/~sbruntha/cgi-bin/download.py?key=thesis</a></p>
]]>
        

    </content>
</entry>

<entry>
    <title>Optimizing compiler benchmarks (part 1)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3856</id>

    <published>2012-09-20T21:39:27Z</published>
    <updated>2012-10-04T16:16:05Z</updated>

    <summary>Since my goal is to improve the compiler optimizer (statically with B::CC, but also the perl compiler in op.c) I came to produce these interesting benchmarks. I took the regex-dna example from &quot;The Computer Language Benchmarks Game&quot; at shootout.alioth.debian.org/ $...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="benchmark" label="benchmark" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Since my goal is to improve the compiler optimizer (statically with B::CC, but also the perl compiler in op.c) I came to produce these interesting benchmarks.</p>

<p>I took the regex-dna example from <em>"The Computer Language Benchmarks Game"</em> at <a href="http://shootout.alioth.debian.org/">shootout.alioth.debian.org/</a></p>

<pre><code>$ time perl t/regex-dna.pl &lt;t/regexdna-input
agggtaaa|tttaccct 0
[cgt]gggtaaa|tttaccc[acg] 3
a[act]ggtaaa|tttacc[agt]t 9
ag[act]gtaaa|tttac[agt]ct 8
agg[act]taaa|ttta[agt]cct 10
aggg[acg]aaa|ttt[cgt]ccct 3
agggt[cgt]aa|tt[acg]accct 4
agggta[cgt]a|t[acg]taccct 3
agggtaa[cgt]|[acg]ttaccct 5

101745
100000
133640

real    0m0.130s  (varying from 0.125 to 0.132)
user    0m0.120s
sys     0m0.008s
</code></pre>

<p>t/regexdna-input contains about 100KB (1671 lines) of DNA code, which is used to match
DNA 8-mers and substitute nucleotides for IUB codes.</p>

<pre><code>$ wc t/regexdna-input 
1671   1680 101745 t/regexdna-input
</code></pre>

<p>Perl performs pretty well in this <a href="http://shootout.alioth.debian.org/u64q/performance.php?test=regexdna">benchmark</a>,
it is actually the fastest scripting language.  But the compiler should
do better, and I had some ideas to try out for the optimizing
compiler. So I thought.</p>

<p>First the simple and stable B::C compiler with -O3:</p>

<pre><code>$ perlcc -O3 -o regex-dna-c -S t/regex-dna.pl
$ time ./regex-dna-c &lt;t/regexdna-input
agggtaaa|tttaccct 0
[cgt]gggtaaa|tttaccc[acg] 3
a[act]ggtaaa|tttacc[agt]t 9
ag[act]gtaaa|tttac[agt]ct 8
agg[act]taaa|ttta[agt]cct 10
aggg[acg]aaa|ttt[cgt]ccct 3
agggt[cgt]aa|tt[acg]accct 4
agggta[cgt]a|t[acg]taccct 3
agggtaa[cgt]|[acg]ttaccct 5

101745
100000
133640

real    0m0.285s
user    0m0.272s
sys     0m0.004s
</code></pre>

<p>0.130s vs 0.285s compiled? What's going on? B::C promises faster startup-time and equal run-time.
With -S we keep the intermediate C source to study it.
Let's try B::CC, via -O. Here you don't need -O3, as B::CC already contains all B::C -O3 optimizations.</p>

<pre><code>$ perlcc -O -o regex-dna-cc t/regex-dna.pl
$ time ./regex-dna-cc &lt;t/regexdna-input
...
real    0m0.267s
user    0m0.256s
sys     0m0.008s
</code></pre>

<p>Hmm? Let's see what's going on with <strong>-v5</strong>.</p>

<pre><code>$ perlcc -O3 -v5 -S -oregex-dna-c -v5 t/regex-dna.pl

script/perlcc: Compiling t/regex-dna.pl
script/perlcc: Writing C on regex-dna-c.c
script/perlcc: Calling /usr/local/bin/perl5.14.2d-nt -Iblib/arch -Iblib/lib -MO=C,-O3,-Dsp,-v,-oregex-dna-c.c t/regex-dna.pl
Starting compile
 Walking tree
 done main optree, walking symtable for extras
 Prescan 0 packages for unused subs in main::
 %skip_package: B::Stackobj B::Section B::FAKEOP B::C B::C::Section::SUPER B::C::Flags
 B::Asmdata O DB B::CC Term::ReadLine B::Shadow B::C::Section B::Bblock B::Pseudoreg
 B::C::InitSection B::C::InitSection::SUPER
 descend_marked_unused: 
...
%INC and @INC:
 Delete unsaved packages from %INC, so run-time require will pull them in:
 Deleting IO::Handle from %INC
 Deleting XSLoader from %INC
 Deleting B::C::Flags from %INC
 Deleting B::Asmdata from %INC
 Deleting Tie::Hash::NamedCapture from %INC
 Deleting B::C from %INC
 Deleting SelectSaver from %INC
 Deleting IO::Seekable from %INC
 Deleting base from %INC
 Deleting Config from %INC
 Deleting B from %INC
 Deleting Fcntl from %INC
 Deleting IO from %INC
 Deleting Symbol from %INC
 Deleting O from %INC
 Deleting Carp from %INC
 Deleting mro from %INC
 Deleting File::Spec::Unix from %INC
 Deleting FileHandle from %INC
 Deleting Exporter::Heavy from %INC
 Deleting strict from %INC
 Deleting Exporter from %INC
 Deleting vars from %INC
 Deleting Errno from %INC
 Deleting File::Spec from %INC
 Deleting IO::File from %INC
 Deleting DynaLoader from %INC
 %include_package: warnings warnings::register
 %INC: warnings.pm warnings/register.pm
 amagic_generation = 1
 Writing output
 Total number of OPs processed: 323
 NULLOP count: 8
</code></pre>

<p>%include_package contains: <strong>warnings warnings::register</strong>. These two cost a lot of time. 
Carp is also a nice example of code bloat for the static compiler.</p>

<p>Let's try without:</p>

<pre><code>$ perlcc -O3 -Uwarnings -Uwarnings::register -S -oregex-dna-c1  t/regex-dna.pl
$ wc regex-dna-c.c
2293  16084 128953 regex-dna-c.c
$ wc regex-dna-c1.c
1201  7488 57236 regex-dna-c1.c
</code></pre>

<p>128953 down to 57236 bytes: the size doubles with warnings. So a lot of startup-time overhead.</p>

<pre><code>$ perlcc -O -O2 -Uwarnings -Uwarnings::register -S -oregex-dna-cc1 t/regex-dna.pl

$ time ./regex-dna-c1 &lt;t/regexdna-input
...
real    0m0.284s
user    0m0.271s
sys     0m0.004s

$ time ./regex-dna-cc1 &lt;t/regexdna-input
...
real    0m0.266s
user    0m0.255s
sys     0m0.008s
</code></pre>

<p>Not much gain by stripping warnings, since the main part is run-time. Startup-time is usually 
0.010s (uncompiled) to 0.001s (compiled).</p>

<p>Wait, which perl is perlcc calling at all? Hopefully the same as perl. Nope. As it turns out 
perlcc was built with a debugging perl, and comparing debugging perls with non-debugging ones explains the doubled run-time. You can see it with -v in the output above: /usr/local/bin/perl5.14.2d-nt, which is my <a href="http://search.cpan.org/dist/App-perlall/">perlall-derived</a> naming convention for debugging non-threaded.</p>

<p>Recompiling the compiler with normal perl, and re-testing:</p>

<pre><code>$ perl -S perlcc -O3 -Uwarnings -Uwarnings::register -S -oregex-dna-c1  t/regex-dna.pl
$ perl -S perlcc -O -O2 -Uwarnings -Uwarnings::register -S -oregex-dna-cc1  t/regex-dna.pl

$ time ./regex-dna-c1 &lt;t/regexdna-input
...
real    0m0.127s
user    0m0.124s
sys     0m0.000s

$ time ./regex-dna-cc1 &lt;t/regexdna-input
...
real    0m0.121s
user    0m0.120s
sys     0m0.008s
</code></pre>

<p>0.130s vs 0.127s (compiled) vs 0.121s (optimizing compiled) now makes sense. But there is not much room to improve here, as the regex engine already has a pretty good DFA (not the fastest, re::Engine::RE2 would be faster), and it is not optimizable by the optimizing compiler.</p>

<p>Better optimize numbers. Tomorrow. I want to improve <em>stack smashing</em> in B::CC.
Getting rid of copying intermediate C values from the C stack and back to the perl heap.</p>

<p>See the <a href="http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html">arithmetic part 2</a></p>
]]>
        

    </content>
</entry>

<entry>
    <title>Reading binary floating-point numbers (numbers part2)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/09/reading-binary-floating-point-numbers-numbers-part2.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3827</id>

    <published>2012-09-13T16:09:06Z</published>
    <updated>2012-09-17T01:45:48Z</updated>

    <summary>As explained in my previous blog post about parrot and numbers parrot writes floating-point numbers in the native format and reads foreign floating-point numbers in different formats. What kind of floating-point formats exist? I&apos;m only studying the commonly used base...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>As explained in my previous blog post <a href="http://blogs.perl.org/users/rurban/2012/09/native-pbc-in-parrot-revived.html">about parrot and
numbers</a>
parrot writes floating-point numbers in the native format and reads
foreign floating-point numbers in different formats.</p>

<h1>What kind of floating-point formats exist?</h1>

<p>I'm only studying the commonly used base 2 encodings, usually called
<strong>double</strong>.  Base 10 encodings decimal32, decimal64 and decimal128
also exist.</p>

<p>IEEE-754 defines half-precision (binary16), float (binary32), double
(binary64) and quad float (binary128). It does <strong>not</strong> define the most
popular format <strong>long double</strong>, esp. not the intel extended precision
format, which you normally would associate with long double.  There is
an IEEE-754 long double but this only works on sparc64 and s390.</p>

<p>And since IEEE-754 long double is almost never used and hard to
implement in silicon, other architectures deviated wildly also.</p>

<ul>
<li><p>Intel uses 80-bit (10 byte) for its 12 or 16-byte long double. 
A completely different representation from IEEE-754.</p></li>
<li><p>Powerpc uses two 8-byte doubles for its 16-byte long double. The
result is the sum of the two. double-double.</p></li>
<li><p>MIPS uses a different binary format to represent NaN, and Inf</p></li>
<li><p>I'm not so sure yet about AIX/S390 NaN.</p></li>
<li><p>sparc64 implements IEEE-754 quad float (binary128) properly, it can store it in %q registers, but the arithmetic is done in SW.</p></li>
</ul>

<p>I am choosing little-endian representation here, but big-endian uses
the same algorithm and code. When reading different endianness, just
byteswap it before you are doing the conversion.</p>
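<p>The byteswap itself can be as simple as the following sketch. <code>swap_bytes</code> is a hypothetical helper name, not the function parrot uses. It reverses a buffer of any size in place, so the same helper works for 4, 8, 10 or 16 byte numbers.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the byte order of an n-byte number in place, so that a
   big-endian value can be fed to the little-endian converters
   (and vice versa). */
static void swap_bytes(unsigned char *buf, size_t n)
{
    size_t i;
    assert(buf != NULL);
    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = buf[i];
        buf[i] = buf[n - 1 - i];
        buf[n - 1 - i] = tmp;
    }
}
```

<p>A big-endian double 1.0 is stored as 3f f0 00 00 00 00 00 00. After <code>swap_bytes(buf, 8)</code> the 3f ends up in byte [7], which is where the little-endian layouts below expect the sign and exponent.</p>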

<h1>4-byte single FLOAT / IEEE-754 binary32</h1>

<p>This is single precision, and only used in tiny machines. It is not
even fast to compute, unless your HW is optimized to do float. double
is normally faster than float, since double is HW supported.</p>

<p>It uses 4 byte, 32 bit. 8 bits for the exponent, 
and 23 bits for the mantissa. 
It can preserve 7-9 decimal digits.</p>

<pre><code>   sign    1 bit  31
   exp     8 bits 30-23     bias 127
   frac   23 bits 22-0

+[3]----+[2]----+[1]----+[0]----+
S|  exp  |   fraction           |
+-------+-------+-------+-------+
1|&lt;--8--&gt;|&lt;---23 bits----------&gt;|
&lt;-----------32 bits-------------&gt;
</code></pre>

<p>The so-called significand is the 23 fraction bits, plus an implicit leading bit
which is always 1 unless the exponent bits are all 0.
So the total precision is 24 bit, log10(2**24) ≈ 7.225 decimal digits</p>

<p>s: significand</p>

<pre><code>e=0x0,  s=0:  =&gt; +-0.0
e=0xff, s=0:  =&gt; +-Inf
e=0xff, s!=0: =&gt; NaN
</code></pre>

<p>It is particularly odd that the sign + exponent does not align to the 
first byte: the lowest exponent bit spills over into the top bit of the second byte.
So you have to mask off the exp and fraction.</p>
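<p>As a sketch of that masking, this is how the three fields can be pulled out of a native float. The struct and function names are made up for this example, and a 4-byte IEEE-754 float is assumed.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE-754 binary32. */
struct f32_fields {
    unsigned sign;   /* bit 31 */
    unsigned expo;   /* bits 30-23, biased by 127 */
    uint32_t frac;   /* bits 22-0, without the implicit leading 1 */
};

static struct f32_fields decode_f32(float f)
{
    struct f32_fields r;
    uint32_t bits;
    assert(sizeof(float) == 4);
    memcpy(&bits, &f, 4);            /* bit copy, avoids aliasing problems */
    r.sign = bits >> 31;
    r.expo = (bits >> 23) & 0xff;    /* straddles bytes [3] and [2] */
    r.frac = bits & 0x7fffff;
    return r;
}
```

<p>For -2.5f this gives sign 1, exponent 128 (127 + 1) and fraction 0x200000, i.e. -1.25 * 2**1.</p>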

<p>A simple conversion to double is best done as compiler cast. Every compiler
can do float and double.</p>

<pre><code>cvt_num4_num8(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&amp;f, src, 4);
    d = (double)f;
    ROUND_TO(d, double, 7);
    memcpy(dest, &amp;d, 8);
}
</code></pre>

<p>This is a problematic case. double has more precision than float, so the
result needs to be rounded to 7-9 digits.</p>
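<p>ROUND_TO is not shown here. One simple way to get such a rounding, as a sketch only, is to print the number with the wanted number of significant digits and read it back. <code>round_sig</code> is a name I made up, it is not the parrot macro, and going through a string is slow.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Round d to `digits` significant decimal digits by printing and
   re-reading it. "%.*e" prints one digit before the decimal point,
   so the precision argument is digits - 1. */
static double round_sig(double d, int digits)
{
    char buf[64];
    assert(digits > 0 && digits <= 17);
    snprintf(buf, sizeof buf, "%.*e", digits - 1, d);
    return strtod(buf, NULL);
}
```

<p>The float 0.1f widened to double is 0.100000001490116..., which is not the double 0.1. After <code>round_sig(d, 7)</code> it is.</p>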

<h1>8-byte DOUBLE float / IEEE-754 binary64</h1>

<p>This is double precision, the most popular format. 
It uses 8 byte, 64 bit. 11 bits for the exponent, 
and 52 bits for the mantissa. 
It can preserve 15-16 decimal digits, DBL_DIG.</p>

<pre><code>   sign    1 bit  63
   exp    11 bits 62-52     bias 1023
   frac   52 bits 51-0      (53 bit precision implicit)

+[7]----+[6]----+[5]----+[4]----+[3]----+[2]----+[1]----+[0]----+
S|   exp   |                  fraction                          |
+-------+-------+-------+-------+-------+-------+-------+-------+
1|&lt;---11--&gt;|&lt;---------------------52 bits----------------------&gt;|
&lt;---------------------------64 bits-----------------------------&gt;
</code></pre>

<p>Precision: log10(2**53) ≈ 15.955 decimal digits. The implicit leading bit
is assumed to be 1 unless the exponent is 0 (zero and denormals).</p>

<p>s: significand</p>

<pre><code>e=0x0,   s=0:  =&gt; +-0.0
e=0x7ff, s=0:  =&gt; +-Inf
e=0x7ff, s!=0: =&gt; NaN

3ff0 0000 0000 0000   = 1
4000 0000 0000 0000   = 2
8000 0000 0000 0000   = -0
7ff0 0000 0000 0000   = Inf
fff0 0000 0000 0000   = -Inf
3fd5 5555 5555 5555   ~ 1/3
</code></pre>
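<p>Such bit patterns are easy to check on any machine with IEEE-754 doubles. A small sketch, with <code>double_bits</code> being my own helper name:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64 bits of a double as an integer. Because the value
   is read as one integer, the result is the same on little-endian and
   big-endian hosts. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    assert(sizeof(double) == 8);
    memcpy(&bits, &d, 8);
    return bits;
}
```

<p>Printed with <code>"%016llx"</code> (after a cast to unsigned long long), 1.0 gives 3ff0000000000000, 2.0 gives 4000000000000000, -0.0 gives 8000000000000000 and 1.0/3.0 gives 3fd5555555555555.</p>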

<p>Read into single float:</p>

<pre><code>cvt_num8_num4(unsigned char *dest, const unsigned char *src)
{
    float f;
    double d;
    memcpy(&amp;d, src, 8);   /* read the 8-byte double */
    f = (float)d;
    memcpy(dest, &amp;f, 4);  /* write the 4-byte float */
}
</code></pre>

<p>No rounding problems.</p>

<p>Read into native long double:</p>

<pre><code>cvt_num8_numld(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    memcpy(&amp;d, src, 8);
    ld = (long double)d;
    memcpy(dest, &amp;ld, sizeof(long double));
}
</code></pre>

<p>No rounding problems, as the compiler cast should handle that.  Note
that "native long double" can be up to 6 different binary
representations, i386, amd64+i64, ppc, mips, aix or sparc64+s390.</p>
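<p>Which of these representations the local compiler uses can be guessed from <code>LDBL_MANT_DIG</code> in <code>float.h</code>. This is a rough sketch, not what parrot's configure step does. It only looks at the significand size, so it cannot tell the 12-byte and 16-byte intel variants apart, and <code>long_double_format</code> is a hypothetical name.</p>

```c
#include <assert.h>
#include <float.h>

/* Guess the native long double format from the number of
   significand bits the compiler reports. */
static const char *long_double_format(void)
{
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);
    switch (LDBL_MANT_DIG) {
    case 53:  return "same as double (binary64)";
    case 64:  return "intel 80-bit extended precision";
    case 106: return "powerpc double-double";
    case 113: return "IEEE-754 binary128";
    default:  return "unknown";
    }
}
```

<p>Together with <code>sizeof(long double)</code> (12 or 16 for the intel case) this is enough to pick the right converter at build time.</p>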

<h1>80bit, 10-byte intel extended precision LONG DOUBLE</h1>

<p>This is stored as 12-byte on i386 or 16 byte on x86_64 and itanium,
however internally the format is still the old x87 extended precision
10 byte.</p>

<p>It uses 10 byte, 80 bit. 15 bits for the exponent, 
and 63 bits for the mantissa. 
It can preserve 17-19 decimal digits, LDBL_DIG.</p>

<pre><code>   padding 2 or 4 byte (i386/x86_64)
   sign    1 bit  79
   exp    15 bits 78-64     bias 16383
   intbit  1 bit  63        set if normalized
   frac   63 bits 62-0

+[11]---+[10]---+[9]----+[8]----+[7]----+[6] ...+[1]----+[0]----+
|   unused      |S|     Exp     |i|          Fract              |
+-------+-------+-------+-------+-------+--- ...+-------+-------+
|&lt;-----16------&gt;|1|&lt;-----15----&gt;|1|&lt;---------63 bits-----------&gt;|
&lt;--------------&gt;|&lt;----------------80 bits-----------------------&gt;
</code></pre>

<p>Precision: log10(2**63) ≈ 18.965 decimal digits for the fraction alone, or
log10(2**64) ≈ 19.266 when counting the integer bit. The leading bit
is stored explicitly here, not hidden as before.</p>

<p>s: significand = frac. Looks like the Norwegian Fjord designer also helped out here.
Note that this was a private clean-room design, not design by committee.</p>

<pre><code>e=0x0,    i=0, s=0:  =&gt; +-0.0
e=0x0,    i=0, s!=0: =&gt; denormal
e=0x0,    i=1:       =&gt; pseudo denormal (read, but not generated)
e=0x7fff, bits 63,62=00, s=0  =&gt; old +-Inf, invalid since 80387
e=0x7fff, bits 63,62=00, s!=0 =&gt; old NaN, invalid since 80387
e=0x7fff, bits 63,62=01:      =&gt; old NaN, invalid since 80387
e=0x7fff, bits 63,62=10, s=0  =&gt; +-Inf
e=0x7fff, bits 63,62=10, s!=0 =&gt; NaN
e=0x7fff, bits 63,62=11, s=0  =&gt; silent indefinite NaN (internal Inf, 0/0, ...)
e=0x7fff, bits 63,62=11, s!=0 =&gt; silent NaN

i=0: =&gt; denormal (read, but not generated)
i=1: =&gt; normal
</code></pre>
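<p>The table above translates into code fairly directly. The following is a sketch with names of my own choosing, taking the 10 significant bytes in little-endian order. It lumps all NaN variants together and reports a nonzero exponent with a clear integer bit as "unnormal".</p>

```c
#include <assert.h>
#include <string.h>

enum x87_class { X87_ZERO, X87_DENORMAL, X87_PSEUDO_DENORMAL,
                 X87_NORMAL, X87_UNNORMAL, X87_INF, X87_NAN, X87_INVALID };

/* src: the 10 bytes of an intel extended precision number,
   little-endian, without the padding bytes */
static enum x87_class classify_x87(const unsigned char *src)
{
    static const unsigned char zero7[7] = {0};
    int expo, intbit, frac_nonzero;

    assert(src != NULL);
    expo = ((src[9] & 0x7f) << 8) | src[8];
    intbit = src[7] >> 7;
    /* the 63 fraction bits below the integer bit */
    frac_nonzero = (src[7] & 0x7f) != 0 || memcmp(src, zero7, 7) != 0;

    if (expo == 0) {
        if (intbit)
            return X87_PSEUDO_DENORMAL;
        return frac_nonzero ? X87_DENORMAL : X87_ZERO;
    }
    if (expo == 0x7fff) {
        if (!intbit)
            return X87_INVALID;      /* the old pre-80387 encodings */
        return frac_nonzero ? X87_NAN : X87_INF;
    }
    return intbit ? X87_NORMAL : X87_UNNORMAL;
}
```

<p>For example the bytes 00 00 00 00 00 00 00 80 ff 3f (1.0) classify as normal, and the same bytes with ff 7f at the end as Inf.</p>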

<p>Reading this number into a double is tricky:</p>

<pre><code>cvt_num10_num8(unsigned char *dest, const unsigned char *src)
{
    int expo, i, sign;
    memset(dest, 0, 8);
    /* exponents 15 -&gt; 11 bits */
    sign = src[9] &amp; 0x80;
    expo = ((src[9] &amp; 0x7f)&lt;&lt; 8 | src[8]);
    if (expo == 0) {
      nul:
        if (sign)
            dest[7] |= 0x80;
        return;
    }
    expo -= 16383;       /* - bias long double */
    expo += 1023;        /* + bias for double */
    if (expo &lt;= 0)       /* underflow */
        goto nul;
    if (expo &gt; 0x7ff) {  /* inf/nan */
        dest[7] = 0x7f;
        dest[6] = src[7] == 0xc0 ? 0xf8 : 0xf0 ;
        goto nul;
    }
    expo &lt;&lt;= 4;
    dest[6] = (expo &amp; 0xff);
    dest[7] = (expo &amp; 0x7f00) &gt;&gt; 8;
    if (sign)
        dest[7] |= 0x80;
    /* long double frac 63 bits =&gt; 52 bits
       src[7] &amp;= 0x7f; reset intbit 63 */
    for (i = 0; i &lt; 6; ++i) {
        dest[i+1] |= (i==5 ? src[7] &amp; 0x7f : src[i+2]) &gt;&gt; 3;
        dest[i] |= (src[i+2] &amp; 0x1f) &lt;&lt; 5;
    }
    dest[0] |= src[1] &gt;&gt; 3;
}
</code></pre>

<p>No rounding problems.</p>

<p>Reading an intel long double into an IEEE-754 quad double (__float128)
is similar, but the difference counts.</p>

<pre><code>cvt_num10_num16(unsigned char *dest, const unsigned char *src)
{

    int expo, i;
    memset(dest, 0, 16);
    dest[15] = src[9]; /* sign + exp */
    dest[14] = src[8];
    expo = ((src[9] &amp; 0x7f)&lt;&lt; 8 | src[8]);
    expo -= 16383;
    /* On Intel expo 0 is allowed */
    if (expo &lt;= 0)     /* underflow */
        return;
    if (expo &gt; 0x7ff)  /* overflow, inf/nan */
        return;
    /* shortcut the zero mantissa check */
#if __WORDSIZE == 64
    if (*(const uint64_t*)src != 0x8000000000000000LU)
#else
    if (*(const uint32_t*)src || *(const uint32_t*)&amp;src[4] != 0x80000000U)
#endif
    {
      /* frac 63 bits =&gt; top 63 of the 112 bits: shift left by one,
         which drops the explicit intbit 63 */
      for (i = 13; i &gt; 5; i--) {
          dest[i] |= (unsigned char)(src[i-6] &lt;&lt; 1)
                  | (i &gt; 6 ? src[i-7] &gt;&gt; 7 : 0);
      }
    }
    ROUND_TO((__float128)*dest, __float128, 18);
}
</code></pre>

<p>Need to properly round the result into the Intel LDBL_DIG precision (18).
Cut off the rest.</p>

<h1>Powerpc 16-byte LONG DOUBLE aka "double-double"</h1>

<p>With <code>-mlong-double-128</code> a long double on ppc32 or ppc64 is stored in 16
bytes. It simply stores two double numbers, a "head" and a "tail", one
after another. The head being rounded to the nearest double and the tail
containing the rest.</p>

<p>It stores a 106-bit significand (2*53), but the range is limited to that of the
double format, with its 11-bit exponent. It can preserve 31 decimal digits, LDBL_DIG.
log10(2**106) ≈ 31.909 decimal digits</p>

<p>Reading such a number is trivial, just return the sum of the two 8-byte
doubles. There is only one special case: -0.0</p>

<pre><code>cvt_num16ppc_num8(unsigned char *dest, const unsigned char *src)
{
    double d1, d2;
    long double ld;
    memcpy(&amp;d1, src, 8);
    memcpy(&amp;d2, src+8, 8);
    ld = (d1 == 0.0 &amp;&amp; d2 == 0.0) ? d1 : d1 + d2;  /* keep the sign of a -0.0 head */
    d1 = (double)ld;
    memcpy(dest, &amp;d1, 8);
}
</code></pre>

<p>Converting a foreign floating-point number to this format is also trivial.
You don't have to care about splitting up the number into two, you can
simply cast it.</p>

<pre><code>cvt_num10_num16ppc(unsigned char *dest, const unsigned char *src)
{
    double d;
    long double ld;
    cvt_num10_num8((unsigned char *)&amp;d, src);
    ld = (long double)d;
    ROUND_TO(ld, long double, 18);
    memcpy(dest, &amp;ld, 16);
}
</code></pre>

<p>You just have to take care of proper rounding, as with every number read
into a more precise format. I hardcoded 18 here as a ppc does not know 
about intel long doubles.</p>

<h1>16-byte quadruple double / IEEE-754 binary128</h1>

<p>This is a very rare format, native long double on sparc64 and S/390
CPUs, and SW simulated since GCC 4.6 as __float128. I have no idea
why Intel did not adopt that yet.  Sparc has %q0 registers since V8
(1992), but a sparc64 has no direct math support in HW. S/390 G5
supports it since 1998.</p>

<p>It uses 16 byte, 128 bit. 15 bits for the exponent as with the intel
80bit long double, and 112 bits for the mantissa.  It can preserve 34
decimal digits, FLT128_DIG.</p>

<pre><code>   sign   1  bit 127
   exp   15 bits 126-112   bias 16383
   frac 112 bits 111-0     (113 bits precision implicit)

+[15]---+[14]---+[13]---+[12]---+[11]---+[10]---+[9] .. +[0]----+
S|      exp     |             fraction                          |
+-------+-------+-------+-------+-------+-------+--- .. +-------+
1|&lt;-----15-----&gt;|&lt;---------------------112 bits----------------&gt;|
&lt;--------------------------128 bits-----------------------------&gt;
</code></pre>

<p>Precision: log10(2**113) ≈ 34.016 decimal digits. The implicit leading bit
is assumed to be 1 unless the exponent is 0 (zero and denormals).</p>
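<p>All the precision figures in this post come from the same formula, bits * log10(2). As a small sketch to recompute them (<code>decimal_digits</code> is just a name for this example):</p>

```c
#include <assert.h>

/* log10(2), written out as a constant so no libm is needed */
#define LOG10_2 0.30102999566398120

/* Decimal digits carried by a significand of `bits` bits. */
static double decimal_digits(int bits)
{
    assert(bits > 0);
    return bits * LOG10_2;
}
```

<p>This gives 7.225 for 24 bits (float), 15.955 for 53 (double), 18.965 for 63 (intel fraction), 31.909 for 106 (ppc double-double) and 34.016 for 113 (quad).</p>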

<p>s: significand</p>

<pre><code>e=0x0,    s=0:  =&gt; +-0.0
e=0x7fff, s=0:  =&gt; +-Inf
e=0x7fff, s!=0: =&gt; NaN

3fff 0000 0000 0000 0000 0000 0000 0000   = 1
c000 0000 0000 0000 0000 0000 0000 0000   = -2
7fff 0000 0000 0000 0000 0000 0000 0000   = Inf
3ffd 5555 5555 5555 5555 5555 5555 5555   ≈  1/3
</code></pre>

<p>Since this format uses the same exponents as an intel long double
this conversion is trivial. Just the intel normalization bit must be set.</p>

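<pre><code>cvt_num16_num10(unsigned char *dest, const unsigned char *src)
{
    int i;
    memset(dest, 0, sizeof(long double));
    /* simply copy over sign + exp */
    dest[9] = src[15];
    dest[8] = src[14];
    /* and copy the top 63 fraction bits (111-49), shifted right by
       one to make room for the explicit integer bit */
    dest[7] = src[13] &gt;&gt; 1;
    for (i = 6; i &gt;= 0; i--)
        dest[i] = (src[i+7] &lt;&lt; 7) | (src[i+6] &gt;&gt; 1);
    if ((src[15] &amp; 0x7f) | src[14])
        dest[7] |= 0x80;  /* set integer bit 63, unless zero or denormal */
}
</code></pre>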

<p>No rounding problems.</p>

<p><em>Maybe I'll find time to find the remaining deviations for MIPS and
AIX long double formats.  I do not care about old non-IEEE IBM
extended precision formats, such as on S/360 and S/370 machines.</em></p>

<h1>Further Hacks</h1>

<p>If you are having fun with this post you seriously want to look at this hack:
<a href="http://blog.quenta.org/2012/09/0x5f3759df.html">http://blog.quenta.org/2012/09/0x5f3759df.html</a> which uses the float format for a fast inverse square root function.</p>

<p>This is the function of Quake and Carmack fame that bit-casts a floating-point value to an int, does simple integer arithmetic, and then bit-casts the result back:</p>

<pre><code>float xhalf = 0.5F * x;    // half of the input, for the Newton step
int i = * (int*)&amp;x; // evil floating point bit level hacking
i = 0x5f3759df - (i &gt;&gt; 1); // what the fuck?
x = * (float*)&amp;i;
x = x * (1.5F - (xhalf * x * x)); // one Newton-Raphson iteration
</code></pre>

<p>The history and correct code is <a href="http://eggroll.unbsj.ca/rsqrt/rsqrt.pdf">here</a> <em>(turns out 0x5f375a86 is better)</em> and more tricks are <a href="http://rufus.hackish.org/~rufus/FPtricks.pdf">here</a></p>
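<p>For trying it out, here is a self-contained variant of the snippet as a sketch. It uses memcpy instead of the pointer casts, which avoids the aliasing problems modern compilers complain about, and it assumes a 4-byte IEEE-754 float. <code>fast_rsqrt</code> is my name for it.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 with the well-known magic constant
   and one Newton-Raphson step. */
static float fast_rsqrt(float x)
{
    float xhalf = 0.5f * x;
    uint32_t i;
    assert(sizeof(float) == 4 && x > 0.0f);
    memcpy(&i, &x, 4);                   /* bit-cast float -> int */
    i = 0x5f3759df - (i >> 1);           /* magic initial guess */
    memcpy(&x, &i, 4);                   /* bit-cast int -> float */
    return x * (1.5f - xhalf * x * x);   /* refine the guess once */
}
```

<p>For 4.0f the result is close to, but not exactly, 0.5. A second Newton step would tighten it further.</p>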
]]>
        

    </content>
</entry>

<entry>
    <title>native_pbc in parrot revived (numbers part1)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/09/native-pbc-in-parrot-revived.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3824</id>

    <published>2012-09-13T04:43:36Z</published>
    <updated>2012-09-14T15:19:34Z</updated>

    <summary>The design for parrot, the vm (virtual machine) under rakudo (perl6), envisioned a platform and version compatible, fast, binary format for scripts and modules. Something perl5 was missing. Well, .pbc and .pmc from ByteLoader serves this purpose, but since it...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="parrot" label="parrot" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>The design for <strong>parrot</strong>, the vm (virtual machine) under <strong>rakudo</strong> (perl6),
envisioned a platform and version compatible, fast, binary format for
scripts and modules. Something perl5 was missing. Well, .pbc and .pmc
from <a href="http://search.cpan.org/dist/B-C/">ByteLoader</a> serve this purpose,
but since it uses source filters it is not that fast.</p>

<p>Having a binary and platform independent compiled format can skip the
parsing, compiling and optimizing steps each time a script or module is
loaded.</p>

<p>Version compatibility was broken with the 1.0 parrot release, that's why
I left the project in protest a few years ago. Platform compatibility
is still a goal but seriously broken, because the tests were
disabled, and nobody cared.</p>

<p>Since I have to wait in perl5 land until p5p can discuss and decide on
a syntax for the upcoming improvements which I can then implement in
the type optimizers and the static
<a href="http://search.cpan.org/dist/B-C/">B::CC</a> compiler, I went back to
parrot. p5p needs a few years to understand the most basic performance
issues first. The basic obstacles in parrot were gone, parrot is
almost bug free and has most features rakudo needs, but is lacking
performance.</p>

<h2>Platform compatibility</h2>

<p>So I tried to enable platform compatibility again. I wrote and fixed
most of the native_pbc code several years ago until 1.0, and only a
little bit of bitrot crept in. Platform-compatible means, any platform
can write a pbc and any other platform should be able to read this
format.  Normally such a format would require a fixed format, not so
.pbc.  The pbc format is optimized for native reads and writes, so all
integers, pointers and numbers are stored in native format, and when
you try to open such a file on a different platform converters will
try to read those 3 types. integers and pointers can be 4 or 8 byte,
little or big endian.  This is pretty easy to support.</p>
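<p>As an illustration of why the integer case is easy, here is a sketch of such a reader. It is not the parrot code, and <code>fetch_int</code> is a made-up name. It reads a 4 or 8 byte signed integer in either byte order into a native 64-bit value.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a signed integer of `size` bytes (4 or 8) stored with the given
   byte order and return it as a native 64-bit value. */
static int64_t fetch_int(const unsigned char *src, size_t size, int big_endian)
{
    uint64_t v = 0;
    size_t i;
    assert(src != NULL && (size == 4 || size == 8));
    for (i = 0; i < size; i++) {
        /* walk from the most significant byte to the least */
        unsigned char byte = big_endian ? src[i] : src[size - 1 - i];
        v = (v << 8) | byte;
    }
    if (size == 4)
        return (int32_t)(uint32_t)v;   /* sign-extend the 4-byte case */
    return (int64_t)v;
}
```

<p>Because the value is assembled byte by byte, the function does not care about the endianness or alignment requirements of the machine it runs on.</p>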

<p>The problem comes with numbers.  Supported until now was <strong>double</strong>
and the intel specific <strong>long double</strong> format. The main problem is
that the intel long double format is a tricky and pretty non-standard
format. It has 80 bits, which is 10 bytes, but the numbers are stored
with padding bytes, 12 byte on 32-bit and 16 byte on 64-bit.  2 bytes
or 6 bytes padding. Here Intel got padding right but in the normal
compiler ABI Intel is the only processor which does not care about
alignment. Which leads to slow code, and countless alignment problems
with fast SSE integer code.  Most other processors require stricter
alignment to be able to access ints and numbers faster. Intel code is
also not easy to compile on better processors, because they fail on
unaligned access. You cannot just access every single byte in a stream
at will.  At least you should not.</p>

<p>As it turns out sparc64 and s390 (AIX) use for long double the
IEEE-754 standard quad double 16-byte binary format, which is the best
so far we can get, GCC since 4.6 supports the same format as
__float128 (via the
<a href="http://gcc.gnu.org/onlinedocs/libquadmath/">quadmath</a> library), and
finally powerpc has its own third <a href="https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man3/float.3.html">long double
format</a>
with <code>-mlong-double-128</code>, which is two normal 8-byte doubles, one after
the other, and the value is the sum of the two, "head" and "tail". It's
commonly called ppc "double-double".  For smaller devices the typical
format is single float, 4 bytes, thankfully in IEEE-754 standard
format. All compilers can at least read and write it. But when it
comes to va_arg() accessing ... arguments from functions, gcc refuses
float, because float arguments are promoted to double when passed through <code>...</code>.</p>
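
<p>The "sum of head and tail" rule for the ppc double-double can be sketched like this. It assumes a big-endian layout as on the old Powerbooks; <code>read_ppc_double_double</code> is a hypothetical name, and exact fractions are used only because Python has no wider native float.</p>

```python
import struct
from fractions import Fraction


def read_ppc_double_double(raw):
    """Return the exact value of a 16-byte big-endian ppc double-double."""
    head, tail = struct.unpack("!dd", raw)   # "!" means big-endian, no padding
    # Fractions add exactly, so the extra precision carried by the tail is kept.
    return Fraction(head) + Fraction(tail)


# Build a sample value 1 + 2**-60, which does not fit into a single double.
raw = struct.pack("!dd", 1.0, 2.0 ** -60)
value = read_ppc_double_double(raw)
print(value == 1 + Fraction(1, 2 ** 60))   # the tail survives
print(float(value))                        # rounding to one double drops it
```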

<p>So after rewriting the test library I still found some bugs in the
code.</p>

<p>So I fixed a lot of those old bugs, esp. various intel long double
confusions: with the padding bytes, 12 or 16 bytes, and a special
normalize bit at 63, which is always 1 when a valid number was written
to disc. So when reading such a number this bit is not part of the
mantissa.  Documentation for these formats was also wrong. And I added
support for all missing major number formats to parrot, <code>float</code>, <code>double</code>,
<code>long double</code> in various variants: FLOATTYPE_10 for intel,
FLOATTYPE_16PPC for the powerpc double-double format, and finally
FLOATTYPE_16 for IEEE-754 quadmath, i.e. <code>__float128</code> or sparc64/s390 long
double.</p>
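
<p>To make the intel layout concrete, here is a rough decoder sketch. It is not parrot's converter: it assumes little-endian input, the name <code>read_intel_long_double</code> is invented, and it collapses the value into a Python double, so precision and range beyond a double are lost. Bit 63 is the explicit leading 1 that the other IEEE formats keep implicit.</p>

```python
import math


def read_intel_long_double(raw):
    """Decode a little-endian x87 extended double stored in 10, 12 or 16 bytes."""
    if len(raw) not in (10, 12, 16):
        raise ValueError("expected 10, 12 or 16 bytes")
    mantissa = int.from_bytes(raw[0:8], "little")    # bit 63 is the explicit leading 1
    sign_exp = int.from_bytes(raw[8:10], "little")   # everything after byte 10 is padding
    sign = -1.0 if sign_exp // 0x8000 else 1.0
    exponent = sign_exp % 0x8000
    if exponent == 0x7FFF:                           # Inf or NaN
        return sign * math.inf if mantissa % 2 ** 63 == 0 else math.nan
    try:
        # value = mantissa / 2**63 * 2**(exponent - 16383)
        return sign * math.ldexp(mantissa, exponent - 16383 - 63)
    except OverflowError:                            # beyond the range of a double
        return sign * math.inf


one_on_32bit = bytes(7) + bytes([0x80, 0xFF, 0x3F]) + bytes(2)        # 10 bytes + 2 padding
minus_2_5_on_64bit = bytes(7) + bytes([0xA0, 0x00, 0xC0]) + bytes(6)  # 10 bytes + 6 padding
print(read_intel_long_double(one_on_32bit))
print(read_intel_long_double(minus_2_5_on_64bit))
```

<p>The padding bytes are simply ignored, which is why the 12-byte and 16-byte variants can share one reader.</p>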

<h2>sparc64</h2>

<p>The biggest obstacle for progress was always the lack of an UltraSparc
to test the last number format. As it turns out, a simple darwin/ppc
Powerbook G4 was enough to generate all needed formats, together with
a normal Intel multilib linux.  My colleague Erin Schoenhals gave me
her old powerbook for $100.  The Powerbook could generate float,
double, long double which is really a 16ppc double-double and gcc 4.6
could generate __float128, which is the same format as a 64bit sparc
long double.</p>

<h2>Good enough tests</h2>

<p>One important goal was a stable test suite. That means real errors
should be found, invalid old .pbc files should be skipped (remember,
pbc is not version compatible anymore) and numbers differing only in
the natural precision loss of a conversion should be compared
intelligently. Interestingly there does not even exist a good perl5
<a href="http://search.cpan.org/dist/Test-More/">Test::More</a> or
<a href="http://search.cpan.org/dist/Test-Simple/">Test::Builder</a> <em>numcmp</em>
method to compare floating point numbers in the needed
precision. There is a
<a href="http://search.cpan.org/dist/Test-Number-Delta">Test::Number::Delta</a>
on CPAN, but this was not good enough. It only uses some epsilon, not
the number of valid precision digits, and the test is also numerically
not stable enough. And really, number comparisons should be in the
standard. I added a <a href="https://github.com/parrot/parrot/commit/4aaf1cf5c3473922cac232dfa925b0fe86bba7b4">Test::Builder::numcmp</a> method locally. It works on lines of strings,
but could be easily changed to take an arrayref and single number also.</p>
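
<p>The idea of comparing by valid digits instead of by an epsilon can be sketched as follows. This is a Python illustration only, not the Perl method from the commit above; the signature is an assumption.</p>

```python
def numcmp(got, expected, digits):
    """Compare two numbers after rounding both to `digits` significant digits."""
    fmt = "%." + str(digits - 1) + "e"     # e.g. digits=7 gives "%.6e"
    return fmt % got == fmt % expected


print(numcmp(3.1415927, 3.14159265358979, 7))   # same in the first 7 digits
print(numcmp(3.1415927, 3.14159265358979, 9))   # differs in the 9th digit
print(numcmp(1.0e-20, 1.1e-20, 2))              # an absolute epsilon would call these equal
```

<p>Formatting in exponent notation makes the comparison independent of the magnitude, which a fixed epsilon is not.</p>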

<h2>Expected precision loss</h2>

<p>So what is the expected precision loss when reading e.g. a float with
intel long double? A float claims to hold 7 digits without loss,
<code>FLT_DIG</code>, so such a conversion should keep 7 digits precision, and the
test only needs to test the first 7 digits. The mantissa holds 24
bits, <code>log10(2**24) ≈ 7.225</code> decimal digits. So <code>123456789.0</code> stored as
float, converted to long double needs to be compared with something
like <code>/^1234567\d*/</code> if done naively. It can be <code>123456755.0</code> or any other
number between <code>123456700.0</code> and <code>123456799.4</code>. Better to round the last significant
digit.</p>
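
<p>The arithmetic can be checked with a few lines of Python. The <code>struct</code> round trip stands in for writing a 4-byte float on one platform and reading it into a wider type on another.</p>

```python
import math
import struct

# Decimal digits carried by a 24-bit mantissa.
print(round(math.log10(2 ** 24), 3))

# Store the value as a 4-byte float and read it back as a wider number.
stored = struct.unpack("!f", struct.pack("!f", 123456789.0))[0]
print(stored)

# Round both to 7 significant digits before comparing.
print("%.7g" % stored == "%.7g" % 123456789.0)
```

<p>The stored value comes back as <code>123456792.0</code>, so only a comparison rounded to 7 significant digits succeeds.</p>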

<p>But first of all, what is the right precision to survive a number ->
string -> number round trip? Numbers need to be sprintf-printed
precisely enough and need to be restored from strings precisely
enough. Printing more digits than supported will lead to imprecise
numbers when being read back, and the same when printing not enough
digits. The C library defines various constants for this number:
<code>FLT_DIG=7</code>, <code>DBL_DIG=16</code>, <code>LDBL_DIG=18</code>, <code>FLT128_DIG=34</code>.  But
better than trusting your C library vendor is a configure probe, now
in
<a href="https://github.com/parrot/parrot/blob/native_pbc2/config/auto/format.pm">auto::format</a>.
So parrot outsmarts perl5 now by testing for the best and most precise
sprintf format to print numbers. As experimentally found out, this
number is usually one less than the advertised *_DIG
definition. double uses <code>%.15g</code>, not <code>%.16g</code>, float uses <code>%.6g</code>, and
so on. But this might vary with the CPU and C library used. Before,
parrot used hardcoded magic numbers. And wrongly.</p>

<p>One might say, why bother? Simply stringify it when exporting it. 
Everything is already supported in your C library. 
Two counter arguments:</p>

<ol>
<li><p><strong>Fixed-width size</strong>. Native floats are easily stored in fixed-width
records, strings not.  Accessing the x-th float on disc is
significantly faster with fixed size, and native floats are also
significantly smaller than strings.</p></li>
<li><p><strong>Precision loss</strong>: With stringification you'll lose precision. In my
configure probe I verified that we always lose the last digit. The
previous code in imcc had this loss hardcoded, 15 instead of 16.</p></li>
</ol>
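
<p>The fixed-width argument in a few lines of Python. The record layout is an assumption for illustration (8-byte big-endian doubles back to back), not the actual pbc constant table.</p>

```python
import struct

RECORD_SIZE = struct.calcsize("!d")    # 8 bytes per double

# A tiny "file" holding five doubles back to back.
numbers = [0.5, -1.25, 3.0, 1e100, 42.0]
blob = b"".join(struct.pack("!d", n) for n in numbers)


def nth_number(data, index):
    """Fetch the index-th double directly, without scanning earlier records."""
    return struct.unpack_from("!d", data, index * RECORD_SIZE)[0]


print(nth_number(blob, 3))
```

<p>With strings of varying length the same lookup would need either a scan or a separate offset table.</p>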

<h2>Storage</h2>

<p>parrot's Configure also checks now for the native floattype and its
size.  Previously a pbc header only recorded the size of a number; now the
type is tracked separately from the size.  The size of long double can be 10,
12, or 16 and can mean completely different binary representations.</p>

<p>As the next improvement: parrot used to store the parrot version triple in
the ops library header inside the pbc format. But whenever an ops
library changes, the other version number needs to be changed, the
<code>PBC_COMPAT</code> version number, or simply the bytecode version. This
needs to be done for format changes and for changes of native
ops, because parrot stores and accesses ops only by index, not by
name, and sorts its ops on every change. This was my main criticism when
I left parrot with 1.0, because it was never meant to work this way. Old ops
should be readable by newer parrots; only newer ops cannot be
understood by older ones. So new ops need to be added to the end.</p>

<p>So now the bytecode version is stored in the ops library header, and
newer parrot versions with the same bytecode version can still read
old pbc files. Older bytecode versions not yet, as it needs to revert
the policy change from v1.0, back to pre-v1.0.</p>
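
<p>Why index-based ops must be appended rather than re-sorted can be shown with a toy op table. The op names and the <code>disassemble</code> helper are hypothetical, not parrot's real ops.</p>

```python
# Hypothetical op names; a pbc file refers to ops only by their table index.
old_ops = ["add", "print", "sub"]
bytecode = [old_ops.index("print"), old_ops.index("sub")]   # compiled against old_ops


def disassemble(code, op_table):
    """Map stored op indexes back to names using the reader's op table."""
    return [op_table[i] for i in code]


# Appending a new op keeps every old index valid.
appended = old_ops + ["mul"]
print(disassemble(bytecode, appended))

# Re-sorting the table silently changes what the old bytecode means.
resorted = sorted(old_ops + ["mul"])
print(disassemble(bytecode, resorted))
```

<p>With the re-sorted table the old file decodes to <code>mul</code> and <code>print</code> instead of <code>print</code> and <code>sub</code>, and nothing signals the mistake.</p>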

<h2>mk_native_pbc</h2>

<p>The script to generate the native pbc on every <code>PBC_COMPAT</code> change
was pretty immature. I wrote it several years ago. I rewrote it, still
as shell script, but removed all bashisms, and enabled generating and
testing all supported floating point formats in one go with custom perl
Configure options <code>tools/dev/mk_native_pbc [--my-config-options...]</code>, 
or when called with <code>tools/dev/mk_native_pbc --noconf</code>
just generate and test the current configuration.</p>

<h2>Tests again</h2>

<p>As it turns out the tested numbers were also horrible. Someone went
the easy way and tested only some exponents in the numbers, but the
mantissas were always blank zeros. Numbers can be signed (there are one
or two sign bits in the format), there can be <code>-0.0</code>, <code>-Inf</code>, <code>Inf</code>, <code>NaN</code>,
and the mantissa is sometimes tricky to convert between various
formats. The new number test now has some such uncommon numbers to
actually test the converters and expected precision loss.</p>

<h2>Too much?</h2>

<p>With the 5 types - 4 (float), 8 (double), 10 (intel long double),
16ppc, and 16 (float128) - and little&lt;->big endian, there is a
combinatorial explosion in the number of converters. So I removed 50%
of them by converting endian-ness beforehand. Some of the easy
conversions are best done by compiler casts whenever the compiler
supports both formats, and 16ppc conversions are pretty trivial to do, so
there are only a few tricky conversions left, mainly with the intel
long double format. The 5*4 converters are still linked function
pointers, assigned at startup-time. So it's maintainable and fast.</p>
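
<p>The shape of that design, sketched in Python with only the two simplest types. The table keys and the <code>make_reader</code> helper are invented for illustration. parrot's real table holds C function pointers and also covers the long double variants.</p>

```python
import struct


def from_float(raw):
    return struct.unpack("!f", raw)[0]


def from_double(raw):
    return struct.unpack("!d", raw)[0]


# One converter per source type, looked up once per file instead of once per number.
CONVERTERS = {"float": from_float, "double": from_double}


def make_reader(src_type, src_big_endian):
    """Pick the converter for a file header and normalize the byte order first."""
    convert = CONVERTERS[src_type]
    if src_big_endian:
        return convert
    return lambda raw: convert(raw[::-1])      # swap bytes, then reuse the same converter


read = make_reader("double", src_big_endian=False)
print(read(struct.pack("!d", 1.5)[::-1]))      # a little-endian double from a foreign file
```

<p>Swapping the byte order before dispatch is what halves the table: each converter only ever sees one endianness.</p>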

<h2>Optimizations</h2>

<p>More optimizations were done by using more than single byte
operations, such as builtin native <code>bswap</code> operations (also a new
probe), and int16_t, int32_t and int64_t copy and compare ops. perl5
is btw. also pretty unoptimized in this regard.  Lots of unaligned
single-byte accesses. The worst of all scripting languages as measured
by
<a href="http://code.google.com/p/address-sanitizer/wiki/AddressSanitizer">AddressSanitizer</a>. A
typical register is 32bit or 64 bit wide, the whole width should be
used whenever possible. For a start, the perl5 hash function is
only fast on 32bit cpus. Fast checks could trade size for speed instead of
bitmasking every single bit. Maybe combine the most needed bits into
an aligned short. But as long as there are unhandled really big
optimization goals (functions, method calls, types, const) these micro
optimizations just stay in my head.</p>

<p>Code on <a href="https://github.com/parrot/parrot/commits/native_pbc2">https://github.com/parrot/parrot/commits/native_pbc2</a></p>

<p>In a <a href="http://blogs.perl.org/users/rurban/2012/09/reading-binary-floating-point-numbers-numbers-part2.html">followup post</a> I'll explain reading binary
representations of numbers for the general community. Reading foreign floats would even deserve
a new C library.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>ThreadSanitizer</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/rurban/2012/08/threadsanitizer.html" />
    <id>tag:blogs.perl.org,2012:/users/rurban//39.3672</id>

    <published>2012-08-09T17:04:12Z</published>
    <updated>2012-08-10T15:06:21Z</updated>

    <summary>Time for a new tool - tsan http://code.google.com/p/data-race-test/wiki/ThreadSanitizer See this pasty http://pastebin.com/gKRDTkh3 how I successfully analyzed a hairy race in parrot&apos;s new threads branch. This problem was a big blocker for parrot developers, as described in http://lists.parrot.org/pipermail/parrot-dev/2012-August/007078.html. To catch common...</summary>
    <author>
        <name>Reini Urban</name>
        <uri>http://rurban.xarch.at/</uri>
    </author>
    
    <category term="threads" label="threads" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/rurban/">
        <![CDATA[<p>Time for a new tool - <strong>tsan</strong> </p>

<p><a href="http://code.google.com/p/data-race-test/wiki/ThreadSanitizer">http://code.google.com/p/data-race-test/wiki/ThreadSanitizer</a></p>

<p>See this pasty <a href="http://pastebin.com/gKRDTkh3">http://pastebin.com/gKRDTkh3</a>
how I successfully analyzed a hairy race in parrot's new threads branch.
This problem was a big blocker for parrot developers, as described in <a href="http://lists.parrot.org/pipermail/parrot-dev/2012-August/007078.html">http://lists.parrot.org/pipermail/parrot-dev/2012-August/007078.html</a>.</p>

<p>To catch common memory errors you simply have to use the new <code>clang -faddress-sanitizer</code>, following the instructions in <a href="http://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlhacktips.pod#l1163">perlhacktips.pod</a>, or use valgrind.
But to detect race conditions <em>tsan</em> is another gold-mine from Google's Moscow lab. This is the old valgrind based version:</p>

<p><code>cd ~/bin;</code>
<code>wget http://build.chromium.org/p/client.tsan/binaries/tsan-r4356-amd64-linux-self-contained.sh;</code>
<code>chmod +x tsan-r4356-amd64-linux-self-contained.sh;</code>
<code>cd -</code></p>

<p><strong>runloop_id_counter race with threads, t/pmc/task.t</strong>  <a href="https://github.com/parrot/parrot/issues/808">https://github.com/parrot/parrot/issues/808</a></p>

<pre><code>tsan-r4356-amd64-linux-self-contained.sh ./parrot t/pmc/task.t

==2162== ThreadSanitizer, a data race detector
==2162== Copyright (C) 2008-2010, and GNU GPL'd, by Google Inc.
==2162== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==2162== Command: ./parrot t/pmc/task.t
==2162== 
==2162== ThreadSanitizerValgrind r4356: hybrid=no
==2162== INFO: Allocating 256Mb (32 * 8M) for Segments.
==2162== INFO: Will allocate up to 640Mb for 'previous' stack traces.
Valgrind: ignoring NaCl's mmap(84G)
Valgrind: ignoring NaCl's mmap(84G)
Valgrind: ignoring NaCl's mmap(84G)
1..6
ok 1 - initialized
ok 2 task1 ran
ok 3 task2 ran
ok 4 sub1 ran
==2162== INFO: T4 has been created by T0. Use --announce-threads to see the creation stack.
==2162== INFO: T5 has been created by T0. Use --announce-threads to see the creation stack.
==2162== WARNING: Possible data race during write of size 4 at 0x6373A60: {{{
==2162==    T5 (L{}):
==2162==     #0  reset_runloop_id_counter /usr/src/parrot/threads/src/call/ops.c:155
==2162==     #1  Parrot_thread_outer_runloop /usr/src/parrot/threads/src/thread.c:317
==2162==     #2  __asan::AsanThread::ThreadStart /home/rurban/Perl/parrot/threads/parrot
==2162==   Concurrent write(s) happened at (OR AFTER) these points:
==2162==    T4 (L{}):
==2162==     #0  new_runloop_jump_point /usr/src/parrot/threads/src/call/ops.c:191
==2162==     #1  runops /usr/src/parrot/threads/src/call/ops.c:88
==2162==     #2  Parrot_pcc_invoke_from_sig_object /usr/src/parrot/threads/src/call/pcc.c:338
==2162==     #3  Parrot_ext_call /usr/src/parrot/threads/src/extend.c:158
==2162==     #4  Parrot_Task_invoke /usr/src/parrot/threads/src/pmc/task.c:168
==2162==     #5  Parrot_pcc_invoke_from_sig_object /usr/src/parrot/threads/src/call/pcc.c:330
==2162==     #6  Parrot_ext_call /usr/src/parrot/threads/src/extend.c:158
==2162==     #7  Parrot_cx_next_task /usr/src/parrot/threads/src/scheduler.c:222
==2162==     #8  Parrot_thread_outer_runloop /usr/src/parrot/threads/src/thread.c:319
==2162==     #9  __asan::AsanThread::ThreadStart /home/rurban/Perl/parrot/threads/parrot
==2162==   Address 0x6373A60 is 0 bytes inside data symbol "runloop_id_counter"
==2162==    Race verifier data: 0x539B6AC,0x539A8AE
==2162== }}}
</code></pre>

<p>This announces a possible data race when writing <code>runloop_id_counter</code> ("Possible data race during write of size 4 at 0x6373A60") in the threads T4 and T5 with the listed backtraces.</p>

<p>The logical error was confirmed on <a href="http://irclog.perlgeek.de/parrot/2012-08-09#i_5887193">irc</a> by the author <em>nine</em>, Stefan Seifert (another fellow Austrian):</p>

<p><em>rurban</em>: I think for practical purposes only the new #810 is blocking threads now.
Should be easy to fix. Is niner somewhere around?</p>

<p><em>rurban</em>: Because there is a wrong assumption in his code and paper. 
threads.c:313 "there can be no active runloops at this point" - there is</p>

<p><em>rurban</em>: I'm building now with -DRUNLOOP_TRACE</p>

<p><em>benabik</em>: rurban++</p>

<p><em>nine</em>: rurban: I am around</p>

<p><em>rurban</em>: Hi, Do you understand the #810 runloop_id_counter race with threads,
 t/pmc/task.t ?</p>

<p><em>nine</em>: rurban: you are absolutely right. That comment is from the time when there was
only green threads. The runloop id counter should be moved into the interp. There's
absolutely no reason to share it between threads.</p>

<p><em>rurban</em>: :) sigh</p>

<p><em>rurban</em>: So this is the remaining blocker</p>

<p><em>rurban</em>: nine: Can you fix this so we can merge threads into master?</p>

<p><em>nine</em>: Well actually the nci.t hangs got me worried. I guess the timer stuff is racy
on some platforms. Tried to reproduce it today but there's just no way to get my
linux boxes to show any fault.</p>

<p><em>nine</em>: rurban: on it</p>

<p><em>nine</em>: I'll just get me a cup of coffee and start hacking</p>
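
<p>The fix nine describes, moving the counter from a shared global into each interpreter, looks roughly like this. It is a Python toy model: the <code>Interp</code> class and its method are hypothetical stand-ins, not parrot's C structures.</p>

```python
import threading


class Interp:
    """Hypothetical stand-in for a parrot interpreter; each thread owns one."""

    def __init__(self):
        self.runloop_id_counter = 0      # per-interpreter, so never shared

    def new_runloop(self):
        self.runloop_id_counter += 1
        return self.runloop_id_counter


def worker(interp, runs):
    for _ in range(runs):
        interp.new_runloop()


interps = [Interp() for _ in range(4)]
threads = [threading.Thread(target=worker, args=(i, 1000)) for i in interps]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([i.runloop_id_counter for i in interps])
```

<p>Because no two threads ever touch the same counter, there is nothing left to lock and nothing left to race on.</p>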

<p>tsan cannot analyze the new <strong>sleep</strong> deadlock in <code>t/pmc/nci_37.pasm</code> in the threads branch, because parrot is looping endlessly in a while loop.
Here you really have to gdb into it and bt the two threads or use darwin's Activity Monitor to list the two threads waiting for each other.</p>

<p>See this bug <strong>t/pmc/nci.t tests are fragile and broken by design</strong> <a href="https://github.com/parrot/parrot/issues/808">https://github.com/parrot/parrot/issues/808</a>.</p>

<p>But for the common case of detecting possible races within certain timing ranges 
tsan is very good. Without tsan it is also hard to reproduce the hang on normal HW at all, because it only shows up on super slow HW, like a Mac Powerbook G4/ppc or a mips32, essentially a simulated IRIX Indigo 2 with 200MHz.</p>

<p><a href="http://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces">http://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces</a> contains descriptions for the most popular data races:</p>

<ul>
<li>Simple race
<ul>
<li>Thread-hostile reference counting</li>
</ul></li>
<li>Race on a complex object</li>
<li>Notification</li>
<li>Publishing objects without synchronization</li>
<li>Initializing objects without synchronization</li>
<li>Reader Lock during a write</li>
<li>Race on bit field</li>
<li>Double-checked locking</li>
<li>Race during destruction</li>
<li>Data race on vptr</li>
<li>Race on free</li>
<li>Race during exit</li>
</ul>

<p>Note that currently TSan v2 is being reimplemented in LLVM, similar to ASan. 
This is compiler-based and is already in the LLVM trunk. 
See <a href="http://clang.llvm.org/docs/ThreadSanitizer.html">http://clang.llvm.org/docs/ThreadSanitizer.html</a> and <a href="http://code.google.com/p/thread-sanitizer/">http://code.google.com/p/thread-sanitizer/</a>.</p>

<p>It may not be mature enough yet and unfortunately won't help you if
you have races in non-instrumented code (inline assembly, JIT code,
system libraries), but it is way faster than the Valgrind-based TSan
and might be better for large apps.
For simple small testcases the Valgrind-based TSan is good enough, but if you need it on a non-valgrind, non-pin supported platform or with bigger apps go with LLVM trunk.</p>

<p>Note that there's a Windows version for TSan v1, based on <a href="http://www.pintool.org/">PIN</a>.</p>
]]>
        

    </content>
</entry>

</feed>
