Rewriting B:Deparse and Reintroducing B::DeparseTree and (part 1)

I will be giving a talk on B::DeparseTree and its use in a debugger Devel::Trepan at the upcoming YAPC 2018 in Glasgow. As a result, I have been completely refactoring B::DeparseTree and have a number of thoughts on it and B::Deparse.

Here are some of them. This first part focuses more about B::Deparse and how it could be (or in a sense is) being rewritten. The second part if I get around to writing it will be about the cool features of B::DeparseTree from an application.

Introduction

As someone who make a lot of mistakes, I’ve long wanted to improve the precision of debugging and error reporting. Through Perlmonks, I was directed upon the idea of using the OP address as means to get more detailed information of where the program is. I was also pointed to B::Deparse.

At the time, I was elated. Perl5 uses a tree interpreter to run your Perl script. So what a decompiler does here to recreate the source code, and in in particular what B::Deparse does, is walk the B::OP tree, building up strings as it traverses. If you start from the root of the tree, you get everything. If instead you start from some other node, you get the text that this portion is responsible for. So I thought all I need to do is augment B::Deparse a little bit. True, but…

Deparse problems

The biggest issue with B::Deparse is it is one 6.5K line file. There are three test files, of which two of them are significant. The sheer number of tests are huge too - hundreds of tests in one, and a couple thousand in the other. The reason for the large number of tests in the latter test is a result of Perl having so many builtin functions - over 100 of them. Each of these is an opnode; and each function can be tried several times in different ways.

That there is such extensive testing is great, but given Perl testing’s default behavior of running all tests unconditionally, the slightest error meant that I’d see a spew of hundreds to thousands of lines or errors. (Does prove or Test::More have simple way to stop on the first failure?)

So it was important to get tests under control starting with some quick little unit tests, and bailing if those fail. Unit tests and B::Deparse?

When a program is big and monolithic, it is often not modular. When it is not modular, it’s hard to test it. That is most likely why both tests for B::Deparse that require a lot of setup.

The tests are also frail in the face of the formatting changes to B::Deparse. and can cause many if not all of the hundreds of tests to fail.

How we do better? I’ve mentioned unit tests. Later I’ll describe something else that eliminates the frailness of testing against a particular kind of formatting.

Testing

When B::Deparse fails, it’s can take a little bit of study to understand what it is talking about:

Here is an example:

Failed test 'Constants in a block' at t/20-deparse-small.t line 114. { '??'; 2 } }' doesn't match '(?^:^\s*\{\s*\s+\{\s+\'\?\?\?\'\;\s+2\s+)'

Is this clear to you what’s wrong? Line 114 is the line number of code that is reading data. What you really want to know is the line number of the data that it is reading. The best it does is give ‘Constants in a block’ so you can search for that.

I have split out the test data from the code so that I can reuse the code for different sets of data. For example I can run data common to all Perl versions then data specific a particular version of Perl, or data that covers some smaller but related set of issues, like those around subroutine and method calling.

The way I currently address the above is to show in addition to the above message, a diff of two compared texts.

Visually, it is easily to ignore differences in spacing to see what’s wrong.

Furthermore, when I hit an error the test program that failed is written out to disk. This way that test can be run in isolation to the other tests.

What is further needed though is to start grouping simple tests that trigger the use of a B::Deparse function and to split that those tests into a separate data file. For example there might be a test data file of tests for the method binop which handles for binary operators, another test data file for listop, which handles the list-like operators and so on.

Round trip tests

I have ameliorated somewhat of the difficulty in figuring out what went wrong when a test fails by improving the error message and writing out the test. However, there still is the problem that the tests are frail, even though a regular expression is used in comparison. We have this conundrum: if you want something that a person can easily detect the difference between the texts, you would compare using a string or a very simple regular expressions. But as you move to making tests less fragile and less subject to the whim of how B::Deparse format Perls, you move onto more complicated regular expressions. But this is harder to suss when there is a difference. So how do we do better?

We can avoid all of the frailty associated comparisons or pattern matching by doing round trip testing. In the Perl source code distribution there already are a number of Perl programs that check themselves when run. These are in Perl’s t directory. Some files in that are t/base/cond.t, or t/base.if.t.

So all that is needed is have B::Deparse compile and decompile these programs (the somewhat magical invocation is perl -MO=Deparse,sC ...), and write the decompiled result to file. When we can then run Perl on the decompiled code and Perl will indicate whether the result is invalid. Edsgar Dijkstra’s quip about a test not proving correctness, but only demonstrating failure applies here. If the code doesn’t fail that doesn’t mean that it was deparsed correctly. Just that the running the program couldn’t find an error.

But when there is an error, Perl’s error message is often helpful in suggesting what decompiled incorrectly, especially when the error is a problem in Perl syntax. Even when the error is not a syntax error, the the decompilation error and the run-time error are generally close.

By inspecting the resulting file, I can usually see what’s wrong. And I have original source to compare against if the problem was not apparent.

The need for modularity

Tracking changes across Perl Versions

Currently B::Deparse is bundled with Perl and that means that the code that comes with Perl only needs to be concerned with that version.

Although it is true that you’ll find tests on the Perl versions like this:

$self->{'hints'} &= 0xFF if $] < 5.009;

in reality the code generally cannot be used by from another major Perl version release.

Here are some of the error messages I got when I tried to use the Perl 5.26.2 version of B::Deparse from Perl 5.24.4:

"OPpSPLIT_ASSIGN" is not exported by the B module
"OPpSPLIT_LEX" is not exported by the B module

So how would you understand how the OP tree has changed between in 5.24.4 and 5.26.2 that requires changes in the way B::Deparse works? Well, you could use either a file or git diff between two files in each of the released versions. And/or you could use that with git commits or ChangeLog entries. Depending on what changed, this might not be so bad, but generally I find it tedious.

What you want is to do is separate each change into one of three categories:

  1. Bug fixes in the newer version would also be beneficial in the older version
  2. Nonfunctional, but stylistic changes
  3. Changes that reflect real changes between versions and so the two sets of code need be kept separately

If the program were reorganized and modular, changes between versions and why would be more apparent. As an example of how this might look in Perl, I’ll show how I’ve done this in Python. Suppose you want to now which opcodes are different between Python 3.4 and 3.5. Well, look at this.

And how does that effect decompilation? That’s a harder question to answer simply, but that too to some degree of success has been isolated in code. For the semantic tree interpretation routines changes between versions are isolated here.

The how-to-extend problem?

I wanted to extend B::Deparse so I can use it at run time to determine my position. So how do I do this?

Given that B::Deparse isn’t all modular and is a single file, I started out in the most obvious way by copying the file and modifying it. Given my lack of understanding of how B::Deparse worked, this was extremely expedient and probably unavoidable. However very quickly I realized that it just doesn’t scale, and that I’d have to modularize the code. have spent a good deal time trying to refactor the code at least in the context of my new feature.

I’m close to having this finished. This code could be used as the basis for a rewritten B::Deparse.

Next are some cool features of the new code that could be used in B::Deparse.

Table-driven opnodes

In trying to use and modularize this code, I see there is a lot of repetition in subroutine parsing routines.

Compare this:

sub pp_die { listop(@_, "die") }
sub pp_each { unop(@_, "each") }
sub pp_egrent { baseop(@_, "endgrent") }

with

'die'        => 'listop',
'each'       => 'unop',
'egrent'     => ['baseop', 'endgrent'],

Was it obvious to you when looking at the subroutine call, that the name “egrent” got converted to “endgrent” which didn’t happen in other shown entries?

Having mappings from PP opnode to function makes it easier to customize the entries as required as the Perl version varies. In some versions, some ops we need to surround an the text from an operation with a “my”, while in other versions that is not the case. It is clearer and simpler just to change table entries than it is to muck with OO lookup or doing subroutine assignments.

Format-spec driven fragment creation

How B::Deparse works is conceptually simple: it walks the optree building and combining string fragments at node from the nodes’ children. When you return from root or top node after walking the tree, you have a string representing the entire function or module that the root represents.

However understanding what goes on inside a given node, is very haphazard. So if you have a bug somewhere, figuring out where there was a problem and why is difficult.

This kind of thing you may have already encountered in a different guise, and a good solution to that problem applies here. Imagine you are trying to create a nicely formatted report with lots of data values. You can try combining strings together interspersed with calls to conversion routines to different data items.

Instead many developers use some sort of format specifiers in a template. That’s a good solution here.

Swapping out Node and Syntax tree for String-oriented routines.

The focus of code right now has been for handling this new fragment deparsing feature. However I believe B::DeparseTree would be a good replacement for B::Deparse because of its modularity, and ease with which you can see what’s going on. The template specifiers assists separating traversing code from building string, however it should be noted that B::Deparse could have done this too, simply by using sprintf more often.

Exact Perl location with B::DeparseTree (and Devel::Callsite)

Recently I have been working on this cool idea: using B::Deparse to help me figure out exactly where a program is stopped. This can be used in a backtrace such as when a program crashes from Carp::Confess or in a debugger like Devel::Trepan.

To motivate the idea a little bit, suppose my program has either of these lines:

$x = $a/$b + $c/$d;
($y, $z) = ($e/$f, $g/$h);

I might want to know which division in the line is giving me an illegal division by zero.

Or suppose you see are stopped in a Perl statement like this:

my @x = grep {$_ =~ /^M/} @list;

where exactly are you stopped? And would the places you are stopped at be different if this were written:

my @x = grep /^M/, @list;

? (The answer is yes.)

A while back with the help of perlmonks, the idea of using the OP address was the only promising avenue. More recently, I re-discovered B::Deparse and realized it might be able to do the rest: give the context around a specific op-code location. Devel::Callsite can be used to get your current op-code address.

B::Deparse is one of those things like the venerable perl debugger:

  • It is a brute-force effort with a long history,
  • many people have contributed to it,
  • it is one huge file.

It has been said that nothing can parse Perl other than Perl. Well, nothing can de-parse Perl's OP's other than B::Deparse. It understands the Perl interpreter and its intricacies very well.

But the most important feature I need is that B::Deparse has a way of doing its magic inside a running program. You can give it a subroutine reference at runtime and it will deparse that.

A useful side benefit in B::Deparse's output is that it will split up multi-statement lines into one line per statement.

So:

  $x = 1; $y *= 2; $z = $x + $y;

will appear as:

  $x = 1;
  $y *= 2;
  $z = $x + $y;

All good so far. The first piece of bad news is that it doesn't show the OP addresses. But that is pretty easily remedied.

Initially I figured I'd handle this the way I did when I wanted to show fragments of disassembly code colorized using B::Concise: I'd just dump everything to a buffer internally and then run some sort of text filtering process to get the part I wanted.

So I monkey-patched and extended B::Deparse so I could search for an op address and it would return the closest COP, and I show that statement. This was released in version 0.70 of Devel::Trepan.

This is a hack though. It isn't really what I wanted. While showing just the addresses at COP or statement boundaries helps out with multiple statements per line, it isn't all that helpful otherwise. In the first example with dividing by zero or an inside a parallel assignment, there would just be to COP addresses and that's really no better than giving a line number. I need to add information about sub parts inside a statement.

So the next idea was to extend B::Deparse to store a hash of addresses (a number) to B:OPs. Better. But not good enough. I still would need to do the part that B::Deparse does best: deparsing.

Also, I want to have a way to easily go up the OP tree to get larger and larger context. For example, suppose the code is:

 $x = shift; $y = shift;

and I report you are stopped at "shift". I would probably want to say: Give me the full statement that the "shift" is part of. This means in the OP tree I would want the parent. Although there is a way to compile Perl storing parent pointers, Perl generally isn't built that way. Given an OP address, I'm not sure how we could easily find its parent other than starting from the top and traversing.

So my current tack is sort of an abstract OP tree which stores text fragments for that node in the tree. As it walks the tree top down it saves parent pointers to the nodes it creates.

You may ask, what's the difference between this and the OP tree other than the parent pointer?

Well, recall that B::Deparse has already abstracted the OP codes, from a lower level form into higher level constructs. This is true more so as we move up the first couple levels of the tree. The Perl output is generic and dumb, but still it is slightly at at higher level than the sequence of OP instructions.

Saving more of the tree structure can improve deparsing itself.

Right now B::Deparse walks the tree and builds Perl code expressions and statements bottom up. The main thing passed down right now is operator precedence to reduce the extraneous parentheses. At level in the OP tree, the only information from the children passed up is the result string.

In my B::DeparseTree, in addition to the text fragments, I keep child information in a more structured way, and a parent pointer is saved and is available during processing. The parent pointer is useful in showing larger context described below. Each node notes whether parenthesis were needed when combined at the next level, so that they can be omitted when displaying the fragment starting at that node.

I close with some observations in using this. My first test was with fibonacci:

sub fib($) {
   my $x = shift;
   return 1 if $x <= 1;
   return fib($x-1) + fib($x-2);
  }

If you deparse stopped in a debugger in the line with my $x = shift, you get:

 shift()  # which is inside..
 my $x = shift()

So far so good. Stepping to the next stopping point inside the line with return 1 if $x <= 1 you get:

 $x # which is inside...
 $x <= 1

Still good. Things start get interesting when I do another step into return fib($x-1) + fib($x-2); Deparsing, as I originally had it, did not find anything. Here's why:

-- main::(example/fib.pl:11 @0x221dce8)
return(fib($x-1) + fib($x-2))
(trepanpl): deparse
# Nothing
(trepanpl): disasm -terse
Subroutine main::fib
-------------------
 main::fib:
UNOP (0x221dc40) leavesub [1]
    LISTOP (0x21f9608) lineseq
#9:     my $x = shift;
        COP (0x21f9650) dbstate
        BINOP (0x21f96b0) sassign
            OP (0x21f96f8) shift
            OP (0x21f9730) padsv [1]
#10:     return 1 if $x <= 1;
        COP (0x2227e98) dbstate
        UNOP (0x2227ef8) null
            LOGOP (0x2227f38) and
                BINOP (0x2227f80) le
                    OP (0x2228008) padsv [1]
                    SVOP (0x2227fc8) const  IV (0x4d25160) 1
                LISTOP (0x2228040) return
                    OP (0x21f9590) pushmark
                    SVOP (0x21f95c8) const  IV (0x4d25238) 1
 #11:     return(fib($x-1) + fib($x-2))
        COP (0x221dc88) dbstate
        LISTOP (0x221dd20) return
 =>              OP (0x221dce8) pushmark
            BINOP (0x221dd68) add [6]
                UNOP (0x221dfb8) entersub [3]
                    UNOP (0x2227d00) null [149]
                        OP (0x2227cb0) pushmark
                        BINOP (0x2227d48) subtract [2]
                            OP (0x2227e10) padsv [1]
                            SVOP (0x2227d90) const  IV (0x4d24f38) 1
                        UNOP (0x2227dd0) null [17]
                            SVOP (0x2227e50) gv  GV (0x4d03b28) *fib
                UNOP (0x221ddb0) entersub [5]
                    UNOP (0x221de28) null [149]
                        OP (0x221ddf0) pushmark
                        BINOP (0x221de70) subtract [4]
                            OP (0x221df38) padsv [1]
                            SVOP (0x221deb8) const  IV (0x4d24e30) 2
                        UNOP (0x221def8) null [17]
                            SVOP (0x221df78) gv  GV (0x4d03b28) *fib

The next instruction to be executed is a pushmark, and B::Deparse skips that when it procesess the LISTOP. My remedy here was to note in the structure other ops underneath that are "skipped" or subsumed in the parent operation.

After fixing this the output is:

return (fib($x - 1) + fib($x - 2)) # part of...
sub fib($) {
   # line 9 'example/fib.pl'
   # ... rest of fib code

Stepping recursively into fib you get the last weirdness I encountered. Here is Devel::Trepan output so I can describe the situation better:

trepan.pl example/fib.pl
-- main::(example/fib.pl:14 @0x21798a8)
printf "fib(2)= %d, fib(3) = %d, fib(4) = %d\n", fib(2), fib(3), fib(4);
set auto eval is on.
(trepanpl): b 9   # first statement in fib
Breakpoint 1 set in example/fib.pl at line 9
(trepanpl): continue
xx main::(example/fib.pl:9 @0x217d268)
  my $x = shift;
(trepanpl): continue  # first recursive call
xx main::(example/fib.pl:9 @0x217d268)
   my $x = shift;
(trepanpl): up
--> #1 0x221ddf0 $ = main::fib(2) in file `example/fib.pl' at line 11
 main::(example/fib.pl:11 @0x221ddf0)
 return(fib($x-1) + fib($x-2))
(trepanpl): deparse
fib($x - 2) # part of...
fib($x - 1) + fib($x - 2)
(trepanpl):

I'm in fib($x-2)? No, I'm in the middle of evaluating fib($x-1)! What's going on?

The stopping location is really the point where I would continue. (It is the "pushmark" at address 0x221ddf0 in the listing above; this is just before subtacting 2.) So fib($x-2) what I would next execute after returning. To reinforce this, when I step an invocation from fib($x-2) and do the same thing, I now see:

fib($x - 1) + fib($x - 2) # part of
return (fib($x - 1) + fib($x - 2))

Which is saying I am stopped before the final addition, just before the final return. A possible fix is to step back OPs to find the call. I dunno. What do you all think?

In sum, this is all pretty powerful stuff. It's also a lot of work.

A Basic Challenge

Recently, I came across this project which turns C++ into BASIC. How close can Perl do?

I recall reading somewhere that Perl has the ability to vastly alter its syntax. Is Perl going to be bested by C++?

Introspection in Devel::Trepan

Here are some introspection routines in Devel::Trepan. I’m not aware that these exist in other debuggers, nor as Devel::REPL plugins. But if I’m wrong feel free to correct me in comments. And feel free to take code from Devel::Trepan to rework elsewhere.

Recently Jeffrey Ryan Thalhammer asked about variable, and subroutine completion and this got me thinking.

Info functions and Info packages

When he asked, there was a debugger command info functions which listed functions matching some regular expression. (There is also a gdb command of the same name.) That command accepted both fully-qualified and unqualified function names.

It occurred to me that I could also add an adjunct to that command, info packages. This command takes the data used in info functions, but it reindexes the fully-qualified function names keyed by package.

So here are some examples:

 (trepanpl): info packages Tie::   
 Tie::ExtraHash    Tie::Hash    Tie::StdHash

 (trepanpl): info packages -s -f Tie::Hash
 Tie::Hash
   BEGIN    CLEAR    EXISTS    TIEHASH    new 
 Tie::Hash is in file /usr/share/perl/5.14/Tie/Hash.pm

 (trepanpl): info functions Tie::Hash::new
 /usr/share/perl/5.14/Tie/Hash.pm
     Tie::Hash::new is at 8-11
 (trepanpl): list Tie::Hash::new  
 /usr/share/perl/5.14.2/Tie/Hash.pm [4-13]
 4      
 5      use Carp;
 6      use warnings::register;
 7      
 8      sub new {
 9          my $pkg = shift;
10          $pkg->TIEHASH(@_);
11      }
12      
13      # Grandfather "new"

For lexical variables, there is also info variables lexicals.

Completion and Complete

Well, now that that’s there, what about subroutine, package and filename completion on those above commands? That’s in there too. But wait, there’s more! Being something of a geek, I created internal routines for those types of completions. There are also specialized completion routines for my and our variables.

But to give you access to those from the debugger command line, I extended the debugger command complete. This is also a gdb command, although those specific options are not in gdb’s version. That format of giving options, though, is how gdb would do something like this.

The intent in adding that to the debugger command was not just for interactive use, but to provide something that more sophisticated front-ends could use. Or at least show the way.

In fact, the out-of-process client, trepan.pl --client does use this debugger “complete” command under the covers in its readline completion.

So although I don’t have a complete completion package when you are typing an expression, a number of the underlying components are there should someone want to take this and extend it.

Deparse

In Perl Tricks, I read about B::Deparse. I found that so interesting and useful that, starting with release 0.57, it has added to Devel::Trepan as debugger command deparse. And we can even use Perl::Syntax::Highlight to colorize the output.

Disassembly

At the opposite end of the spectrum, there is a Devel::Trepan plugin to disassemble Perl code. Right now disassembly is at the file or subroutine level. Since I have the actual stopped OP address via Devel::Callsite, it would be nice to be able to allow a narrower range. However I haven’t figured out exactly how to do that even though I have some hints. See Getting a B::OP from an address?

And finally on this topic of low-level information, I should mention David Golden’s suggestion for hooking into Devel::Peek. For more information and a workaround, see this issue.

Conclusion

Here, I introduced you into some of the introspection aspects of Devel::Trepan. By the way, all of pod documentation for debugger commands given above, can also be gotten inside the debugger itself, with its help command.

In a sense there’s nothing here that really isn’t in Perl itself. One can think of the debugger commands and internal routines, merely as a wrapper around existing modules and existing Perl features. (That’s where all of the real heavy lifting is done.) Given that, there is no reason why this couldn’t be added to other debuggers, or REPLs, if it isn’t there already.

Go forth and multiply! There is more than one way to do it.

RFC: Term::ReadKey Availability and requiring it in Term::ReadLine::Perl5

A while back I forked Term::ReadLine::Perl making Term::ReadLine::Perl5 because of the former maintainer's lack of responsiveness regrading my patch to add GNU Readline history and general lack of responsiveness overall.

Term::ReadLine::Perl purports to be a "pure Perl" version of GNU ReadLine. It can use, but does not require, Term::ReadKey. With this issue it seems that more hacking is needed when Term::ReadKey is not available.

Right now Term::ReadKey is recomme…