Binary parsing with Marpa: .class files

Introduction

Introduction

It sounds quite natural that any parser should (must) not care if the input is truely a character or not: the important thing is to get input parsed, whatever its meaning. Therefore, an interesting exercise was to get in touch with that using the Marpa::R2 parser.

If the binary data structure is well documented (we have to trust a technical specification at some stage -;), we expect to always be able to write a working BNF from it. A extremely wide-spreaded language such as Java is a very good candidate, and indeed its document on the .class file format appears to be excellent.

So let's talk a bit about MarpaX-Java-ClassFile.

.class file parsing

.class binary gems

Instead of writing a single big grammar, every individual "unit" of a class file has been translated into a dedicated parse package (MarpaX::Java::ClassFile::BNF::x), and the parsing value of the later goes into a dedicated value package (MarpaX::Java::ClassFile::Struct::x). Pro is high modularity, where every unit is independant, con is the load-time of all modules at the perl's BEGIN phase -;

The input is assumed to be a Bytes stream, we write easily the "elementary" units, Marpa call them "lexemes", using this quote from the specification:

  • A class file consists of a stream of 8-bit bytes. All 16-bit, 32-bit, and 64-bit quantities are constructed by reading in two, four, and eight consecutive 8-bit bytes, respectively. Multibyte data items are always stored in big-endian order, where the high bytes come first.

This is translated into this BNF (there is more than way to write it -;), comments should be self-explanatory:

    #
    # latm is a sane default that can be common to all grammars: it means
    # take the longest acceptable match
    #
    lexeme default = latm => 1

    _U1      ~ [\s\S]   # Match any character ahem byte (binmode)
    _U2      ~ _U1 _U1  # twice
    _U4      ~ _U2 _U2  # twice again

    U1      ~ _U1       # One byte
    U2      ~ _U2       # Two bytes
    U4      ~ _U4       # Four bytes

A grammar using these lexemes would do:

    u1       ::= U1     action => u1        # unsigned - see below
    signedU1 ::= U1     action => signedU1  # signed! see below
    u2       ::= U2     action => u2        # unsigned - see below
    ./.. and so on

The endianness order determine the meaning associated to a sequence of bytes. Therefore it belongs to the parse tree value: this is the return value from the functions mentionned just after the 'action =>' keyword, listed thereafter. The java specification does not tell at the very beginning if values are signed or not, this is the reason of these "signedUx" functions below:

    #
    # Actions are always called like this: action($self, @rhs)
    # where $self is a FREE CONTEXT. Not necessarily a blessed
    # object.
    #
    sub u1       { unpack('C', $_[1]) }
    sub signedU1 { unpack('c', $_[1]) }
    sub u2       { unpack('n', $_[1]) }
    ./.. and so on

Note: quadratic thingies are done using Bit::Vector for portability reasons.

Parsing of the .class file

Let's take the global structure definition:

 ClassFile {
    u4             magic;
    u2             minor_version;
    u2             major_version;
    u2             constant_pool_count;
    cp_info        constant_pool[constant_pool_count-1];
    u2             access_flags;
    u2             this_class;
    u2             super_class;
    u2             interfaces_count;
    u2             interfaces[interfaces_count];
    u2             fields_count;
    field_info     fields[fields_count];
    u2             methods_count;
    method_info    methods[methods_count];
    u2             attributes_count;
    attribute_info attributes[attributes_count];
 }

We see it consist of the "gems" u1, u2 and composite sub-structures cp_info, field_info, method_info and attribute_info. We said that every "unit" is writen with its specific grammar, and all of them are independant. This mean that these sub-structures become opaque things to our grammar. This is why an OPAQUE lexeme is added to the gems, i.e.:

    _MANAGED ~ [^\s\S]       # Match NOTHING
    MANAGED ~ _MANAGED       # This is entirely in the responsibility of the caller

Why match nothing ? Because a generic grammar must not know anything it. Just that it exist; and we simply do not want Marpa to automatically match something that would belong to an opaque thingy. In conclusion, if the layer on top of the grammar is not able to tell in advance to Marpa what it is, Marpa will naturally croak because nothing can be matched.

What about the BNF of the ClassFile, then?

  event 'constant_pool_count$' = completed constant_pool_count
  event 'interfaces_count$'    = completed interfaces_count
  event 'fields_count$'        = completed fields_count
  event 'methods_count$'       = completed methods_count
  event 'attributes_count$'    = completed attributes_count
  ClassFile ::=
             magic
             minor_version
             major_version
             constant_pool_count
             constant_pool
             access_flags
             this_class
             super_class
             interfaces_count
             interfaces
             fields_count
             fields
             methods_count
             methods
             attributes_count
             attributes
  action => _ClassFile

  magic                ::= U4          action => u4
  minor_version        ::= U2          action => u2
  major_version        ::= U2          action => u2
  constant_pool_count  ::= U2          action => u2
  constant_pool        ::= MANAGED     action => ::first
  access_flags         ::= U2          action => u2
  this_class           ::= U2          action => u2
  super_class          ::= U2          action => u2
  interfaces_count     ::= U2          action => u2
  interfaces           ::= MANAGED     action => ::first
  fields_count         ::= U2          action => u2
  fields               ::= MANAGED     action => ::first
  methods_count        ::= U2          action => u2
  methods              ::= MANAGED     action => ::first
  attributes_count     ::= U2          action => u2
  attributes           ::= MANAGED     action => ::first

As you can see, it matches exactly the structure definition, and everything that cannot be described only in terms of U1, U2 or U4 is systematically translated to something opaque. The ::first action is just a handy Marpa shortcut, strictly equivalent to

  sub __first { $_[1] }

though presumably faster because built-in-;

Actions u2 and u4 just mean: this is an unsigned value.

Action _ClassFile is in the userspace of the module, and is nothing else but taking all the symbol values on the right-hand side of the ClassFile symbol:

 sub _ClassFile {
  my ($self,
      $magic,
      $minor_version,
      $major_version,
      $constant_pool_count,
      $constant_pool,
      $access_flags,
      $this_class,
      $super_class,
      $interfaces_count,
      $interfaces,
      $fields_count,
      $fields,
      $methods_count,
      $methods,
      $attributes_count,
      $attributes) = @_;

  MarpaX::Java::ClassFile::Struct::ClassFile->new(
    magic               => $magic,
    minor_version       => $minor_version,
    major_version       => $major_version,
    constant_pool_count => $constant_pool_count,
    constant_pool       => $constant_pool_count,
    access_flags        => $access_flags,
    this_class          => $this_class,
    super_class         => $super_class,
    interfaces_count    => $interfaces_count,
    interfaces          => $interfaces,
    fields_count        => $fields_count,
    fields              => $fields,
    methods_count       => $methods_count,
    methods             => $methods,
    attributes_count    => $attributes_count,
    attributes          => $attributes
  )
 }

et voila! this is already (almost -;) done! Remains the opaque thingies... You have seen of course these 'event' keywords ? For example:

  event 'interfaces_count$' = completed interfaces_count

and it really means what is says: when the symbol interfaces_count is completed, Marpa will generated an event that is named interfaces_count$. Up to the caller to do what is necessary. The trick is then simply to call the corresponding sub-grammar, and inject its return value by hand under the OPAQUE lexeme, that Marpa by itself will never match:

  "Marpa Recognizer Instance"->lexeme_read('OPAQUE',
                                           $start, $length,
                                           $theOpaqueParseTreeValue);

Application: javapp command-line

Usually it is appreciated that a command-line exist, just to show that it is not only a proof of concept. Therefore, a javapp command-line tool is provided. This is nothing else but a perl's version of any SDK's javap distribution. Its output is not aimed to be compatible with it, and it is printing the whole stringified version of the Marpa parse tree value of a .class file parsing.

Stringification itself was another enjoying thingy in this package, and is accomplished entirely with a highly recursive "" operator overloading, but that is another story -;

For example, just do

    javapp /location/of/a/classFile.class

or...

    javapp /location/of/a/jarFile.jar

or... let is search for both

    javapp /location/of/

Conclusion

it has already been said, but truely I mean it: Marpa::R2 is great, we invite anybody interesting in trying it. There is a Marpa google group and an IRC channel.

Leave a comment

About Jean-Damien Durand

user-pic About::Me::And::Perl