Binary parsing with Marpa: .class files
Introduction
It sounds quite natural that any parser should (must) not care if the input is truely a character or not: the important thing is to get input parsed, whatever its meaning. Therefore, an interesting exercise was to get in touch with that using the Marpa::R2 parser.
If the binary data structure is well documented (we have to trust a technical specification at some stage -;), we expect to always be able to write a working BNF from it. A extremely wide-spreaded language such as Java is a very good candidate, and indeed its document on the .class file format appears to be excellent.
So let's talk a bit about MarpaX-Java-ClassFile.
.class file parsing
.class binary gems
Instead of writing a single big grammar, every individual "unit" of a class file has been translated into a dedicated parse package (MarpaX::Java::ClassFile::BNF::x
), and the parsing value of the later goes into a dedicated value package (MarpaX::Java::ClassFile::Struct::x
). Pro is high modularity, where every unit is independant, con is the load-time of all modules at the perl's BEGIN phase -;
The input is assumed to be a Bytes stream, we write easily the "elementary" units, Marpa call them "lexemes", using this quote from the specification:
- A class file consists of a stream of 8-bit bytes. All 16-bit, 32-bit, and 64-bit quantities are constructed by reading in two, four, and eight consecutive 8-bit bytes, respectively. Multibyte data items are always stored in big-endian order, where the high bytes come first.
This is translated into this BNF (there is more than way to write it -;), comments should be self-explanatory:
# # latm is a sane default that can be common to all grammars: it means # take the longest acceptable match # lexeme default = latm => 1 _U1 ~ [\s\S] # Match any character ahem byte (binmode) _U2 ~ _U1 _U1 # twice _U4 ~ _U2 _U2 # twice again U1 ~ _U1 # One byte U2 ~ _U2 # Two bytes U4 ~ _U4 # Four bytes
A grammar using these lexemes would do:
u1 ::= U1 action => u1 # unsigned - see below signedU1 ::= U1 action => signedU1 # signed! see below u2 ::= U2 action => u2 # unsigned - see below ./.. and so on
The endianness order determine the meaning associated to a sequence of bytes. Therefore it belongs to the parse tree value: this is the return value from the functions mentionned just after the 'action =>' keyword, listed thereafter. The java specification does not tell at the very beginning if values are signed or not, this is the reason of these "signedUx" functions below:
# # Actions are always called like this: action($self, @rhs) # where $self is a FREE CONTEXT. Not necessarily a blessed # object. # sub u1 { unpack('C', $_[1]) } sub signedU1 { unpack('c', $_[1]) } sub u2 { unpack('n', $_[1]) } ./.. and so on
Note: quadratic thingies are done using Bit::Vector for portability reasons.
Parsing of the .class file
Let's take the global structure definition:
ClassFile { u4 magic; u2 minor_version; u2 major_version; u2 constant_pool_count; cp_info constant_pool[constant_pool_count-1]; u2 access_flags; u2 this_class; u2 super_class; u2 interfaces_count; u2 interfaces[interfaces_count]; u2 fields_count; field_info fields[fields_count]; u2 methods_count; method_info methods[methods_count]; u2 attributes_count; attribute_info attributes[attributes_count]; }
We see it consist of the "gems" u1
, u2
and composite sub-structures cp_info
, field_info
, method_info
and attribute_info
. We said that every "unit" is writen with its specific grammar, and all of them are independant. This mean that these sub-structures become opaque things to our grammar. This is why an OPAQUE
lexeme is added to the gems, i.e.:
_MANAGED ~ [^\s\S] # Match NOTHING MANAGED ~ _MANAGED # This is entirely in the responsibility of the caller
Why match nothing ? Because a generic grammar must not know anything it. Just that it exist; and we simply do not want Marpa to automatically match something that would belong to an opaque thingy. In conclusion, if the layer on top of the grammar is not able to tell in advance to Marpa what it is, Marpa will naturally croak because nothing can be matched.
What about the BNF of the ClassFile, then?
event 'constant_pool_count$' = completed constant_pool_count event 'interfaces_count$' = completed interfaces_count event 'fields_count$' = completed fields_count event 'methods_count$' = completed methods_count event 'attributes_count$' = completed attributes_count ClassFile ::= magic minor_version major_version constant_pool_count constant_pool access_flags this_class super_class interfaces_count interfaces fields_count fields methods_count methods attributes_count attributes action => _ClassFile magic ::= U4 action => u4 minor_version ::= U2 action => u2 major_version ::= U2 action => u2 constant_pool_count ::= U2 action => u2 constant_pool ::= MANAGED action => ::first access_flags ::= U2 action => u2 this_class ::= U2 action => u2 super_class ::= U2 action => u2 interfaces_count ::= U2 action => u2 interfaces ::= MANAGED action => ::first fields_count ::= U2 action => u2 fields ::= MANAGED action => ::first methods_count ::= U2 action => u2 methods ::= MANAGED action => ::first attributes_count ::= U2 action => u2 attributes ::= MANAGED action => ::first
As you can see, it matches exactly the structure definition, and everything that cannot be described only in terms of U1
, U2
or U4
is systematically translated to something opaque. The ::first
action is just a handy Marpa shortcut, strictly equivalent to
sub __first { $_[1] }
though presumably faster because built-in-;
Actions u2
and u4
just mean: this is an unsigned value.
Action _ClassFile is in the userspace of the module, and is nothing else but taking all the symbol values on the right-hand side of the ClassFile symbol:
sub _ClassFile { my ($self, $magic, $minor_version, $major_version, $constant_pool_count, $constant_pool, $access_flags, $this_class, $super_class, $interfaces_count, $interfaces, $fields_count, $fields, $methods_count, $methods, $attributes_count, $attributes) = @_; MarpaX::Java::ClassFile::Struct::ClassFile->new( magic => $magic, minor_version => $minor_version, major_version => $major_version, constant_pool_count => $constant_pool_count, constant_pool => $constant_pool_count, access_flags => $access_flags, this_class => $this_class, super_class => $super_class, interfaces_count => $interfaces_count, interfaces => $interfaces, fields_count => $fields_count, fields => $fields, methods_count => $methods_count, methods => $methods, attributes_count => $attributes_count, attributes => $attributes ) }
et voila! this is already (almost -;) done! Remains the opaque thingies... You have seen of course these 'event' keywords ? For example:
event 'interfaces_count$' = completed interfaces_count
and it really means what is says: when the symbol interfaces_count
is completed, Marpa will generated an event that is named interfaces_count$
. Up to the caller to do what is necessary. The trick is then simply to call the corresponding sub-grammar, and inject its return value by hand under the OPAQUE
lexeme, that Marpa by itself will never match:
"Marpa Recognizer Instance"->lexeme_read('OPAQUE', $start, $length, $theOpaqueParseTreeValue);
Application: javapp command-line
Usually it is appreciated that a command-line exist, just to show that it is not only a proof of concept. Therefore, a javapp command-line tool is provided. This is nothing else but a perl's version of any SDK's javap distribution. Its output is not aimed to be compatible with it, and it is printing the whole stringified version of the Marpa parse tree value of a .class file parsing.
Stringification itself was another enjoying thingy in this package, and is accomplished entirely with a highly recursive ""
operator overloading, but that is another story -;
For example, just do
javapp /location/of/a/classFile.class
or...
javapp /location/of/a/jarFile.jar
or... let is search for both
javapp /location/of/
Conclusion
it has already been said, but truely I mean it: Marpa::R2 is great, we invite anybody interesting in trying it. There is a Marpa google group and an IRC channel.
Leave a comment