A Marpa-powered C parser

Jean-Damien Durand has just released MarpaX::Languages::C::AST, which parses C language into an abstract syntax tree (AST). MarpaX::Languages::C::AST has been tested against Perl's C source code, as well as Marpa's own C source.

[ This is cross-posted from its home on the Ocean of Awareness blog. ]

Because it is Marpa-powered, MarpaX::Languages::C::AST works differently from other C parsers. In the past, C parsers have been syntax-driven -- parsing was based on a BNF description of the C grammar. More recently, C parsers have used hand-written recursive descent -- they have been procedurally-driven.

MarpaX::Languages::C::AST uses both approaches. Marpa has the advantage that it makes full knowledge of the state of the parse available to the programmer, so that procedural logic and syntax-driven parsing can reinforce each other. The result is a combined lexer/parser which is very compact and easy to understand. Among the potential applications:

  • Customized "lints". You can write programs to enforce C language standards and restrictions specific to an individual, a company or a project.
  • C interpreters. By taking the AST and adding your own back end, you can create a special-purpose C interpreter or a special-purpose compiler.
  • C variants. Because the code for the parser is compact and easy to modify, it lends itself to language extension and experimentation. For example, you could reasonably implement compilers to try out the proposals submitted to a standards committee.
  • C supersets. Would you like to see some of the syntax from a favorite language available in C? Here's your chance.

The implementation

A few of Jean-Damien's implementation choices are worth noting. A C parser can take one of two strategies: approximate or precise. A compiler has, of course, to be precise. Tools, such as cross-referencers, often decide to be approximate, or sloppy. Sloppiness is easier to implement and has other advantages: A sloppy tool can tolerate missing C flags: what the C flags should be can be one of the things it guesses at.

Of the two strategies, Jean-Damien decided to go with "precise". MarpaX::Languages::C::AST follows the C11 standard, with either GCC or Microsoft extensions. This has the advantage that MarpaX::Languages::C::AST could be used as the front end of a compiler.

Because MarpaX::Languages::C::AST purpose is to take things as far as an AST, and let the user take over, it does not implement those constraints usually implemented in post-processing. One example of a post-syntactic constraint is the one that bans "case" labels outside of switch statements. Perhaps a future version can include a default "first phase" post-processor to enforce the constraints from the standards. As currently implemented, the user can check for and enforce these constraints in any way he likes. This makes it easier for extensions and customizations, which I think of as the major purpose of MarpaX::Languages::C::AST.

The parsing strategy

Those familar with the C parsing and its special issues may be interested in Jean-Damien's approach to them. MarpaX::Languages::C::AST is, with a few exceptions, syntax-driven -- the parser works from Marpa's SLIF, an extended BNF variant. The SLIF-driven logic is sufficient to deal with the if-then-else issue. Marpa handles right recursion in linear time, so that the if-then-else issue could have been dealt with by rewriting the relevant rules. But Jean-Damien wanted to have his BNF follow closely the grammar in the standards, and he decided to use Marpa's rule ranking facility instead.

More complicated is the ambiguity in C between variable names and types, which actually takes C beyond BNF and context-free grammars into context-sensitive territory. Most C parsers deal with this using lexer or post-processing hacks. Marpa allows the parser to do this more elegantly. Marpa knows the parsing context at all times and can comnunicate this to a user's customized code. Marpa also has the ability to use the parsing context to decide when to switch control from the syntax-driven logic to a user's customized procedural logic, and for the syntax-driven logic to take control back when the procedural logic wants to give it back. This allows the variable-name-versus-type ambiguity to be handled by specifically targeted code which knows the full context of the decisions it needs to make. This code can be written more directly, simply and clearly than was possible with previous parsing methods.

Compilers?

Above I mentioned special-purpose compilers. What about production compilers? MarpaX::Languages::C::AST's upper layers are in Perl, so the speed, while acceptable for special-purpose tools, will probably not be adequate for production. Perhaps a future Marpa-powered C parser will rewrite those upper layers in C, and make the race more interesting.

To learn more

Marpa::R2 is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa also has a web page. For questions, support and discussion, there is the "marpa parser" Google Group. Comments on this post can be made there.

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.