XML::Invisible - writing parsers without pain or code

Version 0.03 of XML::Invisible is now on CPAN. This lets you write parsers that produce XML-like Abstract Syntax Trees (AST), or actual XML documents, without writing any code. Why did I write it?

Parsing: a tiny introduction

Parsing is turning a text input, into semantically valuable output. It is often broken into the stages of lexing (turning the initial text into tokens - errors detected if invalid tokens given), parsing (structuring those tokens into ASTs - errors if structure wrong), later processing (doing something with the AST).

There are a number of ways of writing parsers generally. The most maintainable way is using as much of a declarative style as possible, usually by writing a grammar. There are various options in Perl, including Marpa, Parse::RecDescent, and Pegex. For each of these, you have to write a grammar (obviously), and write some code to handle the text inputs and parsing results.

Side note: I believe that completely separating the parsing process from semantically operating on the AST is good engineering in almost all circumstances. The only exception I have seen so far is the parsing example often used to compare parsing frameworks: data transformations like parsing JSON. For these, there is not a great deal of point in constructing an AST, then immediately converting into another very semantically similar data structure. For anything more semantically complex, I claim you should get your AST, then give it semantic meaning with other (hopefully highly declarative!) code.

Discussion of XML::Invisible

XML::Invisible follows the concept of Pegex, in having a single grammar that effectively does both lexing and (highly controllable) parsing, but does not do any later processing.

I wanted to get so declarative that I would be able to iterate adjusting the grammar, and inspecting parse outputs (i.e. an AST), without touching any actual code. This was because I was creating a new language for expressing arbitrary transformations of JSON-able data structures without writing code, and I knew that I wouldn't be done with the language design until the simple example transformations were parsing into a comprehensible AST. As I kept tweaking the grammar, I'd need to change the Pegex receiver to match the updated shape, which was a slow and painful process. I felt there must be a better way!

Having stumbled across Steven Pemberton's "Invisible XML" concept a little while ago, eventually the idea occurred of using that together with Pegex to achieve code-free parsing. XML::Invisible operates by creating a Pegex receiver generates ASTs that are semantically equivalent to simple XML documents (representing nodes with attributes, and nodes that are plain text). It [mis]uses a couple of existing annotations to Pegex grammars to control the shape of those XML documents.

Let's look at the fairly trivial grammar in Steven's draft specification, for parsing trivial arithmetic expressions, adapted for Pegex format. Rather than the end product (given in the POD for XML::Invisible), which is a showcase for the features of ixml, let's start with the most naive grammar, then iterate it until we get where we want to go:

expr: open arith close
open: /( LPAREN )/
close: /( RPAREN )/
arith: left op right
left: name
right: name
name: /(a)/ | /(b)/
op: sign
sign: /( PLUS )/

Put this in a file called grammar.pgx. Then install the necessary modules, and parse a trivial expression, (a+b), into an XML document:

cpanm XML::Invisible XML::Twig
perl -MXML::Invisible=make_parser -e 'print make_parser(join "", <>)->(q[(a+b)])->toStringC14N(1)' grammar.pgx | xml_pp

That gives this XML output:

<expr>
  <open>(</open>
  <arith>
    <left>
      <name>a</name>
    </left>
    <op>
      <sign>+</sign>
    </op>
    <right>
      <name>b</name>
    </right>
  </arith>
  <close>)</close>
</expr>

Let's turn the open and close, and also the sign, into attributes, using the + annotation, like this:

expr: +open arith +close
open: /( LPAREN )/
close: /( RPAREN )/
arith: left op right
left: name
right: name
name: /(a)/ | /(b)/
op: +sign
sign: /( PLUS )/

which gives:

<expr close=")" open="(">
  <arith>
    <left>
      <name>a</name>
    </left>
    <op sign="+"></op>
    <right>
      <name>b</name>
    </right>
  </arith>
</expr>

Now let's flatten out the arith entity, which doesn't add much, using the - annotation. This will also demonstrate what that does for flattening the op off the arith entity, which is to incorporate its attribute into the containing entity. The new grammar:

expr: +open -arith +close
open: /( LPAREN )/
close: /( RPAREN )/
arith: left -op right
left: name
right: name
name: /(a)/ | /(b)/
op: +sign
sign: /( PLUS )/

gives:

<expr close=")" open="(" sign="+">
  <left>
    <name>a</name>
  </left>
  <right>
    <name>b</name>
  </right>
</expr>

As a final tweak, let's pretend it's more useful to have the left entity's name be an attribute rather than a contained node, and to flatten the right entity's name out - achieved by adding two characters to the grammar, and changing no code at all. Grammar:

expr: +open -arith +close
open: /( LPAREN )/
close: /( RPAREN )/
arith: left -op right
left: +name
right: -name
name: /(a)/ | /(b)/
op: +sign
sign: /( PLUS )/

Result:

<expr close=")" open="(" sign="+">
  <left name="a"></left>
  <right>b</right>
</expr>

Relationship with XML Schema Definition (XSD)

It should be completely possible to take the Pegex grammar written for use with this module, and generate from it an XSD, since all the required information is available. However, I have not yet written any code to do so. Contributions and/or even informative GitHub issues (see link on the MetaCPAN page for the repo) are extremely welcome!

Future directions

One idea is to use the given grammar to allow turning the AST (i.e. XML document) back into a canonicalised version of the input document.

Acknowledgements

Thanks to the mighty Joel Berger for helping make this be more coherent!

Leave a comment

About Mohawk

user-pic I blog about Perl.