A Marpa/Moops powered M4 implementation

By Jean-Damien Durand on April 11, 2015 8:16 AM under M4, Parsing

Introduction

The M4 language is a powerful macro processor, Turing complete as well as a practical programming language. It is, in particular, the core tool behind GNU Autoconf.

The MarpaX::Languages::M4 package is a Marpa::R2/Moops powered implementation of it, 99% compatible with the GNU M4 version ([1]), and it has switches to alter its behaviour as wanted, so that one can have e.g. POSIX M4 behaviour as well.

Command-line

MarpaX::Languages::M4 is distributed with an m4pp command-line tool that supports all the GNU M4 options, plus some other handy ones, for example:

conversion to UNIX native end-of-line
comment start and end delimiters
string start and end delimiters

and some "advanced" options to alter the behaviour or extend M4 as you wish:

order in which wrapped buffers are unwrapped at the end (LIFO or FIFO)
any default can be altered on the command-line
unlimited number of bits for eval arithmetic (thanks to the really remarkable Bit::Vector package)
use Perl or GNU Emacs regexps (via the new re::engine::GNU package)
etc...

For instance, with GNU M4 it is quite ugly to get wrapped buffers processed in FIFO order instead of its LIFO default, whereas this is just an option with our package:

(The following examples assume they are executed via a Bourne shell or equivalent)

  echo "changequote(\`[', \`]')dnl backslash only for the shell
  define([cleanup1], [[cleanup1]])dnl
  define([cleanup2], [[cleanup2]])dnl
  m4wrap([cleanup1])m4wrap([ ])m4wrap([cleanup2])dnl
  dnl" | m4pp --m4wrap-order FIFO
  cleanup1 cleanup2

  echo "changequote(\`[', \`]')dnl backslash only for the shell
  define([cleanup1], [[cleanup1]])dnl
  define([cleanup2], [[cleanup2]])dnl
  m4wrap([cleanup1])m4wrap([ ])m4wrap([cleanup2])dnl
  dnl" | m4pp --m4wrap-order LIFO
  cleanup2 cleanup1
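
To make the difference between the two orders concrete, here is a minimal Python model of the idea. It is only a sketch: the function name unwrap and the list of wrapped texts are assumptions for illustration, not the package's code. It simply treats the m4wrap'ed texts as a queue that is emptied from the front (FIFO) or from the back (LIFO).

```python
from collections import deque


def unwrap(wrapped, order="LIFO"):
    """Join the wrapped texts in the order they would be rescanned at end of input.

    Hypothetical helper: models only the ordering, not macro expansion.
    """
    if order not in ("FIFO", "LIFO"):
        raise ValueError(f"unknown order {order!r}")
    pending = deque(wrapped)
    out = []
    while pending:
        # FIFO takes the oldest wrapped text first, LIFO the most recent one
        out.append(pending.popleft() if order == "FIFO" else pending.pop())
    return "".join(out)


# The three texts registered by the m4wrap calls in the examples above
wrapped = ["cleanup1", " ", "cleanup2"]
print(unwrap(wrapped, "FIFO"))  # cleanup1 cleanup2
print(unwrap(wrapped, "LIFO"))  # cleanup2 cleanup1
```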

Annoyed by the changequote? This can be changed directly on the command-line:

  echo "dnl
  define([cleanup1], [[cleanup1]])dnl
  define([cleanup2], [[cleanup2]])dnl
  m4wrap([cleanup1])m4wrap([ ])m4wrap([cleanup2])dnl
  dnl" | m4pp --quote-start '[' --quote-end ']'
  cleanup2 cleanup1

Or, to do 64-bit arithmetic:

  echo "eval([0xffffffff + 1])" \
  | m4pp --quote-start '[' --quote-end ']' --integer_bits 64
  4294967296

Nevertheless, remember that M4 arithmetic is always signed:

  echo "eval([0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF])
        eval([0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF + 1])" \
  | m4pp --quote-start '[' --quote-end ']' --integer_bits 128
  -1
  0

  echo "eval([0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF])
        eval([0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF + 1])" \
  | m4pp --quote-start '[' --quote-end ']' --integer_bits 129
  340282366920938463463374607431768211455
  -340282366920938463463374607431768211456
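
The outputs above are what one would expect from ordinary two's-complement wraparound at the chosen width. The following Python sketch reproduces them; wrap_signed is a hypothetical helper written for this illustration and says nothing about how Bit::Vector works internally.

```python
def wrap_signed(value, bits):
    """Reduce an integer to a signed two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask                 # keep only the low `bits` bits
    if value >> (bits - 1):       # sign bit set: the value is negative
        value -= 1 << bits
    return value


ALL_ONES_128 = int("F" * 32, 16)  # 0xFFFF...FFFF, i.e. 2**128 - 1

print(wrap_signed(0xFFFFFFFF + 1, 64))     # 4294967296
print(wrap_signed(ALL_ONES_128, 128))      # -1
print(wrap_signed(ALL_ONES_128 + 1, 128))  # 0
print(wrap_signed(ALL_ONES_128, 129))      # 340282366920938463463374607431768211455
print(wrap_signed(ALL_ONES_128 + 1, 129))  # -340282366920938463463374607431768211456
```

With 128 bits the all-ones pattern has its sign bit set, hence -1; with one more bit the same pattern is a positive number, and adding 1 is what overflows into the sign bit.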

Since this is a Perl implementation, being able to use Perl's regexps looked inevitable -;

  echo 'define([foo], [bar])dnl
  define([echo], [$*])dnl
  changecom([/*], [*/])dnl Because comments have higher precedence
  changeword([#([_a-zA-Z0-9]*)])#dnl
  #echo([foo #foo])' \
  | m4pp --quote-start '[' --quote-end ']' --regexp-type perl
  foo bar

  echo 'define([foo], [bar])dnl
  define([echo], [$*])dnl
  changecom([/*], [*/])dnl Because comments have higher precedence
  changeword([#\([_a-zA-Z0-9]*\)])#dnl
  #echo([foo #foo])' \
  | m4pp --quote-start '[' --quote-end ']'
  foo bar
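
The two examples differ only in how the group is written: plain parentheses in the Perl flavour, backslashed parentheses in the GNU Emacs flavour. With changeword, the text matched by the first group is what gets looked up as the macro name. The Python sketch below (Python's re syntax is close to Perl's here) shows which names the Perl-style pattern extracts. The helper macro_names is hypothetical, and a global search is a simplification of real M4 tokenization.

```python
import re

# Same pattern as the Perl-flavoured changeword above: a '#' followed by the name
WORD = re.compile(r"#([_a-zA-Z0-9]*)")


def macro_names(text):
    """Return the first-group captures, i.e. the names that would be looked up."""
    return [m.group(1) for m in WORD.finditer(text)]


print(macro_names("#echo([foo #foo])"))  # ['echo', 'foo']
```

The bare foo is not matched because it lacks the leading #, which is why it is left untouched in the output while #foo expands to bar.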

API

The goal was to have a pure-Perl M4 with a full object-oriented API. The choice of Moops was not only because I like it (!), but because it does its job very well. Basically, the API is what the implementation role requires. Many thanks to Toby Inkster for his work on modern Perl 5 -;

Moops appears to be a very nice way to write modern Perl, leaving the OO implementation elsewhere (Moose, Mouse, Moo and so on), and this is precisely what I was expecting from a pure sugar layer: just do the sugar (I hope I am not too wrong in blogging this -;).

Marpa to the rescue

This post would also like to promote an interesting technique with the great Marpa::R2 package from Jeffrey Kegler:

Usually, we expect a parser to take some real input, possibly providing some hooks. E.g. something like:

  my $input = 'This is the input';
  $parser->doit($input);

Ahem, the point here is that the input is real. But Marpa::R2 goes further. Although it provides a generic BNF implementation and a Scanless Interface on top of it, nothing forces the programmer to let Marpa use a real input: once a grammar is compiled, the programmer can start the scanless interface on a fake input, and feed tokens programmatically:

  my $g = Marpa::R2::Scanless::G->new({source => \do {"BNF"}});
  my $r = Marpa::R2::Scanless::R->new({grammar => $g});
  $r->read( \'FAKE INPUT' );

Quoting Jeffrey's documentation on input streams: "Virtual input streams complicate the idea of parse location, but they are essential for some applications. Implementing the C language's pre-processor directives requires either two passes, or a virtual approach to the input." M4 being a pre-processor, we chose the virtual approach.

Now the question is: great, but where am I within the grammar? Marpa tells you which terminals are expected at any point:

  my @expectedTerminals = @{ $r->terminals_expected };

So it is then easy to handle 100% of the tokens in user-land, while still benefiting from Marpa's SLIF facilities:

  $r->lexeme_read( 'TERMINAL_NAME', $start, $length, $value );

where $start is a position in the stream, $length is the length of the token, and $value is the attached value (it can be anything). From a pure grammar point of view, $start and $length are (Jeffrey, forgive me if I am wrong) used to handle overlapping lexemes. Once a terminal is read, what matters is the next step. So absolutely nothing prevents the programmer from saying that all terminals are at the same position and have an unimportant length, i.e.:

  $r->lexeme_read( 'TERMINAL_NAME', 0, 1, $value );

Here again, $value is 100% free. So we are done! A virtual input stream is used with Marpa::R2, and all tokens are pushed in the correct order. The final output of the preprocessor is nothing else but the parse tree value:

  my $valueRef = $r->value;
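
The loop this describes (ask what is expected, match it yourself, push it, repeat) can be modelled in a few lines of Python. This is only a toy under loud assumptions: ToyRecognizer is a hypothetical stand-in driven by a fixed state table for the made-up rule WORD LPAREN WORD RPAREN, not Marpa and not the M4 grammar. Its method names merely mirror the Perl calls shown above.

```python
import re


class ToyRecognizer:
    """Hypothetical stand-in for a recognizer; not Marpa's implementation."""

    # state -> {expected terminal: next state}, for: WORD LPAREN WORD RPAREN
    TABLE = {
        0: {"WORD": 1},
        1: {"LPAREN": 2},
        2: {"WORD": 3},
        3: {"RPAREN": 4},
        4: {},
    }

    def __init__(self):
        self.state = 0
        self.values = []

    def terminals_expected(self):
        return sorted(self.TABLE[self.state])

    def lexeme_read(self, name, start, length, value):
        # start and length are accepted but ignored, as with the (0, 1) trick
        if name not in self.TABLE[self.state]:
            raise ValueError(f"unexpected terminal {name}")
        self.state = self.TABLE[self.state][name]
        self.values.append(value)

    def value(self):
        # Only a complete parse has a value
        return self.values if self.state == 4 else None


# User-land lexers: one regexp per terminal name (assumed for this toy)
LEXERS = {
    "WORD": re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*"),
    "LPAREN": re.compile(r"\("),
    "RPAREN": re.compile(r"\)"),
}


def parse(text):
    r = ToyRecognizer()
    pos = 0
    while pos < len(text):
        # Try only the terminals the recognizer says it expects right now
        for name in r.terminals_expected():
            m = LEXERS[name].match(text, pos)
            if m:
                r.lexeme_read(name, 0, 1, m.group())
                pos = m.end()
                break
        else:
            raise ValueError(f"no expected terminal matches at offset {pos}")
    return r.value()


print(parse("foo(bar)"))  # ['foo', '(', 'bar', ')']
```

The real position in the text lives entirely in the driver (pos), while the recognizer only sees a sequence of named tokens with free-form values, which is the point of the virtual input approach.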

You might say: what if the parse is ambiguous? Of course Marpa::R2 handles this, but our implementation does not have to care about this case, because its grammar was written to leave no room for ambiguity.

Obviously, this means that a BNF grammar for the M4 language has been written; brave people might want to look at the parser implementation.

Side note: within M4, like other pre-processors AFAIK, arithmetic is a sub-grammar.

Conclusion

I believe this implementation of M4, which should be interchangeable with your default m4 (it is no accident that the autoconf family supports an M4 environment variable -;), breaks a lot of the barriers of all known current versions.

It is slower than all of them (which are all native applications), though more powerful -;

Have fun!

Notes

This blog post was written using Toby's TOBYINK::Pod::HTML!

[1] synchronisation lines are not yet supported, and debug output (a totally free format as per the spec) is mine, although very close to GNU's.

Tagged as:

M4, Marpa