A Marpa-powered SQL-2003 parser

By Jean-Damien Durand on December 14, 2014 4:10 PM

This blog entry is to announce MarpaX-Languages-SQL2003-AST, a Marpa::R2 SQL-2003 parser.

The SQL language is quite big, and consist of approximately ~2500 production rules, and almost as many terminal tokens.

Thanks to Ron's SQL page, the EBNF describing SQL was translated to Marpa::R2's Scanless BNF.

The most interesting things behind this exercice were:

Dealing with known (and native!) SQL grammar ambiguities

Do some semantic actions in the case when the parsing of a terminal depend on a terminal that is after

Provide something useful on the command-line

A note as prolog, one has to eliminate some usual pitfalls with SQL-2003. For instance, it requires a FROM after the SELECT, column names as well their eventual AS alias are not single-quoted, and so on.

Grammar ambiguities can be almost all bypassed natively with Marpa by (ab)using its rank keyword on production rules, e.g.:



L ::= X rank => 0 | Y rank => -1

will give precedence to X if the parse tree has positively evaluated both alternatives. This has been used in all the SQL grammar, assuming that, as in flex where terminals priority by default is their order of appearance, order of appearance of rule alternatives give implicitely their rank.
This does not solve all the issues if more than one terminal matches the input, and then Marpa::R2 latm (== longest acceptable token match) mode comes to the rescue: Marpa::R2 will reject unexpected terminals, and keep only the longest ones that are expected at the position where it is.
The "position" itself is ambiguity aware; and we let Marpa::R2 deal with that.

An interesting case was to add semantics where the parsing of something depend of something that is after. For instance the Unicode Delimited Identifier:



U&"\0441\043F\0430\0441\0438\0431\043E" UESCAPE '#'

Yes, you see that one can say what is the escape character, but this is after the string containing escape sequences... In usual Marpa::R2 way of working, that would have been done at lexing phase. In our case, the whole sentence is considered as a terminal, and a sub-grammar is executed during the parse tree evaluation phase.

Providing a tool to see what really is the AST is quite attractive, therefore a command-line utility sql2003ast is provided, and it is showing in XML format what is the AST -;

For example:



% echo 'delete from mytable' | sql2003ast

<?xml version="1.0" encoding="UTF-8"?>

<SQL_Start_Sequence>

  <SQL_Start_Many>>

    <SQL_Start>

      <Preparable_Statement>

        <Preparable_SQL_Data_Statement>

          <Delete_Statement_Searched>

            <DELETE start="0" length="6" text="delete"

value="delete"/>

            <FROM start="7" length="4" text="from" value="from"/>

            <Target_Table>

              <Table_Name>

                <Local_Or_Schema_Qualified_Name start="12" length="7"

text="mytable" value="mytable"/>

              </Table_Name>

            </Target_Table>

            <Gen3133_Maybe/>

          </Delete_Statement_Searched>

        </Preparable_SQL_Data_Statement>

      </Preparable_Statement>

    </SQL_Start>

  </SQL_Start_Many>

</SQL_Start_Sequence>

Once again, Marpa proves it effectiveness, stability and maturity. Many thanks to its author Jeffrey Kegler.

Have fun -;

0 comments

Tagged as:

BNF, Marpa, Marpa::R2, SQL

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Jean-Damien Durand

About::Me::And::Perl

More info »

Jean-Damien Durand