C::Blocks Advent Day 12

This is the C::Blocks Advent Calendar, in which I release a new treat each day about the C::Blocks library. Yesterday I compared C::Blocks to other TinyCC-based Perl libraries. Today I will focus on a fun diversion: hacking on the parser with a bit of C::Blocks code.

I must admit that some of yesterday's results have me a bit depressed. I've put a lot of work into this library, and I am really surprised and worried about the performance cliffs I illustrated yesterday. Today, though, we're going to have some fun.

With C::Blocks it is easy and fairly cheap to get access to the entire Perl API. One part of the C API with no built-in Perl-level interface is the keyword plugin mechanism. There are modules for handling this, including Keyword::API (the "original"), Keyword::Simple, and most recently Keyword::Declare, a Damian module that is itself built on Keyword::Simple. Damian mentions that "you either have to write them in XS (shiver!) or use" the other modules I mentioned. Well, today we will shiver! Grab a coat and some hot chocolate because I'm going to use C::Blocks to write a keyword hook in C, just to show how simple it is to grab a random part of the Perl C API and play with it.

There isn't a lot of documentation on writing keyword plugins with C. The best you'll get is the discussion of PL_keyword_plugin from perlapi, which is not bad, but doesn't exactly hold your hand, either. There are two basic things we need to do: (1) install our keyword plugin, and (2) make it do something interesting.

How to Install a Keyword Hook

To get started, here is a skeleton C::Blocks script that installs a nearly useless keyword plugin:

use strict;
use warnings;
use C::Blocks;
use C::Blocks::PerlAPI;

# The simplest keyword hook. It just prints what it
# is given and hands off to the next handler.
clex {
    /* This holds on to the previously installed hook,
     * which we'll eventually call. */
    int (*next_keyword_plugin)(pTHX_ char *, STRLEN, OP **);

    /* Here's our dumb hook */
    int my_keyword_plugin(pTHX_ char *keyword_ptr, STRLEN keyword_len, OP **op_ptr) {
        /* keyword_ptr is not NUL-terminated, so cap the read with %.*s */
        printf(" -> given keyword [%.*s]\n", (int)keyword_len, keyword_ptr);
        return next_keyword_plugin(aTHX_ keyword_ptr, keyword_len, op_ptr);
    }
}

# Install the keyword hook
BEGIN {
    cblock {
        printf("Hooking parser\n");
        next_keyword_plugin = PL_keyword_plugin;
        PL_keyword_plugin = my_keyword_plugin;
    }
    print "Hook installed\n";
}
# From here onward, all keyword-looking things will be printed

print "Hello!\n";
use constant N => 10;
for (my $i = 0; $i < N; $i++) {
    printf "Step $i of %d\n", N;
}

BEGIN {
    print "When is this seen?\n";
    print "And when is THIS seen?\n";
}

# invalid function:
keyword1();

print "All done!\n";

The keyword hook is a C function, so we have to declare it in a clex block. We also declare next_keyword_plugin, which holds the function that used to be at PL_keyword_plugin. Since Larry has already written a keyword handler that does something useful, we'll just call this at the end of our useless keyword handler.
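One detail worth spelling out: perlapi notes that keyword_ptr points into the parser's input buffer and is not null-terminated, so anything that prints it has to respect keyword_len. Here is a minimal standalone sketch of that idea in plain C, with no Perl headers. The helper name format_keyword is hypothetical and only stands in for the printf call inside a hook.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of Perl or C::Blocks): formats the first
 * `len` bytes at `word` into `out`, the way a hook would print its
 * keyword. `word` does not need to be NUL-terminated. */
int format_keyword(char *out, size_t out_size, const char *word, size_t len) {
    /* "%.*s" takes an int precision that caps how many bytes are read.
     * "%*s" would only set a minimum field width and keep reading until
     * it finds a NUL byte. */
    return snprintf(out, out_size, " -> given keyword [%.*s]", (int)len, word);
}
```

Calling format_keyword(buf, sizeof buf, "print \"Hello!\\n\";", 5) fills buf with only the five bytes of print, even though the source line continues past the keyword.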

Just declaring a keyword handler doesn't do anything. I have to install it, which is what the cblock inside the BEGIN block does for me. All it has to do is back up PL_keyword_plugin and assign our new handler to it. That's it! All the code that follows is just there to illustrate what gets passed to the keyword handler.
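The save-then-replace pattern is easier to see with Perl stripped away. The following is a toy model in plain C, not the real API: current_hook plays the role of PL_keyword_plugin, the hook signature drops pTHX_ and the OP ** argument, and the return values 0 and 1 are stand-ins in the spirit of KEYWORD_PLUGIN_DECLINE and "handled". All names here are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the keyword plugin signature. */
typedef int (*keyword_hook)(const char *word, size_t len);

/* Plays the role of Perl's built-in handler: it declines everything. */
int default_hook(const char *word, size_t len) {
    (void)word;
    (void)len;
    return 0;   /* 0 means "not mine" in this toy model */
}

/* Plays the role of PL_keyword_plugin. */
keyword_hook current_hook = default_hook;

/* The previously installed hook, saved at install time. */
keyword_hook next_hook;

/* How many words our hook has been shown. */
int seen_count = 0;

/* Our hook: claims "typeof", hands everything else down the chain. */
int counting_hook(const char *word, size_t len) {
    seen_count++;
    if (len == 6 && memcmp(word, "typeof", 6) == 0) return 1;
    return next_hook(word, len);
}

/* Install: remember whoever was there before, then take their place. */
void install_counting_hook(void) {
    next_hook = current_hook;
    current_hook = counting_hook;
}
```

After install_counting_hook(), a call such as current_hook("print", 5) reaches counting_hook first and then falls through to default_hook. Saving the old pointer is what lets several independent hooks coexist; a hook that forgets to chain silently disables every hook installed before it.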

Simple Keyword Hook Output

It may be useless, but I find the output to be fascinating:

$ perl test.pl
Hooking parser
Hook installed
 -> given keyword [print]
 -> given keyword [use]
 -> given keyword [for]
 -> given keyword [my]
 -> given keyword [N]
 -> given keyword [printf]
 -> given keyword [N]
 -> given keyword [BEGIN]
 -> given keyword [print]
 -> given keyword [print]
When is this seen?
And when is THIS seen?
 -> given keyword [keyword1]
 -> given keyword [print]
Hello!
Step 0 of 10
Step 1 of 10
Step 2 of 10
Step 3 of 10
Step 4 of 10
Step 5 of 10
Step 6 of 10
Step 7 of 10
Step 8 of 10
Step 9 of 10
Undefined subroutine &main::keyword1 called at test.pl line 42.

The first thing that caught my attention was that the keyword hook was passed some important built-ins: print, my, use, and even BEGIN! It is conventional to think of the keyword API as a means for adding keywords, but apparently you can use it to modify the behavior of existing ones! Could we completely disable these? I'll try that shortly.

The next thing that caught my attention was the way the hook interacted with BEGIN blocks. The second BEGIN block makes it clear that the parser fully parses the contents of the BEGIN block before running it; this is not a line-by-line parse/execute mode. You probably knew that already, but I feel it's nice to get explicit evidence. This then explains why the hook didn't trigger on the print statement in the first BEGIN block, the one that installed the hook in the first place: that print had already been parsed. This behavior is also quite different from a source filter, which only hooks when you invoke it with a use statement. Interestingly, this means you can potentially use BEGIN blocks to interact with your keyword hook.

True to its name, the keyword hook does not get called on variables, operators, literals, or comments. This is a Good Thing: having to wade through all of those is one of the big annoyances of source filters. Smart::Comments is cool, but implementing a source filter properly is hard. Keyword hooks only fire on keyword-like words, but they fire every single time. This means you can keep the scope of your changes minimal.

Finally, we can see that every keyword gets processed before the script starts running. This is not a big surprise, since the keyword hook fires at parse time.

Disabling Things

I mentioned earlier that the hook got called with fundamental keywords. Can we do anything with them? According to what I've seen, Perl has already "consumed" the keyword, so there's no way to completely remove it. You can, however, nullify it. We can do this by inserting a null op and returning a parse result (rather than returning whatever next_keyword_plugin might have returned).

To see this in action, replace my_keyword_plugin with this:

    int my_keyword_plugin(pTHX_ char *keyword_ptr, STRLEN keyword_len, OP **op_ptr) {
        printf(" -> given keyword [%.*s]\n", (int)keyword_len, keyword_ptr);

        /* disable BEGIN blocks */
        if (keyword_len == 5 && memEQ(keyword_ptr, "BEGIN", 5)) {
            *op_ptr = newOP(OP_NULL, 0);
            return KEYWORD_PLUGIN_STMT;
        }

        /* everything else as normal */
        return next_keyword_plugin(aTHX_ keyword_ptr, keyword_len, op_ptr);
    }

To disable a keyword of any sort, we have to do two things. First, we need to return an op. Perl has already parsed the keyword and knows something is there, so we are obligated to give it an op of some sort. In this case, since we want to disable BEGIN blocks, we return a null op. Second, we need to tell Perl whether the keyword was parsed as a complete statement (KEYWORD_PLUGIN_STMT) or as an expression (KEYWORD_PLUGIN_EXPR). Here I picked a statement because an expression would need a terminating semicolon, while a complete statement does not. The braces that followed BEGIN are still sitting in the buffer, so Perl goes on to parse them as an ordinary bare block.
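Because the keyword pointer is not null-terminated, the comparison against "BEGIN" has to be bounded by the length. Perl's handy.h offers memEQ and strnEQ for this. The sketch below shows the same test in plain C with a hypothetical helper, keyword_is, so it can be compiled without Perl headers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when the `len` bytes at `word` are exactly
 * the NUL-terminated literal `name`. `word` itself need not be
 * NUL-terminated, which matches what a keyword hook receives. */
int keyword_is(const char *word, size_t len, const char *name) {
    /* Check the length first so "BEGINNER" never matches "BEGIN",
     * then compare only the bytes that belong to the keyword. */
    return len == strlen(name) && memcmp(word, name, len) == 0;
}
```

With the buffer "BEGIN { print 1 }" and a length of 5, keyword_is reports a match for "BEGIN", whereas a plain strcmp against the same buffer would not, because it would keep comparing into the text after the keyword.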

If I put this function in place of my previous keyword plugin, the output verifies that I can turn BEGIN into a no-op:

$ perl test.pl
Hooking parser
...
 -> given keyword [N]
 -> given keyword [BEGIN]
 -> given keyword [print]
 -> given keyword [print]
 -> given keyword [keyword1]
 -> given keyword [print]
Hello!
Step 0 of 10
...
Step 9 of 10
When is this seen?
And when is THIS seen?
Undefined subroutine &main::keyword1 called at test.pl line 51.

Other Keyword Possibilities

You can do many things with a C-based keyword hook. The Perl-level keyword hooks all focus on text replacement: modifying the contents of the source code buffer before it gets sent to the Perl parser. This makes them rather like specialized source filters. You could do the same thing with a C-based keyword hook together with the lexer interface. If that's your aim, you would be better served using the Perl wrappers. Perl really is the best tool at your disposal for string manipulation.

However, with the C interface, you could build your own sequence of ops. Using the lexer interface, you could extract some of the ensuing source code, parse it, build your own op sequence, and let Perl pick back up where you leave off. For example, you could write a keyword hook called jsblock which parses everything between the enclosing braces and builds a set of ops corresponding to the meaning of the JavaScript code. This is completely different from how a source filter works. Here you have access to the entire state of the parser, including things like variable types, and you can add new lexical pad entries, for example. In contrast to a source filter, a C-based keyword hook is incredibly powerful and rich.

This is the heart of how C::Blocks itself works. When it sees a recognized keyword, it pulls out all the code until it finds the matching closing brace. If it is dealing with a cblock, it adds an op that executes the code within the block. Otherwise it simply adds a null op and stores the compiled code and symbol table in a safe place for later retrieval. During the extraction process, when it encounters a sigiled variable name, it can query the Perl parser for any type information for that variable name. This kind of rich information is available to keyword hooks, but not to source filters.
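To give a feel for the "find the matching closing brace" step, here is a deliberately simplified sketch in plain C. It is not C::Blocks's actual implementation: it works on a buffer that is already fully in memory (a real hook would refill the buffer through the lexer interface), and it counts braces inside string literals and comments as if they were real. The function name find_matching_brace is hypothetical.

```c
#include <stddef.h>

/* Returns a pointer to the '}' that closes the '{' at `open`, or NULL if
 * `open` is not a '{' or the buffer ends before the block is closed.
 * Simplification: braces inside strings, character literals, and
 * comments are counted like any other brace. */
const char *find_matching_brace(const char *open, const char *end) {
    int depth = 0;
    const char *p;

    if (open >= end || *open != '{') return NULL;

    for (p = open; p < end; p++) {
        if (*p == '{') {
            depth++;                 /* entering a nested block */
        } else if (*p == '}') {
            depth--;                 /* leaving a block */
            if (depth == 0) return p;
        }
    }
    return NULL;                     /* ran out of buffer */
}
```

Given the text "{ if (x) { y(); } } print 1;", the function skips the inner pair and returns the second closing brace, which is where Perl would pick up parsing again.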

Retrieving Type Information

I would like to show you one of the useful parts of Perl's C API that can enrich a potential C-based keyword. This is the process for retrieving a lexical variable's type. That is, if we have a variable declared as

my Class::Foo $bar;

how would a keyword figure out that $bar has a type of Class::Foo? First it needs to find the location of $bar in the current lexical pad:

int var_offset = (int)pad_findmy_pv(varname_with_sigil, 0);
if (var_offset != NOT_IN_PAD) {
    ...
}

Second it needs to get the stash associated with the type. Replace those ... with

HV * stash = PAD_COMPNAME_TYPE(var_offset);
if (stash) {
    ...
}

You can think of the stash as the hash underlying the Perl package. If we knew the name of the package, we would have gotten this stash with something like:

HV * stash = gv_stashpvs("Package::Name", 0);

We're going in the reverse direction: PAD_COMPNAME_TYPE gives us the stash directly, and we want its name. Alternatively, you might want to retrieve a method from this package, which is what C::Blocks does with type information. In that case you would use gv_fetchmeth_autoload or a similar function. Since we want to just get the name of the package, we would replace the ... in the previous if statement with something involving HvENAME(stash), like:

printf("Variable %s is of type %s\n", varname_with_sigil, HvENAME(stash));

To make this more useful, though, I'll actually have my keyword, typeof, insert an OP that returns the type as a Perl scalar. This way, the typeof keyword can be used in expressions. Putting this all together, I have this doozy:

use strict;
use warnings;
use C::Blocks;
use C::Blocks::PerlAPI;
use C::Blocks::Types qw(uint);

# Create the "typeof" keyword
clex {
    /* This holds on to the previously installed hook,
     * which we'll eventually call. */
    int (*next_keyword_plugin)(pTHX_ char *, STRLEN, OP **);

    /* returns true if character is part of a valid variable name
     * (I've omitted a couple of corner cases for brevity) */
    int _is_id_cont (char to_check) {
        if('_' == to_check || ('0' <= to_check && to_check <= '9')
            || ('A' <= to_check && to_check <= 'Z')
            || ('a' <= to_check && to_check <= 'z')) return 1;
        return 0;
    }

    /* isolates the variable name following the keyword. Returns the
     * location of the character just past the varname. */
    char * identify_varname_end (pTHX) {
        /* clear out whitespace and prime the pump */
        lex_read_space(0);
        if (PL_parser->bufptr == PL_parser->bufend) lex_next_chunk(0);

        /* make sure keyword is followed by a variable name */
        char * curr = PL_parser->bufptr;
        if (*curr != '$' && *curr != '@' && *curr != '%') {
            croak("typeof expects variable");
        }
        /* find where the variable name ends */
        while (1) {
            curr++;
            if (curr == PL_parser->bufend) {
                int offset = curr - PL_parser->bufptr;
                lex_next_chunk(LEX_KEEP_PREVIOUS);
                curr = PL_parser->bufptr + offset;
            }
            if(!_is_id_cont(*curr)) return curr;
        }
    }

    /* Here's our actual hook for "typeof" */
    int my_keyword_plugin(pTHX_ char *keyword_ptr, STRLEN keyword_len, OP **op_ptr) {
        if (keyword_len == 6 && memEQ(keyword_ptr, "typeof", 6)) {

            /* get end of varname and set it to null so we can use the
             * buffer directly in pad_findmy_pv */
            char * end = identify_varname_end(aTHX);
            char backup = *end;
            *end = '\0';

            /* get the name and store it in a constant op */
            SV * to_return = &PL_sv_undef; /* default to undef */
            int var_offset = (int)pad_findmy_pv(PL_parser->bufptr, 0);
            if (var_offset != NOT_IN_PAD) {
                HV * stash = PAD_COMPNAME_TYPE(var_offset);
                if (stash) {
                    char * name = HvENAME(stash);
                    if (name) {
                        to_return = newSVpv(name, 0);
                    }
                }
            }
            *op_ptr = newSVOP(OP_CONST, 0, to_return);

            /* restore the end character and discard up to it */
            *end = backup;
            lex_unstuff(end);
            return KEYWORD_PLUGIN_EXPR;
        }

        /* everything else as normal */
        return next_keyword_plugin(aTHX_ keyword_ptr, keyword_len, op_ptr);
    }
}

# Install the keyword hook
BEGIN {
    cblock {
        next_keyword_plugin = PL_keyword_plugin;
        PL_keyword_plugin = my_keyword_plugin;
    }
}

# Use an explicit package name
my C::Blocks::Type::NV $fraction;
my uint $counter;
my $basic_var;
print "\$fraction is of type ", typeof $fraction, "\n";
print "\$counter is of type ", typeof $counter, "\n";
if (defined typeof $basic_var) {
    print "\$basic_var is of type ", typeof $basic_var, "???\n";
}
else {
    print "\$basic_var's type is not defined\n";
}

my C::Blocks @foo; # goofy!
print "Type of \@foo is ", typeof @foo, "\n";

# print "typeof junk is ", typeof junk, "\n";

This is enormous. Let me break it down into its chunks. my_keyword_plugin looks for typeof keywords. It looks up the type of the variable that follows and stores it in a constant op, which returns the type name at runtime. _is_id_cont is a utility function for identifying characters that can be part of a variable name. identify_varname_end works with the lexer, first discarding any whitespace, then expanding the buffer until it contains the first character after the variable name.
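The scanning logic is easier to follow without the lexer calls mixed in. Below is a standalone sketch in plain C of the same idea, assuming the whole line is already in memory so there is no need for lex_next_chunk. The names is_id_cont and varname_end are hypothetical stand-ins for the functions in the listing, and the same corner cases are omitted (for example, a bare sigil with no name after it is accepted).

```c
#include <stddef.h>

/* True if `c` can appear inside a variable name (ASCII only). */
int is_id_cont(char c) {
    return c == '_'
        || (c >= '0' && c <= '9')
        || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z');
}

/* Returns a pointer just past the variable name that starts at `start`,
 * or NULL if `start` does not begin with a sigil. Never reads at or
 * beyond `end`. */
const char *varname_end(const char *start, const char *end) {
    const char *p = start;

    /* The name must begin with one of the three sigils. */
    if (p >= end || (*p != '$' && *p != '@' && *p != '%')) return NULL;
    p++;

    /* Walk forward while the characters still belong to the name. */
    while (p < end && is_id_cont(*p)) p++;
    return p;
}
```

For the text "$counter, ..." this returns a pointer to the comma, eight bytes in. In the real hook, the NULL case corresponds to the croak, and the returned pointer is what gets handed to lex_unstuff once the lookup is done.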

The last 15 lines are where we get to see it in action. The result of this script is:

$ perl test.pl
$fraction is of type C::Blocks::Type::NV
$counter is of type C::Blocks::Type::uint
$basic_var's type is not defined
Type of @foo is C::Blocks

If you uncomment the last line, you would instead see this:

$ perl test.pl
typeof expects variable at test.pl line 105.

So there you have it. This keyword even dies when you use it improperly!

As you can see, string manipulation in C is extremely verbose, which is why Damian shivered at the thought of it. I'm not going to disagree: it's annoying. But, if we put this effort into writing our keyword parser, we are rewarded at the end by being able to (1) get at a variable's type info and (2) build a constant op that returns this type info, making it accessible at runtime. This might seem useless: you would think you would know your variable's type, just like you know your current package's name and should not need __PACKAGE__. However, this is not actually always true, as illustrated with typeof $counter. Here we use a short name, uint, which actually represents the longer C::Blocks::Type::uint. This could also be useful if you ever wanted to use a pattern in which you declare a variable, then create a new copy of it:

my Class::Name $var;
$var = (typeof $var)->new;

To the best of my knowledge, it's not possible to get a variable's type information any other way, and this just made it accessible at runtime. I don't know about you, but I think that's pretty neat!

Conclusion

Today I showed how to use C::Blocks to play around with Perl's C API, in particular, how to write your own keyword hook. I focused on the keyword API because you can do some pretty crazy and neat things with it. However, you could use this to explore any facet of Perl's C API that amuses you, all within the context of a humble Perl script. In this way, C::Blocks makes it possible for mere mortals to begin playing around with Perl's C API.

About David Mertens

This is my blog about numerical computing with Perl.