Inline::C optimizations

At the last (first, actually) meeting last week, I mentioned to a couple of the people present an instance in which I've used the wonderful Inline::C module to replace a chunk of Perl code I couldn't quite get to work fast enough.

I would recommend jumping to the bottom of the post if you'd like to skip stories and just jump to the code.

After some ooohs, aaahs, and mockery... I said I'd follow up and post exactly the code I was talking about. I'd quite much like to see if the Perl version could be modified to be faster in any way: that would be a great learning experience!

I have been involved in MUD development for a while, and my Anacronia MUD had quite a bit of Italian players at the turn of the century. The MUD went through various iterations before closing down roughly before I moved to Scotland. Since then, I have participated in the development of a couple other MUDs, one of which is still open to this day, adding things like MCCP1/2 support (zlib compression over telnet) and other low-level things like that.

Some time ago I decided to start writing a MUD engine in Perl, using more modern tools like Moose for the objects, and POE for the event loop. I'm still considering using Async but lack good examples :(

Since content for testing and trying out things is hard to come by, I "reused/borrowed" some of the older Smaug area files, specifically the "help.are" area file. The file is in a format I know quite well, and basically contains a list of human-readable "man pages" for the MUD, littered with some sigils which the MUD then translates into coloured strings.

Handling these tricky little sigils are the object of the Inline::C optimizations.

Leaving aside the formatting sigils which mean "reset the (foreground|background) color", we're left with the sigils which govern the colors.
For example, the following string can get parsed and produce an ANSI coloured string, with the correct fore/background colour changes:
--^p&WA&yn&Wa&yc&Wr&yo&Wn&yi&Wa &CV4&^--
Here's the image of the (unsightly, I agree!) result:

The format of the sigils is simply an ampersand followed by a colour letter for a foreground color change, and a caret followed by the same colour letter for a background colour change. Capital letters on the foreground set the bold status, etc.

The letters are x, r, g, y, b, p, c, w for black/reset, red, green, yellow and so on

The ansifier does a unpack("C*",$input_string); and then uses a state machine (with state given as the ansifier's parameters) to analyse the character and either output it verbatim, change its internal state, or output a coloured ANSI sequence. This is written in Perl, and works quite well.

A specific bit of the ansifier is the place where the parser is right after a & or a ^: since the next character is supposed to be a color letter, it needs to identify if that's a valid color, and if it is use the corresponding "index number" to build the correct ANSI string. For example, "&x" with no state should output "\e[30m", "&r" with no state should output "\e[31m", etc.

As part of my small MUD test suite, I created a "bot" program which would make a number of connections to the MUD and start shouting coloured strings or reading coloured help pages, or generally stressing out what the server could do. It felt dog slow. Some times I have launched the mud under Devel::NYTProf to see what was the cause, and was not too surprised to see that one of the topmost subroutines was the one which handled the help pages request.

Well, these help pages don't change that often so they can be parsed once and then cached! That problem out of the way, the ansify() function became the most problematic one. You see, it's called whenever output is sent from the MUD to the player. That is, at every prompt or at every output. Again, I did some caching... but players' input is basically random so caching didn't help too much with this.

Delving in the internals of the report I realised that a very, very long time was spent in that tiny fraction of code which looked up whether a character was one of "xrgybpcw" and what its index should be.

The Perl function I had at the time was along these lines:

    my @colors = map { ord } qw/x r g y b p c w/;
    sub perl_getcolor {
        my $clrchar = shift;
        for ( 0 .. $#colors ) {
            return $_ if ( $clrchar == $colors[$_] );
            return $_ if ( ( $clrchar + 32 ) == $colors[$_] );
        return 255;

This works well, but it's quite slow.

If Perl had -funroll-loops, that's where it could be used ;) The resulting "unrolled" Perl subroutine isn't that great either. I have not tested the below, so it may actually be completely wrong!

    #my @colors = map { ord } qw/x r g y b p c w/;
    # 120 114 103 121 98 112 99 119
    sub opt_perl_getcolor {
        #$_[0]+32 == 120 && return 0; # 120 - 32 : 88; etc
        $_[0] == 120 && return 0;
        $_[0] == 88  && return 0; # $_[0]+32 == 120
        $_[0] == 114 && return 1;
        $_[0] == 82  && return 1;
        $_[0] == 103 && return 2;
        $_[0] == 71  && return 2;
        $_[0] == 121 && return 3;
        $_[0] == 89  && return 3;
        $_[0] == 98  && return 4;
        $_[0] == 66  && return 4;
        $_[0] == 112 && return 5;
        $_[0] == 80  && return 5;
        $_[0] == 99  && return 6;
        $_[0] == 67  && return 6;
        $_[0] == 119 && return 7;
        $_[0] == 87  && return 7;
        return 255;

In the MUD, I went with the below Inline::C version:

use Inline C  => <<'END_INLINE_C';
    unsigned char __colors[8] = "xrgybpcw";
    unsigned char getcolor(unsigned char clrchar) {
        char i = 0;
        for ( i = 0; i < 8; i++ ) {
            if (   clrchar     == __colors[i] ) { return i; }
            if ( ( clrchar+32) == __colors[i] ) { return i; }
        return (unsigned char) -1;

I'm aware that the above can be further "optimized" in the same way as the Perl one, of course, but there's not really a huge need for that: premature optimization, yadda yadda yadda

    unsigned char opt_getcolor(unsigned char clrchar) {
        switch (clrchar) {
            case 120: case 88: return 0;
            case 114: case 82: return 1;
            case 103: case 71: return 2;
            case 121: case 89: return 3;
            case 98:  case 66: return 4;
            case 112: case 80: return 5;
            case 99:  case 67: return 6;
            case 119: case 87: return 7;
            default: return (unsigned char) -1;

How well do the above four functions perform? The code is at and the results on my machine below:

                          Rate Pure Perl version Opt Perl version Inline::C version Opt Inline::C version
Pure Perl version      34262/s                --             -65%              -90%                  -91%
Opt Perl version       98814/s              188%               --              -71%                  -75%
Inline::C version     337079/s              884%             241%                --                  -14%
Opt Inline::C version 391645/s             1043%             296%               16%                    --

The Inline::C version gave me a ~10x better performance than the original Perl one, and it made the function go away from the top of the list of subroutines in which the most time was spent. Now I'm left with POE internals occupying the topmost bits, hence my starting to research the Async territory.

Unrolling the loop on the "optimized Perl version" gave it a x3 performance boost. What other trickery would you have performed to make that Perl subroutine faster?

If you're from or come to our meetings, you may want to bring a lightning talk answering the above question ;)
Or, you could leave a comment below

Thanks for reading,

Updated benchmarks at same link (thanks Illusori!): using a hash lookup is certainly faster, and the array lookup is blazing fast:

                  Rate Pure Perl Opt Perl Hash Perl Lookup Perl Inline::C Opt Inline::C
Pure Perl      35294/s        --     -67%      -76%        -80%      -90%          -91%
Opt Perl      106157/s      201%       --      -27%        -39%      -71%          -74%
Hash Perl     146056/s      314%      38%        --        -16%      -60%          -65%
Lookup Perl   174216/s      394%      64%       19%          --      -52%          -58%
Inline::C     363196/s      929%     242%      149%        108%        --          -12%
Opt Inline::C 412088/s     1068%     288%      182%        137%       13%            --

The new array lookup version's performance means the basic Inline::C version is now only 2x as fast as it; Are there any other optimizations that could be done in Perlish land?


I'd always use a hash lookup before looping through an array and comparing values...

What's wrong with doing $colours{ $clrchar }?

And if that's not fast enough for you build the lookup table as an array and do $colours[ $clrchar - 65 ] instead.

My take on this is to use tr. Ends up being nearly as fast as the hash, but not as fast as the array lookup. And that's with a temp variable.

sub tr_perl_getcolor {
my( $char ) = @_;
$char =~ tr/xrgybpcwXRGYBPCW/0123456701234567/
or return 255;
return $char;

Well you don't need to use undefined sections of the array at all really, you can prepopulate the unused parts of the array with the value you want returned.

I don't think there's anything much faster you can do than a single array lookup and return, except either unrolling it from a subroutine so the code is inline where you need it, or making sure you use $_[ 0 ] rather than creating a variable with my.

I'd also consider ditching the whole unpack() algorithm and looking at just using regexps to do the colour replacement, is there a reason for not doing that?

I'd go even a step further on the pure-perl route. Lose the state machine, lose "ord", lose the conditional. Build a table of codes and what they generate, and put it in a hash. Do

    my $codes_regex = join '|', map quotemeta, keys %codes_hash;

so far this is all precomputation. Then boil the entire ansify function down to

    (my $output = $input) =~ s/($codes_regex)/$codes_hash{$1}/g;
    return $output;

and let the regex trie-izer handle optimization for you. Unless there's more complexity to the codes than what you've shown, the job is done, and I bet it's lightning fast. :)

> I'm still considering using Async but lack good examples :(

Do you mean as in IO::Async? Perhaps you could suggest some more examples you'd like to see?

Leave a comment

About Marco Fontani

user-pic Slicing and splicing onions