Cool Perl 6 features available in Perl 5

Today I saw Damian Conway giving a talk at YAPC::NA 2016 (also known as The Perl Conference now :-). He was talking about some cool Perl 6 features, and I realized that some of them are available right now in Perl 5.

  • Parameter lists / "signatures"

    Instead of manually unpacking @_, you can just write sub foo($bar, $baz) { ... } to define a function with two arguments.

    This feature is available in core perl5 since version 20 (the syntax changed slightly and it produces better error messages since version 22). It's still experimental in version 24 (and produces corresponding warnings when enabled).

    However, the CPAN module Function::Parameters adds full support for parameter lists to every perl since 5.14 (albeit with a new keyword (fun and method) instead of sub). It's available right now and not experimental:

    use Function::Parameters qw(:strict);
    fun foo($bar, $baz) {
        ...
    }
    
  • Keyword arguments / named parameters

    By defining your subroutine as sub foo(:$state, :$head, :$halt) {}, you can call it as

    foo(
        head  => 0,
        state => 'A',
        halt  => 'Z',
    );
    

    or

    foo(
        halt  => 'Z',
        state => 'A',
        head  => 0,
    );
    

    or any argument order you like. You no longer have to remember the position of each argument, which is great, especially if your function takes more than 3 arguments (or you haven't touched the code in a month or three).

    This is also available in Function::Parameters from perl 5.14 onwards:

    use Function::Parameters qw(:strict);
    fun foo(:$state, :$head, :$halt) {
    }
    
  • Interpolating blocks in strings

    Perl 5 lets you interpolate variables in double-quoted strings, which can be very convenient:

    say "$greeting, visitor! Would you like some $beverage?";
    

    However, this is limited to variables (scalars and arrays/array slices). There's no way to directly interpolate, say, method or function calls. That's why Perl 6 lets you interpolate arbitrary code in strings by using { blocks }:

    say "2 + 2 = {2 + 2}";  # "2 + 2 = 4"
    

    This feature is available in Quote::Code on CPAN for all perls since 5.14:

    use Quote::Code;
    say qc"2 + 2 = {2 + 2}";
    
  • Funny Unicode variable names

    One of the examples (Pollard ρ-factorization) uses (that's a Greek lowercase rho) as a variable name because Perl 6 supports Unicode in programs by default.

    Perl 5 doesn't. By default, that is. But after a use utf8; declaration, you can put arbitrary Unicode text in your string literals, regexes, etc. And it works for variables, too:

    # this works on any perl version >= 5.8
    use utf8;
    my $ρ = 42;
    print $ρ, "\n";
    

Of course there were many, many other things that are not so easily ported from Perl 6. But I think it's nice how much Just Works (or can be made to work with minimal effort) in existing Perl 5 code.

Perl curio: For loops and statement modifiers

Perl has a "statement modifier" form of most control structures:

EXPR if EXPR;
EXPR unless EXPR;
EXPR while EXPR;
EXPR until EXPR;
EXPR for EXPR;

Perl also has a C-style for loop:

for (INIT; COND; STEP) {
    ...
}

The curious part: COND is a normal expression, but STEP allows statement modifiers. That is, you can write:

for (my $i = 0; $i < 10; $i++ if rand() < 0.5) {
    print "$i\n";
}

Perl curio: Dereferencing blocks

We're all familiar with references and Use Rule 1:

You can always use an array reference, in curly braces, in place of the name of an array.

This leads to code like ${$foo} (dereference a scalar reference) or @{$bar{baz}} (dereference an array reference stored in a hash).

The curious part: The curly braces actually form a block, i.e. you can put multiple statements in there (just like do BLOCK), as long as the last one returns a reference:

% perl -E 'use strict; use warnings; ${say "hi"; \$_} = 42; say $_'
hi
42

This block also gets its own scope:

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42; say $x'
Global symbol "$x" requires explicit package name at -e line 1.
Execution of -e aborted due to compilation errors.

$x isn't visible outside the ${ ... } block it was declared in.

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42;'
hi

Converting glob patterns to regular expressions

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;

Fun with logical expressions 2: Electric boogaloo

tybalt89 discovered a bug in Fun with logical expressions:

$ echo IEabEba | perl try.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
Not a tautology

(a equals b) implies (b equals a) is a tautology but the program fails to recognize it. With the fixed code below it generates this output instead:

$ echo IEabEba | perl taut.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
...> V~Vba;~V~V~a~b;~Vab;;~V~b~a;;
...> V~Vab;~V~Vab;~V~a~b;;~V~a~b;;
...> V~Vab;~V0~V~a~b;;~V~a~b;;
...> V~Vab;~V~V~a~b;;~V~a~b;;
...> V~Vab;~~V~a~b;~V~a~b;;
...> V~Vab;V~a~b;~V~a~b;;
...> V~Vab;~a~b~V~a~b;;
...> V~Vab;~a~b~V0~b;;
...> V~Vab;~a~b~V~b;;
...> V~Vab;~a~b~~b;
...> V~Vab;~a~bb;
...> V~V1b;~a~bb;
...> V~1~a~bb;
...> V0~a~bb;
...> V~a~bb;
...> V~a~b1;
...> 1
Tautology

The following changes were made to the code:

  1. The "regex libraries" $rawlib and $modlib and their named subpatterns (?&rawexpr) and (?&exp) are gone. They were replaced by $rawexpr and $exp, subregexes that directly match and capture a simplifed and modified expression, respectively.

    This change was made in order to make it possible to split a string into subexpressions using my @expr = $str =~ /$exp/g (i.e. m//g in list context).

  2. Two regexes were simplified by using \K.

  3. The old "final rule" did some duplicate work: It used index($1, $3) < 0 && index($4, $3) < 0 inside the regex to search for the subexpression $3 in both $1 and $4. If successful, it repeated the search in the replacement part: s{ \Q$x\E }{$spec}g for $pre, $post;

    The new version uses s/// directly in the regex and checks the return value to see if any match was found/replaced. The replacement string is assembled directly in the regex and saved in an outer lexical variable to make it available in the right-hand side.

  4. A new rule was added. If the other rules get stuck, it reorders V operands. The canonical order chosen is simply the default sort behavior: lexicographically ascending strings.

Due to change #3 and #4 the code no longer works on old perls (before v5.18) because in v5.18 the implementation of embedded code blocks in regexes was rewritten, fixing many bugs.

The new code:

#!/usr/bin/env perl
use v5.18.0;
use warnings;
use re '/xms';

my $rawexpr = qr{
    (
        (?>
            [a-z]
        |
            N (?-1)
        |
            [CDIE] (?-1) (?-1)
        )
    )
};

my $exp = qr{
    (
        (?>
            [01a-z]
        |
            ~ (?-1)
        |
            V (?-1)*+ ;
        )
    )
};

while (readline) {
    chomp;
    say "ORIG: $_";

    1 while s{ E $rawexpr $rawexpr }{DC$1$2CN$1N$2}g;
    1 while s{ I $rawexpr $rawexpr }{DN$1$2}g;
    1 while s{ C $rawexpr $rawexpr }{NDN$1N$2}g;
    1 while s{ D $rawexpr $rawexpr }{V$1$2;}g;
    tr/N/~/;

    say "MOD: $_";

    say "...> $_" while 0
        || s{ ~ ~ }{}g
        || s{ ~ 0 }{1}g
        || s{ ~ 1 }{0}g
        || s{ V ; }{0}g
        || s{ V $exp ; }{$1}g
        || s{ V $exp* \K V ($exp*+) ; }{$2}g
        || s{ V $exp* \K 0 }{}g
        || s{ V $exp* 1 $exp*+ ; }{1}g
        || do {
            my $repl;
            s{
                V ($exp*?) (~??) $exp ($exp*+) ;
                (?(?{
                    my ($pre, $neg, $x, $post) = ($1, $3, $4, $5);
                    my $spec = $neg ? '1' : '0';
                    my $n = 0;
                    $n += s{ \Q$x\E }{$spec}g for $pre, $post;
                    $repl = "V$pre$neg$x$post;";
                    !$n
                }) (*FAIL) )
            }{$repl}g
        }
        || do {
            my $canon;
            s{
                V ($exp {2,}+) ;
                (?(?{
                    my $orig = $1;
                    $canon = join '', sort $orig =~ m{ $exp }g;
                    $orig eq $canon
                }) (*FAIL) )
            }{V$canon;}g
        }
    ;

    say $_ eq '1' ? "Tautology" : "Not a tautology", "\n";
}

__END__