Actions from mauke

Posted Automated testing on Windows with AppVeyor to mauke

2017-10-21T22:52:20Z

AppVeyor is a continuous integration service similar to Travis CI, just on Windows. If you have a Perl module on GitHub, it's not that hard to have it run tests automatically on Windows; it's just not well documented.

(The following information was taken from https://blogs.perl.org/users/eserte/2016/04/testing-with-appveyor.html, the AppVeyor documentation, and random trial and error.)

First you need to sign in to AppVeyor with your GitHub account and let it access your repositories, as described on https://www.appveyor.com/docs/.

Then you need to add a .appveyor.yml file to your repository. Mine looks like this:

cache:
  - C:\strawberry

install:
  - if not exist "C:\strawberry" choco install strawberryperl -y
  - set PATH=C:\strawberry\c\bin;C:\strawberry\perl\site\bin;C:\strawberry\perl\bin;%PATH%
  - cd %APPVEYOR_BUILD_FOLDER%
  - cpanm --quiet --installdeps --with-develop --notest .

build_script:
  - perl Makefile.PL
  - gmake

test_script:
  - gmake test

The cache part tells AppVeyor to save the contents of C:\strawberry after every successful build and to restore C:\strawberry (if available) before starting a fresh build. See https://www.appveyor.com/docs/build-cache/.

The install script checks for the existence of C:\strawberry. If it's not there, Chocolatey (a Windows package manager) is used to install the Strawberry Perl package (currently using Strawberry Perl 5.26.1.1). Then the relevant program directories are added to the PATH.

The next commands switch to the build directory and install any module dependencies (I run author tests on AppVeyor, so I include developer dependencies).

The build_script and test_script parts are just the usual perl Makefile.PL && make && make test steps. Strawberry Perl comes with GNU make now, so we can use gmake instead of the older dmake.

And that's it. My module (including development branches and pull requests) is now automatically tested on Windows.

This post is not directly related to core perl, but AppVeyor was discussed at the 2017 Perl 5 Hackathon and this is what got me to take a closer look at the system (and write up the results). We had a good time.

Sponsors for the Perl 5 Hackathon 2017

This conference could not have happened without the generous contributions from the following companies:

Booking.com, cPanel, craigslist, bluehost, Assurant, Grant Street Group

Posted /Fizz|Buzz/ to mauke

2017-08-05T14:15:38Z

use v5.12.0;
use warnings;

s/\A(?:[0369]|[147][0369]*(?:[147][0369]*[258][0369]*)*(?:[147][0369]*[147]|[258])|[258][0369]*(?:[258][0369]*[147][0369]*)*(?:[258][0369]*[258]|[147]))*(?:0|[147][0369]*(?:[147][0369]*[258][0369]*)*5|[258][0369]*(?:[258][0369]*[147][0369]*)*[258][0369]*5)\z/Fizzbuzz/,
s/\A(?:[0369]|[147][0369]*(?:[147][0369]*[258][0369]*)*(?:[147][0369]*[147]|[258])|[258][0369]*(?:[258][0369]*[147][0369]*)*(?:[258][0369]*[258]|[147]))+\z/Fizz/,
s/\A[0-9]*[05]\z/Buzz/,
say
for 1 .. 100
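
A quick sanity check, as a sketch: the second regex appears to encode a digit-by-digit state machine for remainders modulo 3 ([0369], [147] and [258] being the digits with remainder 0, 1 and 2), so it should agree with the % operator. The variable names and the tested range are my own choices for illustration.

```perl
use strict;
use warnings;

# The "Fizz" pattern from the program above, copied verbatim.
my $div3 = qr/\A(?:[0369]|[147][0369]*(?:[147][0369]*[258][0369]*)*(?:[147][0369]*[147]|[258])|[258][0369]*(?:[258][0369]*[147][0369]*)*(?:[258][0369]*[258]|[147]))+\z/;

# Collect every number where the regex and plain arithmetic disagree.
my @mismatches = grep {
    my $by_regex = ($_ =~ $div3) ? 1 : 0;
    my $by_math  = ($_ % 3 == 0) ? 1 : 0;
    $by_regex != $by_math;
} 1 .. 3000;

print @mismatches ? "mismatches: @mismatches\n" : "regex agrees with % 3\n";
```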

Posted Converting glob patterns to efficient regexes in Perl and JavaScript to mauke

2017-05-12T19:43:42Z

In "Glob Matching Can Be Simple And Fast Too", Russ Cox benchmarked several glob implementations by matching the string a¹⁰⁰ against the pattern (a*)ⁿb. (Here exponentiation refers to string repetition, as in 'a' x 100 matched against ('a*' x $n) . 'b' in Perl syntax.) What he found was that some implementations returned results instantly whereas others got extremely slow as soon as n grew past 5, taking seconds, minutes and even hours to finish.

This problem is caused by excessive backtracking. It's possible to implement globbing without nested backtracking (and the linked article explains a simple algorithm to do so), but a naive recursive implementation will suffer from this issue. It affects some shells, FTP servers, and programming languages, including Perl: File::Glob uses code from BSD libc (which is affected, unlike glibc). A patch was written and the next File::Glob release will include a fixed algorithm.

But what really caught my eye was the way Python implements glob matching: It translates each glob pattern to a regex, then simply invokes the regex engine. This approach also suffers from exponential backtracking because Python's regex engine uses an exponential-time algorithm. As it turns out I do something very similar in some of my JavaScript code to support user-specified wildcards.

The question is: Does converting the pattern to a regex suffer from the same problem in JavaScript and Perl, and can we avoid excessive backtracking somehow?

At first blush it seems like Perl is not affected:

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}b\z/s'

returns instantly. But this is because a specific regex optimization kicks in: Perl first checks whether the fixed substring 'b' appears in the target string, and because it doesn't, the match fails immediately. We can defeat this optimization by switching to:

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}[bc]\z/s'

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}b.*?a\z/s'

(Warning: This takes 1½ minutes to finish on my system; increasing the 6 to 7 or 8 is not recommended.)

So this approach seems vulnerable:

#!/usr/bin/env perl
use strict;
use warnings;

sub glob2re {
    my ($pat) = @_;
    $pat =~ s{(\W)}{
        $1 eq '?' ? '.' :
        $1 eq '*' ? '.*?' :
        '\\' . $1
    }eg;
    return qr/\A$pat\z/s;
}

('a' x 100) =~ glob2re(('a*' x 7) . 'b*a');

The above code indeed takes a long time to finish.
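
To see what the naive conversion actually generates, here is a small sketch that reuses the same glob2re on a few harmless inputs. The sample patterns and strings are made up for illustration.

```perl
use strict;
use warnings;

# Same naive conversion as above.
sub glob2re {
    my ($pat) = @_;
    $pat =~ s{(\W)}{
        $1 eq '?' ? '.' :
        $1 eq '*' ? '.*?' :
        '\\' . $1
    }eg;
    return qr/\A$pat\z/s;
}

# Hypothetical sample data: [glob pattern, target string] pairs.
my @cases = (
    ['*.txt', 'notes.txt'],
    ['*.txt', 'notes_txt'],
    ['a?c',   'abc'],
    ['a?c',   'abbc'],
);

for my $case (@cases) {
    my ($pat, $str) = @$case;
    my $re = glob2re($pat);
    # Print the glob, the target, the stringified regex, and the verdict.
    printf "%-6s %-10s %-24s %s\n", $pat, $str, $re,
        $str =~ $re ? 'match' : 'no match';
}
```

The dot in *.txt is escaped, so notes_txt is rejected, and ? matches exactly one character.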

What about JavaScript? Here's the equivalent code:

function glob2re(pat) {
    pat = pat.replace(/\W/g, function (m0) {
        return (
            m0 === '?' ? '[\\s\\S]' :
            m0 === '*' ? '[\\s\\S]*?' :
            '\\' + m0
        );
    });
    return new RegExp('^' + pat + '$');
}

glob2re('a*'.repeat(7) + 'b').test('a'.repeat(100));

(Note the extra suckiness: JavaScript has no /s flag so you need something like [\s\S] to match any character.)

I've only tried this in Firefox but it also takes a long time to finish.

Clearly the naive approach does not work well. Can we fix it?

The algorithm presented by Russ Cox works by limiting the stack of saved backtracking states to at most 1 entry (yeah, that's not much of a "stack" anymore). As soon as a new instance of * starts being processed, all previous backtracking information is forgotten.

In Perl we can get the same effect by using the (*PRUNE) control verb:

#!/usr/bin/env perl
use strict;
use warnings;

sub glob2re {
    my ($pat) = @_;
    $pat =~ s{(\W)}{
        $1 eq '?' ? '.' :
        $1 eq '*' ? '(*PRUNE).*?' :
        '\\' . $1
    }eg;
    return qr/\A$pat\z/s;
}

('a' x 100) =~ glob2re(('a*' x 70) . 'b*a');

With this tiny change (adding (*PRUNE) to the generated regex), even 70 wildcards in a single pattern pose no problem: The program finishes instantly.
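
Speed is only useful if the answers stay the same, so here is a rough brute-force check that the pruned regexes accept exactly the same strings as the naive ones. This is a sketch, not a proof: make_glob2re, the pattern list, and the choice of all strings over {a, b} up to length 6 are my own assumptions.

```perl
use strict;
use warnings;

# Build a converter; $star is the regex fragment substituted for '*'.
sub make_glob2re {
    my ($star) = @_;
    return sub {
        my ($pat) = @_;
        $pat =~ s{(\W)}{
            $1 eq '?' ? '.' :
            $1 eq '*' ? $star :
            '\\' . $1
        }eg;
        return qr/\A$pat\z/s;
    };
}

my $naive  = make_glob2re('.*?');
my $pruned = make_glob2re('(*PRUNE).*?');

# All strings over {a, b} of length 0 .. 6.
my @strings = ('');
my @prev    = ('');
for (1 .. 6) {
    @prev = map { ($_ . 'a', $_ . 'b') } @prev;
    push @strings, @prev;
}

# Hypothetical test patterns, chosen to contain several wildcards each.
my @patterns = ('a*b', '*a*b*', 'a*?b', '*ab*ba*', '?*a*?', 'a*a*a*b');

my $mismatches = 0;
for my $pat (@patterns) {
    my ($re_naive, $re_pruned) = ($naive->($pat), $pruned->($pat));
    for my $str (@strings) {
        my $n = $str =~ $re_naive  ? 1 : 0;
        my $p = $str =~ $re_pruned ? 1 : 0;
        $mismatches++ if $n != $p;
    }
}
print "mismatches: $mismatches\n";
```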

Again, what about JavaScript? Here the situation is a bit more complicated because JavaScript doesn't support control verbs. Normally this wouldn't be much of a problem because we could just turn foo*bar*baz into /foo(?>.*?bar)(?>.*?baz)/ instead, using (?>...) (an independent subexpression) to limit backtracking. Unfortunately JavaScript doesn't support (?>...) either. But wait! We can combine capturing groups, positive look-ahead, and backreferences to simulate (?>foo): Just use (?=(foo))\1 instead. Well, what we really want is a backreference to the last thing we captured. Perl again makes this easy with a relative backreference (\g{-1}) but in JavaScript we're forced to use absolute numbering instead. We can still do it because we control the whole regex, we just have to do a bit more manual work:

function glob2re(pat) {
    function tr(pat) {
        return pat.replace(/\W/g, function (m0) {
            return (
                m0 === '?' ? '[\\s\\S]' :
                '\\' + m0
            );
        });
    }

    var n = 1;
    pat = pat.replace(/\W[^*]*/g, function (m0, mp, ms) {
        if (m0.charAt(0) !== '*') {
            return tr(m0);
        }
        var eos = mp + m0.length === ms.length ? '$' : '';
        return '(?=([\\s\\S]*?' + tr(m0.substr(1)) + eos + '))\\' + n++;
    });
    return new RegExp('^' + pat + '$');
}

glob2re('a*'.repeat(70) + 'b').test('a'.repeat(100));

The code is starting to look a bit crazy and the regexes it generates are even worse, but it does work: Even 70 wildcards finish instantly.
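
The look-ahead trick is easier to see in isolation. Here is a minimal Perl sketch (Perl only because it supports all of the variants side by side); the target string 'aaab' is an arbitrary example of mine.

```perl
use strict;
use warnings;

my $str = 'aaab';

# Plain a+ can give back one "a", so the trailing "ab" still matches.
my $plain    = $str =~ /\Aa+ab\z/             ? 1 : 0;
# An independent subexpression keeps all three "a"s, so the match fails.
my $atomic   = $str =~ /\A(?>a+)ab\z/         ? 1 : 0;
# Look-ahead plus backreference: behaves like (?>a+).
my $emulated = $str =~ /\A(?=(a+))\1ab\z/     ? 1 : 0;
# The same emulation with a relative backreference.
my $relative = $str =~ /\A(?=(a+))\g{-1}ab\z/ ? 1 : 0;

print "plain=$plain atomic=$atomic emulated=$emulated relative=$relative\n";
# prints "plain=1 atomic=0 emulated=0 relative=0"
```

Once the look-ahead has succeeded, the engine does not retry it with a shorter capture, which is what makes the backreference behave atomically.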

Conclusion: Yes, converting glob patterns to efficient regexes is possible. It's even trivial in Perl. In JavaScript we have to jump through some annoying hoops but in the end we still get a regex that does what we want.

Commented on Cool Perl 6 features available in Perl 5 in mauke

2016-07-01T21:43:25Z

The problems with @{[ ]} are that it looks quite clunky, it's inefficient at runtime (it builds and dereferences an arrayref), it evaluates its contents in list context, and it gets really confusing if you need (nested) quotes in the interpolated part.
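
The list context part is easy to demonstrate. A minimal sketch, with a made-up helper called context:

```perl
use strict;
use warnings;

# Reports the context it was called in.
sub context { return wantarray ? 'list' : 'scalar' }

# Inside @{[ ]} the call happens in list context ...
my $interpolated = "context: @{[ context() ]}";
# ... whereas string concatenation imposes scalar context.
my $concatenated = 'context: ' . context();

print "$interpolated\n";   # context: list
print "$concatenated\n";   # context: scalar
```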

Posted Cool Perl 6 features available in Perl 5 to mauke

2016-06-20T19:28:10Z

Today I saw Damian Conway giving a talk at YAPC::NA 2016 (also known as The Perl Conference now :-). He was talking about some cool Perl 6 features, and I realized that some of them are available right now in Perl 5.

Parameter lists / "signatures"

Instead of manually unpacking @_, you can just write sub foo($bar, $baz) { ... } to define a function with two arguments.

This feature is available in core perl5 since version 20 (the syntax changed slightly and it produces better error messages since version 22). It's still experimental in version 24 (and produces corresponding warnings when enabled).
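
For comparison, a minimal sketch of the core feature, assuming perl 5.20 or later; the function body and the arguments are placeholders of mine:

```perl
use v5.20.0;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

sub foo($bar, $baz) {
    return "$bar-$baz";
}

say foo('a', 'b');

# Calling with the wrong number of arguments dies at run time.
my $ok  = eval { foo('a'); 1 };
my $err = $@;
say $ok ? 'no error' : "error: $err";
```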

However, the CPAN module Function::Parameters adds full support for parameter lists to every perl since 5.14 (albeit with new keywords (fun and method) instead of sub). It's available right now and not experimental:
```
use Function::Parameters qw(:strict);
fun foo($bar, $baz) {
    ...
}
```
Keyword arguments / named parameters

By defining your subroutine as sub foo(:$state, :$head, :$halt) {}, you can call it as
```
foo(
    head  => 0,
    state => 'A',
    halt  => 'Z',
);
```
or
```
foo(
    halt  => 'Z',
    state => 'A',
    head  => 0,
);
```
or any argument order you like. You no longer have to remember the position of each argument, which is great, especially if your function takes more than 3 arguments (or you haven't touched the code in a month or three).

This is also available in Function::Parameters from perl 5.14 onwards:
```
use Function::Parameters qw(:strict);
fun foo(:$state, :$head, :$halt) {
}
```
Interpolating blocks in strings

Perl 5 lets you interpolate variables in double-quoted strings, which can be very convenient:
```
say "$greeting, visitor! Would you like some $beverage?";
```
However, this is limited to variables (scalars and arrays/array slices). There's no way to directly interpolate, say, method or function calls. That's why Perl 6 lets you interpolate arbitrary code in strings by using { blocks }:
```
say "2 + 2 = {2 + 2}";  # "2 + 2 = 4"
```
This feature is available in Quote::Code on CPAN for all perls since 5.14:
```
use Quote::Code;
say qc"2 + 2 = {2 + 2}";
```
Funny Unicode variable names

One of the examples (Pollard ρ-factorization) uses $ρ (that's a Greek lowercase rho) as a variable name because Perl 6 supports Unicode in programs by default.

Perl 5 doesn't. By default, that is. But after a use utf8; declaration, you can put arbitrary Unicode text in your string literals, regexes, etc. And it works for variables, too:
```
# this works on any perl version >= 5.8
use utf8;
my $ρ = 42;
print $ρ, "\n";
```

Of course there were many, many other things that are not so easily ported from Perl 6. But I think it's nice how much Just Works (or can be made to work with minimal effort) in existing Perl 5 code.

Commented on Newbie Poison in Jonathan W. Taylor

2016-06-20T20:59:21Z

This whole post looks unpleasant.

> Many won't even concede that the behavior I called out was unpleasant in the first place.

This sounds like "Many people disagree with me, how dare they" to me. (I'm one of them.)

- "Condescending, abusive advice"
- "So come on in newbies. You're stupid. You fill us with contempt. Get ready to complain. And most of all, don't waste our precious time."
- "antique, pompous, and arrogant elitists who can't even treat each other with kindness and respect"

Isn't all of this condescending, abusive advice itself?

If you don't have anything nice to say, well:

> Silence is better than any response that isn't nice, honest, and helpful.

Commented on I think subroutine signatures don't need arguments count checking in Yuki Kimoto's Perl Blog

2016-04-17T18:44:34Z

I think argument checks are a good thing because correctness is more important than speed (who cares how fast you are at getting the wrong result?).

Checking the arguments manually is annoying boilerplate code that no user wants to bother with. It's much better to be able to abstract it away and have the language do it for you automatically.

That said, Function::Parameters supports both: use Function::Parameters qw(:strict); enables checks, but use Function::Parameters qw(:lax) doesn't, just like my ($x, $y, $z) = @_;.

Posted Perl curio: For loops and statement modifiers to mauke

2016-03-05T11:41:35Z

Perl has a "statement modifier" form of most control structures:

EXPR if EXPR;
EXPR unless EXPR;
EXPR while EXPR;
EXPR until EXPR;
EXPR for EXPR;

Perl also has a C-style for loop:

for (INIT; COND; STEP) {
    ...
}

The curious part: COND is a normal expression, but STEP allows statement modifiers. That is, you can write:

for (my $i = 0; $i < 10; $i++ if rand() < 0.5) {
    print "$i\n";
}
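
Since rand() makes the output different on every run, here is a deterministic variant as a sketch. The $tick counter is my own addition; the loop only advances $i on every second pass through STEP.

```perl
use strict;
use warnings;

my @seen;
my $tick = 0;

# STEP carries a statement modifier: $i is incremented only when
# the pre-incremented $tick is even.
for (my $i = 0; $i < 3; $i++ if ++$tick % 2 == 0) {
    push @seen, $i;
}

print "@seen\n";   # 0 0 1 1 2 2
```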

Posted Perl curio: Dereferencing blocks to mauke

2016-03-05T11:05:41Z

We're all familiar with references and Use Rule 1:

You can always use an array reference, in curly braces, in place of the name of an array.

This leads to code like ${$foo} (dereference a scalar reference) or @{$bar{baz}} (dereference an array reference stored in a hash).

The curious part: The curly braces actually form a block, i.e. you can put multiple statements in there (just like do BLOCK), as long as the last one returns a reference:

% perl -E 'use strict; use warnings; ${say "hi"; \$_} = 42; say $_'
hi
42

This block also gets its own scope:

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42; say $x'
Global symbol "$x" requires explicit package name at -e line 1.
Execution of -e aborted due to compilation errors.

$x isn't visible outside the ${ ... } block it was declared in.

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42;'
hi
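
The same thing works for array dereferences in a script. A small sketch (the variable names are invented for the example) that runs two statements inside @{ ... }:

```perl
use strict;
use warnings;

my (@odds, @evens);
my $calls = 0;

for my $n (1 .. 6) {
    # Two statements inside @{ ... }: count the call, then yield
    # a reference to the array we want to push onto.
    push @{ $calls++; $n % 2 ? \@odds : \@evens }, $n;
}

print "odds: @odds; evens: @evens; blocks run: $calls\n";
```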

Commented on Converting glob patterns to regular expressions in mauke

2015-09-05T12:44:50Z

Text::Glob doesn't treat repeated * specially, no. But it supports many other features (wildcards don't match a leading dot, curlies, etc) plus it uses a rather C-like approach to converting the pattern (iterating over single characters, state machine) so it would take more than 10 seconds of looking at it to make any major changes to the algorithm.

Commented on Converting glob patterns to regular expressions in mauke

2015-09-05T12:36:46Z

Eh, it's generated code. I don't care much about how readable it is. :-)

I don't think your version handles the last case with backslash escapes.

Commented on Converting glob patterns to regular expressions in mauke

2015-09-05T12:34:41Z

I'm not doing full-blown shell filename expansion, so I simply don't support [...] here.

Commented on Converting glob patterns to regular expressions in mauke

2015-09-05T12:33:28Z

The original inspiration for this came from matching general text, not file names. That's why / isn't treated specially.

(In one place I use this for matching some HTTP header values, in which case . is actually [!#$%&'*+\-.^`|~\w].)

Posted Converting glob patterns to regular expressions to mauke

2015-08-14T11:42:38Z

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.
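
A minimal sketch of how this goes wrong, with the two substitutions wrapped in a hypothetical helper and a made-up file name:

```perl
use strict;
use warnings;

# The naive conversion from above, anchored and compiled.
sub naive_glob2re {
    my ($pat) = @_;
    $pat =~ s/\?/./g;   # ? -> .
    $pat =~ s/\*/.*/g;  # * -> .*
    return qr/\A$pat\z/;
}

my $re = naive_glob2re('a+b.txt');

# "+" and "." keep their regex meaning, so the literal name is rejected
# while an unrelated string gets through.
print 'a+b.txt' =~ $re ? "match\n" : "no match\n";   # no match
print 'aabXtxt' =~ $re ? "match\n" : "no match\n";   # match
```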

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(n^k) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").
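
Here is the same substitution as a runnable sketch. The wrapper name glob2re_runs and the sample globs are my own; it returns the regex source as a string so the result is easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single substitution shown above.
sub glob2re_runs {
    my ($pat) = @_;
    $pat =~ s{
        ( [?*]+ )  # a run of ? or * characters
    |
        (\W)       # any other non-word character
    }{
        defined $1
            ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
            : '\\' . $2
    }xeg;
    return $pat;
}

for my $glob ('*?*?*', 'a***b', 'a??b', 'a.b') {
    printf "%-6s -> %s\n", $glob, glob2re_runs($glob);
}
```

The four sample globs come out as .{2,}, a.{0,}b, a.{2}b and a\.b respectively.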

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;
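
A short sketch of the escaping variant in action. Again the wrapper name (glob2re_escaped) and the inputs are assumptions of mine, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution with backslash escapes.
sub glob2re_escaped {
    my ($pat) = @_;
    $pat =~ s{
        ( [?*]+ )  # a run of ? or * characters
    |
        \\ (.)     # backslash escape
    |
        (\W)       # any other non-word character
    }{
        defined $1
            ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
            : quotemeta $+
    }xeg;
    return $pat;
}

# In single quotes, \* stays a backslash followed by a star.
my $literal  = glob2re_escaped('a\*b');   # star matched literally
my $wildcard = glob2re_escaped('a*b');    # star is a wildcard

for my $str ('a*b', 'aXYb') {
    printf "%-5s literal: %-3s wildcard: %s\n", $str,
        ($str =~ /\A$literal\z/  ? 'yes' : 'no'),
        ($str =~ /\A$wildcard\z/ ? 'yes' : 'no');
}
```

Only the unescaped star lets aXYb through; the escaped one matches nothing but a literal a*b.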

Posted Fun with logical expressions 2: Electric boogaloo to mauke

2015-04-14T18:15:16Z

tybalt89 discovered a bug in Fun with logical expressions:

$ echo IEabEba | perl try.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
Not a tautology

(a equals b) implies (b equals a) is a tautology but the program fails to recognize it. With the fixed code below it generates this output instead:

$ echo IEabEba | perl taut.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
...> V~Vba;~V~V~a~b;~Vab;;~V~b~a;;
...> V~Vab;~V~Vab;~V~a~b;;~V~a~b;;
...> V~Vab;~V0~V~a~b;;~V~a~b;;
...> V~Vab;~V~V~a~b;;~V~a~b;;
...> V~Vab;~~V~a~b;~V~a~b;;
...> V~Vab;V~a~b;~V~a~b;;
...> V~Vab;~a~b~V~a~b;;
...> V~Vab;~a~b~V0~b;;
...> V~Vab;~a~b~V~b;;
...> V~Vab;~a~b~~b;
...> V~Vab;~a~bb;
...> V~V1b;~a~bb;
...> V~1~a~bb;
...> V0~a~bb;
...> V~a~bb;
...> V~a~b1;
...> 1
Tautology

The following changes were made to the code:

1. The "regex libraries" $rawlib and $modlib and their named subpatterns (?&rawexpr) and (?&exp) are gone. They were replaced by $rawexpr and $exp, subregexes that directly match and capture a simplified and modified expression, respectively.

   This change was made in order to make it possible to split a string into subexpressions using my @expr = $str =~ /$exp/g (i.e. m//g in list context).
2. Two regexes were simplified by using \K.
3. The old "final rule" did some duplicate work: It used index($1, $3) < 0 && index($4, $3) < 0 inside the regex to search for the subexpression $3 in both $1 and $4. If successful, it repeated the search in the replacement part: s{ \Q$x\E }{$spec}g for $pre, $post;

   The new version uses s/// directly in the regex and checks the return value to see if any match was found/replaced. The replacement string is assembled directly in the regex and saved in an outer lexical variable to make it available in the right-hand side.
4. A new rule was added. If the other rules get stuck, it reorders V operands. The canonical order chosen is simply the default sort behavior: lexicographically ascending strings.

Due to changes #3 and #4, the code no longer works on old perls (before v5.18) because in v5.18 the implementation of embedded code blocks in regexes was rewritten, fixing many bugs.
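
To illustrate the splitting and the canonical ordering in isolation, here is a small sketch. It uses the same $exp subregex as the program below, but with an explicit /x flag instead of use re; the operand string is an invented example.

```perl
use v5.10.0;
use strict;
use warnings;

# One modified expression, captured as a whole; (?-1) recurses into
# the enclosing group.
my $exp = qr{
    (
        (?>
            [01a-z]
        |
            ~ (?-1)
        |
            V (?-1)*+ ;
        )
    )
}x;

# Hypothetical operand list of some V ... ; expression.
my $operands = '~Vab;cV~a~b;';

# m//g in list context returns each top-level subexpression in turn.
my @expr = $operands =~ m{ $exp }xg;
say join ' | ', @expr;       # ~Vab; | c | V~a~b;

# Canonical order used by the new rule: plain string sort.
say join '', sort @expr;     # V~a~b;c~Vab;
```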

The new code:

#!/usr/bin/env perl
use v5.18.0;
use warnings;
use re '/xms';

my $rawexpr = qr{
    (
        (?>
            [a-z]
        |
            N (?-1)
        |
            [CDIE] (?-1) (?-1)
        )
    )
};

my $exp = qr{
    (
        (?>
            [01a-z]
        |
            ~ (?-1)
        |
            V (?-1)*+ ;
        )
    )
};

while (readline) {
    chomp;
    say "ORIG: $_";

    1 while s{ E $rawexpr $rawexpr }{DC$1$2CN$1N$2}g;
    1 while s{ I $rawexpr $rawexpr }{DN$1$2}g;
    1 while s{ C $rawexpr $rawexpr }{NDN$1N$2}g;
    1 while s{ D $rawexpr $rawexpr }{V$1$2;}g;
    tr/N/~/;

    say "MOD: $_";

    say "...> $_" while 0
        || s{ ~ ~ }{}g
        || s{ ~ 0 }{1}g
        || s{ ~ 1 }{0}g
        || s{ V ; }{0}g
        || s{ V $exp ; }{$1}g
        || s{ V $exp* \K V ($exp*+) ; }{$2}g
        || s{ V $exp* \K 0 }{}g
        || s{ V $exp* 1 $exp*+ ; }{1}g
        || do {
            my $repl;
            s{
                V ($exp*?) (~??) $exp ($exp*+) ;
                (?(?{
                    my ($pre, $neg, $x, $post) = ($1, $3, $4, $5);
                    my $spec = $neg ? '1' : '0';
                    my $n = 0;
                    $n += s{ \Q$x\E }{$spec}g for $pre, $post;
                    $repl = "V$pre$neg$x$post;";
                    !$n
                }) (*FAIL) )
            }{$repl}g
        }
        || do {
            my $canon;
            s{
                V ($exp {2,}+) ;
                (?(?{
                    my $orig = $1;
                    $canon = join '', sort $orig =~ m{ $exp }g;
                    $orig eq $canon
                }) (*FAIL) )
            }{V$canon;}g
        }
    ;

    say $_ eq '1' ? "Tautology" : "Not a tautology", "\n";
}

__END__