Perl curio: For loops and statement modifiers

Perl has a "statement modifier" form of most control structures:

EXPR if EXPR;
EXPR unless EXPR;
EXPR while EXPR;
EXPR until EXPR;
EXPR for EXPR;

Perl also has a C-style for loop:

for (INIT; COND; STEP) {
    ...
}

The curious part: COND is a normal expression, but STEP allows statement modifiers. That is, you can write:

for (my $i = 0; $i < 10; $i++ if rand() < 0.5) {
    print "$i\n";
}

Perl curio: Dereferencing blocks

We're all familiar with references and Use Rule 1:

You can always use an array reference, in curly braces, in place of the name of an array.

This leads to code like ${$foo} (dereference a scalar reference) or @{$bar{baz}} (dereference an array reference stored in a hash).

The curious part: The curly braces actually form a block, i.e. you can put multiple statements in there (just like do BLOCK), as long as the last one returns a reference:

% perl -E 'use strict; use warnings; ${say "hi"; \$_} = 42; say $_'
hi
42

This block also gets its own scope:

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42; say $x'
Global symbol "$x" requires explicit package name at -e line 1.
Execution of -e aborted due to compilation errors.

$x isn't visible outside the ${ ... } block it was declared in.

% perl -E 'use strict; use warnings; ${my $x = "hi"; say $x; \$x} = 42;'
hi

Converting glob patterns to regular expressions

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;

Fun with logical expressions 2: Electric boogaloo

tybalt89 discovered a bug in Fun with logical expressions:

$ echo IEabEba | perl try.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
Not a tautology

(a equals b) implies (b equals a) is a tautology but the program fails to recognize it. With the fixed code below it generates this output instead:

$ echo IEabEba | perl taut.pl
ORIG: IEabEba
MOD: V~V~V~a~b;~V~~a~~b;;V~V~b~a;~V~~b~~a;;;
...> V~V~V~a~b;~Vab;;V~V~b~a;~Vba;;;
...> V~V~V~a~b;~Vab;;~V~b~a;~Vba;;
...> V~Vba;~V~V~a~b;~Vab;;~V~b~a;;
...> V~Vab;~V~Vab;~V~a~b;;~V~a~b;;
...> V~Vab;~V0~V~a~b;;~V~a~b;;
...> V~Vab;~V~V~a~b;;~V~a~b;;
...> V~Vab;~~V~a~b;~V~a~b;;
...> V~Vab;V~a~b;~V~a~b;;
...> V~Vab;~a~b~V~a~b;;
...> V~Vab;~a~b~V0~b;;
...> V~Vab;~a~b~V~b;;
...> V~Vab;~a~b~~b;
...> V~Vab;~a~bb;
...> V~V1b;~a~bb;
...> V~1~a~bb;
...> V0~a~bb;
...> V~a~bb;
...> V~a~b1;
...> 1
Tautology

The following changes were made to the code:

  1. The "regex libraries" $rawlib and $modlib and their named subpatterns (?&rawexpr) and (?&exp) are gone. They were replaced by $rawexpr and $exp, subregexes that directly match and capture a simplifed and modified expression, respectively.

    This change was made in order to make it possible to split a string into subexpressions using my @expr = $str =~ /$exp/g (i.e. m//g in list context).

  2. Two regexes were simplified by using \K.

  3. The old "final rule" did some duplicate work: It used index($1, $3) < 0 && index($4, $3) < 0 inside the regex to search for the subexpression $3 in both $1 and $4. If successful, it repeated the search in the replacement part: s{ \Q$x\E }{$spec}g for $pre, $post;

    The new version uses s/// directly in the regex and checks the return value to see if any match was found/replaced. The replacement string is assembled directly in the regex and saved in an outer lexical variable to make it available in the right-hand side.

  4. A new rule was added. If the other rules get stuck, it reorders V operands. The canonical order chosen is simply the default sort behavior: lexicographically ascending strings.

Due to change #3 and #4 the code no longer works on old perls (before v5.18) because in v5.18 the implementation of embedded code blocks in regexes was rewritten, fixing many bugs.

The new code:

#!/usr/bin/env perl
use v5.18.0;
use warnings;
use re '/xms';

my $rawexpr = qr{
    (
        (?>
            [a-z]
        |
            N (?-1)
        |
            [CDIE] (?-1) (?-1)
        )
    )
};

my $exp = qr{
    (
        (?>
            [01a-z]
        |
            ~ (?-1)
        |
            V (?-1)*+ ;
        )
    )
};

while (readline) {
    chomp;
    say "ORIG: $_";

    1 while s{ E $rawexpr $rawexpr }{DC$1$2CN$1N$2}g;
    1 while s{ I $rawexpr $rawexpr }{DN$1$2}g;
    1 while s{ C $rawexpr $rawexpr }{NDN$1N$2}g;
    1 while s{ D $rawexpr $rawexpr }{V$1$2;}g;
    tr/N/~/;

    say "MOD: $_";

    say "...> $_" while 0
        || s{ ~ ~ }{}g
        || s{ ~ 0 }{1}g
        || s{ ~ 1 }{0}g
        || s{ V ; }{0}g
        || s{ V $exp ; }{$1}g
        || s{ V $exp* \K V ($exp*+) ; }{$2}g
        || s{ V $exp* \K 0 }{}g
        || s{ V $exp* 1 $exp*+ ; }{1}g
        || do {
            my $repl;
            s{
                V ($exp*?) (~??) $exp ($exp*+) ;
                (?(?{
                    my ($pre, $neg, $x, $post) = ($1, $3, $4, $5);
                    my $spec = $neg ? '1' : '0';
                    my $n = 0;
                    $n += s{ \Q$x\E }{$spec}g for $pre, $post;
                    $repl = "V$pre$neg$x$post;";
                    !$n
                }) (*FAIL) )
            }{$repl}g
        }
        || do {
            my $canon;
            s{
                V ($exp {2,}+) ;
                (?(?{
                    my $orig = $1;
                    $canon = join '', sort $orig =~ m{ $exp }g;
                    $orig eq $canon
                }) (*FAIL) )
            }{V$canon;}g
        }
    ;

    say $_ eq '1' ? "Tautology" : "Not a tautology", "\n";
}

__END__

Functional fun with logical expressions

Here's a quickly thrown-together version of "Fun with logical expressions", written in Haskell:

{-# LANGUAGE LambdaCase #-}

module Main (main) where
import Control.Applicative (Applicative (..), (<$>))
import Control.Arrow       (first)
import Data.Ix             (inRange)
import Control.Monad       (guard, mplus, msum)
import Data.Maybe          (fromMaybe, isJust, listToMaybe)

data RawExpr
    = VarR Char
    | N RawExpr
    | D RawExpr RawExpr
    | C RawExpr RawExpr
    | I RawExpr RawExpr
    | E RawExpr RawExpr
    deriving (Eq, Ord, Read, Show)

newtype Parser a = Parser{ unParser :: String -> (a, String) }

instance Functor Parser where
    fmap f (Parser p) = Parser (first f . p)

instance Applicative Parser where
    pure x = Parser (\s -> (x, s))
    Parser f <*> Parser x = Parser (\s ->
        let (f', s')  = f s
            (x', s'') = x s'
        in (f' x', s''))

instance Monad Parser where
    return = pure
    Parser x >>= f = Parser (\s ->
        let (x', s') = x s
        in unParser (f x') s')

anyChar :: Parser Char
anyChar = Parser (\(c : cs) -> (c, cs))

parseRaw :: String -> RawExpr
parseRaw s = case unParser expr s of
    (x, "")   -> x
    (_, rest) -> error ("parseRaw: trailing garbage: " ++ show rest)
    where
    expr = anyChar >>= \case
        c | inRange ('a', 'z') c -> pure (VarR c)
        'N' -> N <$> expr
        'D' -> D <$> expr <*> expr
        'C' -> C <$> expr <*> expr
        'I' -> I <$> expr <*> expr
        'E' -> E <$> expr <*> expr

data ModExpr
    = VarM Char
    | T
    | F
    | Not ModExpr
    | Or [ModExpr]
    deriving (Eq, Ord, Read, Show)

desugar :: RawExpr -> ModExpr
desugar = \case
    VarR v -> VarM v
    N x    -> not (desugar x)
    D x y  -> or  (desugar x) (desugar y)
    C x y  -> and (desugar x) (desugar y)
    I x y  -> or (not (desugar x)) (desugar y)
    E x y  -> or (and x' y') (and (not x') (not y'))
        where
        x' = desugar x
        y' = desugar y
    where
    not     = Not
    or  x y = Or [x, y]
    and x y = not (or (not x) (not y))

render :: ModExpr -> String
render e = go e "" where
    go = \case
        VarM c -> (c :)
        T      -> ('1' :)
        F      -> ('0' :)
        Not x  -> ('~' :) . go x
        Or xs  -> ('V' :) . foldr (.) id (map go xs) . (';' :)

type Rule = ModExpr -> Maybe ModExpr

mkRule :: (Maybe ModExpr -> Maybe ModExpr) -> Rule
mkRule t = rule
    where
    rule e = t (Just e) `mplus` rec
        where
        rec = case e of
            Not x -> Not <$> rule x
            Or xs -> do
                let ys = map rule xs
                guard (any isJust ys)
                return (Or (zipWith fromMaybe xs ys))
            _ -> Nothing

not_not, not_true, not_false :: Rule
not_not = mkRule $ \e -> do
    Not (Not x) <- e
    pure x
not_true = mkRule $ \e -> do
    Not T <- e
    pure F
not_false = mkRule $ \e -> do
    Not F <- e
    pure T

or_true, or_false :: Rule
or_true = mkRule $ \e -> do
    Or xs <- e
    guard (T `elem` xs)
    pure T
or_false = mkRule $ \e -> do
    Or xs <- e
    let ys = filter (F /=) xs
    guard (length ys < length xs)
    pure $ Or ys

first3 :: (a -> r) -> (a, b, c) -> (r, b, c)
first3 f (x, y, z) = (f x, y, z)

second3 :: (b -> r) -> (a, b, c) -> (a, r, c)
second3 f (x, y, z) = (x, f y, z)

selectSplit :: [a] -> [([a], a, [a])]
selectSplit []       = []
selectSplit (x : xs) = ([], x, xs) : map (first3 (x :)) (selectSplit xs)

extract :: (a -> Maybe b) -> [a] -> Maybe ([a], b, [a])
extract f xs = msum (map (evert . second3 f) (selectSplit xs))
    where
    evert (x, Just y, z) = Just (x, y, z)
    evert _              = Nothing

or_none, or_one, or_assoc :: Rule
or_none = mkRule $ \e -> do
    Or [] <- e
    pure F
or_one = mkRule $ \e -> do
    Or [x] <- e
    pure x
or_assoc = mkRule $ \e -> do
    Or xs <- e
    (pre, ys, post) <- extractOr xs
    pure $ Or (pre ++ ys ++ post)
    where
    extractOr = extract (\case Or xs -> Just xs; _ -> Nothing)

isSubExpr :: ModExpr -> ModExpr -> Bool
isSubExpr e1 e2 =
    e1 == e2 || case e2 of
        Not x -> e1 `isSubExpr` x
        Or xs -> any (e1 `isSubExpr`) xs
        _ -> False

substitute :: ModExpr -> ModExpr -> ModExpr -> ModExpr
substitute x y = go
    where
    go = \case
        e | e == x -> y
        Not e      -> Not (go e)
        Or es      -> Or (map go es)
        e          -> e

or_spec :: Rule
or_spec = mkRule $ \e -> do
    Or xs <- e
    listToMaybe $ do
        (pre, x_orig, post) <- selectSplit xs
        (Not x, r) <- [(x_orig, T), (Not x_orig, F)]
        guard $ any (x `isSubExpr`) (pre ++ post)
        let z = Or (map (substitute x r) pre ++ [x_orig] ++ map (substitute x r) post)
        pure z

rules :: [Rule]
rules =
    [
        not_not, not_true, not_false,
        or_true, or_false,
        or_none, or_one, or_assoc,
        or_spec
    ]

simplify1 :: ModExpr -> Maybe ModExpr
simplify1 e = msum (map ($ e) rules)

simplify :: ModExpr -> (ModExpr, [String])
simplify e = case simplify1 e of
    Nothing -> (e, [])
    Just e' -> (render e' :) <$> simplify e'

process :: String -> String
process s = unlines $
    ("ORIG: " ++ s) :
    ("MOD: " ++ render e) :
    foldr (\x z -> ("...> " ++ x) : z) [v, ""] ts
    where
    e = desugar (parseRaw s)
    (r, ts) = simplify e
    v | r == T    = "Tautology"
      | otherwise = "Not a tautology"

main :: IO ()
main = interact (unlines . map process . lines)