Wildcard Unicode Property Values

In addition to Native Variable-Length Lookbehind, Perl 5.29.9 (sic) includes another Regexp enhancement: wildcard Unicode property values. (And yes, this blog post sat around in draft form for over a month.)

Despite its name, the implementation is in terms of regular expressions rather than traditional wildcards. An example may be better than an explanation here: instead of writing /[\p{Script=Latin}\p{Script=Greek}]/, the new feature allows you to write /\p{Script=<\A(Latin|Greek)\z>}/. This is, according to perlunicode, a partial implementation of the Unicode Consortium's Wildcards in Property Values. Something like /\p{<Latin|Greek>}/ will not work, nor will /\p{Is_<Latin|Greek>}/; you must specify property name = ... to access this functionality.
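A minimal sketch of the two spellings side by side. It assumes Perl 5.30 or later and that the experimental warning category is named experimental::uniprop_wildcards; the helper compare_patterns is made up for illustration.

```perl
use strict;
use warnings;
# Assumption: Perl 5.30+; the feature is experimental, so silence its warning.
no warnings 'experimental::uniprop_wildcards';

# The traditional spelling: one \p{} per script inside a bracketed class.
my $classic  = qr/[\p{Script=Latin}\p{Script=Greek}]/;
# The wildcard spelling: one \p{} whose value is itself a regular expression.
my $wildcard = qr/\p{Script=<\A(Latin|Greek)\z>}/;

# Hypothetical helper: report how each pattern treats a single character.
sub compare_patterns {
    my ($char) = @_;
    return ( $char =~ $classic ? 1 : 0, $char =~ $wildcard ? 1 : 0 );
}

# LATIN SMALL LETTER A, GREEK SMALL LETTER ALPHA, CYRILLIC CAPITAL LETTER ZHE
for my $char ( "a", "\x{3B1}", "\x{416}" ) {
    my ( $old, $new ) = compare_patterns($char);
    printf "U+%04X classic=%d wildcard=%d\n", ord $char, $old, $new;
}
```

If the feature behaves as described, the two columns should agree for every character.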

Note the need for anchors in the above example. Something like /\p{Script=<ee>}/ would match any script whose name contained a double "e".
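To see why the anchors matter, here is a small sketch (again assuming Perl 5.30+ and the experimental::uniprop_wildcards warning category) that tries an anchored and an unanchored wildcard on a Greek and a Cherokee letter. Both script names contain "ee", so the unanchored form should accept both characters.

```perl
use strict;
use warnings;
no warnings 'experimental::uniprop_wildcards';   # assumption: Perl 5.30+

my $anchored   = qr/\p{Script=<\AGreek\z>}/;   # only the value "Greek"
my $unanchored = qr/\p{Script=<ee>}/;          # any value containing "ee"

my $alpha    = "\x{3B1}";     # GREEK SMALL LETTER ALPHA
my $cherokee = "\x{13A0}";    # CHEROKEE LETTER A

for my $char ( $alpha, $cherokee ) {
    printf "U+%04X anchored=%s unanchored=%s\n", ord $char,
        ( $char =~ $anchored   ? 'yes' : 'no' ),
        ( $char =~ $unanchored ? 'yes' : 'no' );
}
```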

Because Unicode property values are case-blind ASCII, the wildcard specification can be considered to be wrapped in (?iaa: ... ). Note that the /n (no capture) qualifier is not present; /\p{Script=<\AGr(e)\1k\z>}/ (note the capture and back reference) in fact matches any character in the Greek script. Fortunately the wildcard is completely independent of the regular expression in which it is embedded, so you do not need to worry about captures in wildcard property values screwing up capture buffer numbers in the regexp that contains them, or vice versa.
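Both points can be checked with a short sketch, under the same assumptions as before (Perl 5.30+, warning category experimental::uniprop_wildcards). The wildcard is deliberately written in lower case and given a capture group of its own; the enclosing pattern has exactly one group.

```perl
use strict;
use warnings;
no warnings 'experimental::uniprop_wildcards';   # assumption: Perl 5.30+

# Outer pattern: one capture group, then a wildcard with its own group.
my $re = qr/(x)\p{Script=<\A(greek)\z>}/;

my $text    = "x\x{3B1}";    # "x" followed by GREEK SMALL LETTER ALPHA
my $matched = $text =~ $re ? 1 : 0;

# @+ holds one entry for the whole match plus one per group in the outer pattern.
my $groups = $matched ? scalar(@+) - 1 : -1;
my $first  = $matched ? $1 : undef;

print "matched=$matched groups=$groups\n";
```

If the wildcard really is independent of its container, the outer pattern should report a single group, and $1 should hold the "x".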

Because the values of Unicode properties are pretty restricted, just about any punctuation will be considered to delimit a wildcard. Brackets pair in the usual way, and you will need to avoid confusing the parser when it is trying to find the end of the whole regular expression. You may not follow the wildcard with modifiers; they need to be applied inside it, with something like (?x: ... ).
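As a rough illustration of the delimiter rules, the sketch below writes the same wildcard with two different delimiters. It assumes Perl 5.30+ and the experimental::uniprop_wildcards warning category. The second pattern uses ! for the enclosing regexp so that the slashes inside the wildcard do not end it early.

```perl
use strict;
use warnings;
no warnings 'experimental::uniprop_wildcards';   # assumption: Perl 5.30+

# Angle brackets pair up: the wildcard runs from "<" to the matching ">".
my $angle = qr/\p{Script=<\AGreek\z>}/;
# Slashes delimit the wildcard here; the enclosing pattern is delimited by "!".
my $slash = qr!\p{Script=/\AGreek\z/}!;

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

for my $re ( $angle, $slash ) {
    print( ( $alpha =~ $re ? "match" : "no match" ), "\n" );
}
```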

Note that this feature is experimental, and may change.

About Tom Wyant

Fine Perl code for over 0.005 centuries.