Perl regex escapes by version

Tom Christiansen compiled a table of escape sequences by the version of Perl that introduced them. This is the sort of Perl documentation that I like. It's too bad he doesn't blog, but I don't think he'll mind me reposting this part of his private email. :)

We've come up with all sorts of good ideas for Programming Perl since we turned the book in two weeks ago.

(Download as gist 1342877)

This list is sorted by escape, but you know Perl so you can re-sort it by version yourself:

# compiled by Tom Christiansen
v1.0	\0, \0N, \0NN	Match octal character up to octal 077.
v1.0	\N, \NN, \NNN	Match Nth capture group (decimal) if not in charclass and that many seen, else (octal) character up to octal 377.
v4.0	\a	Match the alert character (ALERT, BEL).
v5.0	\A	True at the beginning of a string only, not in charclass.
v1.0	\b	Match the backspace char (BACKSPACE, BS) in charclass only.
v1.0	\b	True at Unicode word boundary, outside of charclass only.
v1.0	\B	True when not at Unicode word boundary, not in charclass.
v4.0	\cX	Match ASCII control character Control-X (\cZ, \c[, \c?, etc).
v5.6	\C	Match one byte (C char) even in UTF‑8 (dangerous!), not in charclass.
v1.0	\d	Match any Unicode digit character.
v1.0	\D	Match any Unicode nondigit character.
v4.0	\e	Match the escape character (ESCAPE, ESC, not backslash).
v4.0	\E	End case (\F, \L, \U) or quotemeta (\Q) translation, only if interpolated.
v1.0	\f	Match the form feed character (FORM FEED, FF).
v5.16	\F	Foldcase (not lowercase) till \E, only if interpolated.
v5.10	\g{GROUP}	Match the named or numbered capture group, not in charclass.
v5.0	\G	True at end-of-match position of prior m//g or pos() setting, not in charclass.
v5.10	\h	Match any Unicode horizontal whitespace character.
v5.10	\H	Match any Unicode character except horizontal whitespace.
v5.10	\k<GROUP>	Match the named (not numbered) capture group; also \k'GROUP', not in charclass.
v5.10	\K	Keep text to the left of \K out of match, not in charclass.
v4.0	\l	Lowercase (not foldcase) next character only, only if interpolated.
v4.0	\L	Lowercase (not foldcase) till \E, only if interpolated.
v1.0	\n	Match the newline character (usually LINE FEED, LF).
v5.12	\N	Match any character except newline.
v5.6	\N{CHARNAME}	Match the named character, named alias, or named sequence, but only if interpolated and "use charnames" loaded.
v5.12	\N{U+XXXXX}	Match Unicode character given in any number of hex digits.
v5.14	\o{NNNNNN}	Match the character given in any number of octal digits.
v5.6	\p{PROPERTY}	Match any character with the named property.
v5.6	\P{PROPERTY}	Match any character without the named property.
v4.0	\Q	Quote (de-meta) metacharacters till \E.
v1.0	\r	Match the return character (usually CARRIAGE RETURN, CR).
v5.10	\R	Match any Unicode linebreak grapheme, only outside of charclass.
v1.0	\s	Match any Unicode whitespace character except \cK.
v1.0	\S	Match any Unicode nonwhitespace character or \cK.
v1.0	\t	Match the tab character (CHARACTER TABULATION, HT).
v4.0	\u	Titlecase (not uppercase) next character only, only if interpolated.
v4.0	\U	Uppercase (not titlecase) till \E, only if interpolated.
v5.10	\v	Match any Unicode vertical whitespace character.
v5.10	\V	Match any character except Unicode vertical whitespace.
v1.0	\w	Match any Unicode "word" character (alphabetics, digits, combining marks, and connector punctuation).
v1.0	\W	Match any Unicode nonword character.
v4.0	\xH	Match the character given in one hex digit.
v4.0	\xHH	Match the character given in two hex digits.
v5.6	\x{HHHHHH}	Match the character given in any number of hex digits.
v5.6	\X	Match Unicode extended grapheme cluster, only outside of charclass.
v5.5	\z	True at end of string only.
v5.0	\Z	True right before final newline, or at end of string.
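
Re-sorting by version takes only a few lines. What follows is a minimal sketch, not Tom's code: sort_by_version is a hypothetical helper name, and it assumes every row starts with a vMAJOR.MINOR label followed by whitespace, as in the table above. It compares major and minor numbers separately so that v5.6 lands before v5.10 (a plain string or floating-point comparison gets that wrong), and it breaks ties on the original position so rows that share a version stay in escape order.

```perl
use strict;
use warnings;

# Hypothetical helper: sort rows of the form "vMAJOR.MINOR <escape> <description>"
# by version, keeping the original order for rows with the same version.
sub sort_by_version {
    my @rows = @_;
    my $i = 0;
    return map  { $_->[0] }
           sort {    $a->[1] <=> $b->[1]     # major
                  or $a->[2] <=> $b->[2]     # minor, compared as an integer
                  or $a->[3] <=> $b->[3] }   # original position breaks ties
           map  {
               my ($major, $minor) = /^v(\d+)\.(\d+)\s/
                   or die "row does not start with a version: $_";
               [ $_, $major, $minor, $i++ ]
           } @rows;
}

# A few rows copied from the table, in their original (escape) order.
my @sample = split /\n/, <<'END';
v5.0	\A	True at the beginning of a string only, not in charclass.
v1.0	\d	Match any Unicode digit character.
v5.16	\F	Foldcase (not lowercase) till \E, only if interpolated.
v5.10	\K	Keep text to the left of \K out of match, not in charclass.
v5.6	\X	Match Unicode extended grapheme cluster, only outside of charclass.
v5.5	\z	True at end of string only.
END

print "$_\n" for sort_by_version(@sample);
```

With these six rows the output order is v1.0, v5.0, v5.5, v5.6, v5.10, v5.16. To run it over the whole table, drop the "# compiled by" comment line first (for example with grep { /^v\d/ }), since the helper dies on any row that lacks a version label.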

5 Comments

Wonderful! Thank you.

I take it \F is a preview of coming attractions, since it appears not to be in 5.15.4.

It appears to me that \N{U+XXXX} goes back to 5.8. At least, my copy of 5.8.8 prints 'yes' when fed

perl -le 'print $ARGV[0] =~ m/\N{U+41}/ ? "yes" : "no"' A.

My notes for PPIx::Regexp say this is documented in the 5.8.0 charnames (not perlre), though I have not dug it out to confirm this.

It appears to me (after a little playing around printing qr{...} serializations) that 5.12 is the release in which Perl started serializing \N{CHARNAME} as \N{U+XXXX}. It also appears to be in 5.12 that \N{U+XXXX} made it into perlre.

Very useful! Thanks to you and Tom!

--
chansen

@Tom Wyant: fold case is a planned feature, not an implemented feature. I hope it really makes it into 5.16, though given it made it into Programming Perl I assume someone (Karl?) is working on it already.

Leon, I believe that Brian Fraser already has the code done for the fc feature (although I've so far only read his docs, myself).

I sent him what I said about it in the Camel, so that he could merge anything that seemed useful.

I just did a pull, and I don't think he has a branch for it yet, though.

I should also mention that we already have a couple of pure-Perl versions of fc floating around. I wrote the original, and then Karl took my code and elaborated his own version based on mine (which wasn't very clever with its caching). That will go into Unicode::Casing as a fallback in case people want it who can't upgrade to whatever version of Perl sees the feature in its bundle.


About brian d foy

I'm the author of Mastering Perl, and the co-author of Learning Perl (6th Edition), Intermediate Perl, Programming Perl (4th Edition) and Effective Perl Programming (2nd Edition).