My Favorite Warnings: regexp

'A fair jaw-cracker dwarf-language must be.' -- Samwise Gamgee, The Lord of the Rings, II/iii: "The Ring Goes South", as quoted in regcomp.c, the Perl regular expression compiler.

As you would expect, this category gets you warnings about possibly-problematic regular expression constructions. A couple of specific examples:

Assuming NOT a POSIX class ...

This warning is about things that look kind of like POSIX character classes, but do not parse that way. The full diagnostic gives examples like [[:alnum]] (missing colon) and [[:digit:xyz] (missing right square bracket). These parse like simple character classes ([:[almnu]\] and [:[dgitxyz] respectively), so without the warning you get a hard-to-diagnose bug.
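A minimal sketch of the first case, to make the mis-parse concrete. The variable names and the compile_pattern() helper are mine, not anything from perldiag. The patterns live in strings so that they compile at run time, where a $SIG{__WARN__} handler can collect whatever the regexp category reports. I have left the exact warning text out because it depends on your Perl version.

```perl
use strict;
use warnings;

# Collect warnings rather than letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Hypothetical helper: compile a pattern held in a string at run time.
sub compile_pattern {
    my ($pat) = @_;
    return qr/$pat/;
}

my $typo    = compile_pattern('^[[:alnum]]$');     # missing colon
my $correct = compile_pattern('^[[:alnum:]]$');    # genuine POSIX class

for my $string ( 'a', 'a]', ':]' ) {
    printf "%-4s typo: %-3s correct: %s\n",
        $string,
        ( $string =~ $typo    ? 'yes' : 'no' ),
        ( $string =~ $correct ? 'yes' : 'no' );
}

# Whatever the regexp category had to say about the two patterns.
print "warning: $_" for @warnings;
```

The typo version only matches a two-character string ending in a right square bracket, which is almost certainly not what was intended.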

Unescaped left brace in regex is passed through ...

Efforts to eliminate unescaped left braces so that they are available for new syntax have been underway since 5.17.0, released May 2012. As I recall, this effort turned out to be much harder than originally anticipated because at least one toolchain external to Perl (autoconf if memory serves) relied on this behavior.
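The way to stay out of trouble is to escape every literal left brace somehow. Here is a sketch showing three spellings that I would expect to work on any Perl, followed by a probe of one unescaped pattern. The probe pattern 'foo{bar' is an arbitrary choice of mine, and whether it compiles silently, warns, or dies depends on the Perl version, so the code just reports what happened.

```perl
use strict;
use warnings;

my $text = 'if (x) { y }';

# Three ways to match a literal left brace without leaving it unescaped.
my %safe = (
    backslash => qr/\{/,
    class     => qr/[{]/,
    hex       => qr/\x7B/,
);

for my $name ( sort keys %safe ) {
    printf "%-10s %s\n", $name, ( $text =~ $safe{$name} ? 'matches' : 'no match' );
}

# Probe an unescaped left brace. The pattern is compiled at run time
# so that both warnings and compile errors can be captured.
my ( @warnings, $error );
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $pat = 'foo{bar';
    my $re  = eval { qr/$pat/ };
    $error = $@ unless defined $re;
}

print "warning: $_" for @warnings;
print "error: $error" if defined $error;
```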

Using /u for ...

The /a and /aa regular expression modifiers cause built-in character classes such as \d to match ASCII only. But some regular expression constructions such as \b{...} are explicitly Unicode. Perl interprets these as written, but warns you. Note that \b{...} is an example of the new functionality added by re-purposing curly brackets.
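To illustrate, here is a small sketch assuming Perl 5.14 or later for /a. The first half shows what /a does to \d, using ARABIC-INDIC DIGIT ONE as a non-ASCII digit. The second half compiles \b{wb} under /a at run time and collects any warnings. The \b{...} syntax needs Perl 5.22 or later, so the compile is wrapped in an eval and the result is merely reported.

```perl
use strict;
use warnings;

my $arabic_one = "\x{0661}";    # ARABIC-INDIC DIGIT ONE

# \d is Unicode-aware by default, ASCII-only under /a.
my $unicode_digit = $arabic_one =~ /\d/  ? 1 : 0;
my $ascii_digit   = $arabic_one =~ /\d/a ? 1 : 0;

print "matches \\d without /a: $unicode_digit\n";
print "matches \\d with /a:    $ascii_digit\n";

# A Unicode word boundary under /a. The pattern is in a string so
# that it compiles at run time, inside the scope of the handler.
my @warnings;
my $boundary = do {
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $pat = '\b{wb}';
    eval { qr/$pat/a };
};

print defined $boundary ? "compiled\n" : "did not compile: $@";
print "warning: $_" for @warnings;
```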

The above list is far from exhaustive. There are diagnostics for superfluous quantifiers (on zero-width assertions) and greediness specifications (on fixed-width items), since regular expressions are already "A fair jaw-cracker" without the unnecessary cruft. In addition, there are diagnostics for invalid or meaningless uses of the /c, /g, and /p modifiers.
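A sketch of the first two of these, using the example patterns given in perldiag. A quantifier on a look-ahead almost always belongs inside the look-ahead, and a non-greedy marker on a fixed count changes nothing. The %warnings_for hash and the choice to compile from strings are mine. The code records what each pattern provokes rather than assuming any particular wording, since that varies by Perl version.

```perl
use strict;
use warnings;

my @patterns = (
    'abc(?=xyz){3}',        # quantifier on a zero-width assertion
    'abc(?=(?:xyz){3})',    # quantifier moved inside the assertion
    'a{3}?',                # greediness marker on a fixed count
    'a{3}',                 # same match, no marker
);

# Compile each pattern at run time and record any warnings it causes.
my %warnings_for;
for my $pat (@patterns) {
    $warnings_for{$pat} = [];
    local $SIG{__WARN__} = sub { push @{ $warnings_for{$pat} }, @_ };
    my $re = eval { qr/$pat/ };
    push @{ $warnings_for{$pat} }, "compile failed: $@" unless defined $re;
}

for my $pat (@patterns) {
    printf "%-20s %d warning(s)\n", $pat, scalar @{ $warnings_for{$pat} };
    print "    $_" for @{ $warnings_for{$pat} };
}

# The form with the quantifier inside the look-ahead does what was meant:
# "abc", provided three repetitions of "xyz" follow.
my $three_xyz = qr/abc(?=(?:xyz){3})/;
print "abcxyzxyzxyz: ", ( 'abcxyzxyzxyz' =~ $three_xyz ? 'match' : 'no match' ), "\n";
print "abcxyz:       ", ( 'abcxyz'       =~ $three_xyz ? 'match' : 'no match' ), "\n";
```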

Within the scope of a use re 'strict'; pragma, additional diagnostics are possible. This pragma was the subject of last week's blog, My Favorite Modules: re, which was written as background for this blog entry.

Note that use re 'strict'; is documented as experimental, with the warning that even the interface to the functionality may change. Too bad, because I would kind of like to enable some of the additional diagnostics:

Empty (?) without any modifiers in regex ...

This is of note because one of the diagnostics enabled by use warnings 'ambiguous'; recommends the use of this construction as a way of removing the ambiguity. See My Favorite Warnings: ambiguous for details.

"%s" is more clearly written simply as "%s" ...

This is about representations of single characters. I imagined from the text of the diagnostic that it was about something like writing \x07 versus \a or \N{ALERT}, but I was unable to get this diagnostic after a grueling 2-3 minutes of playing with it.

Unescaped literal right square brackets and braces

Makes sense to me. I did not quote the diagnostic because in this context the '%c' that represents the character is too opaque to be helpful.

Previous entries in this series:

  1. A Belated Introduction

  2. once

  3. redundant and missing

  4. exiting

  5. uninitialized

  6. redefine

  7. Ex-Warnings

  8. deprecated

  9. experimental

  10. shadow

  11. syntax

  12. ambiguous

  13. closure

  14. qw

  15. precedence


About Tom Wyant

I blog about Perl.