Untrusted Numeric Input
David Farrell's Perl.com article "Validating untrusted input: numbers" got me thinking, specifically about the role of \d in sanitizing input. I am not going to talk here about looks_like_number(), because the referenced article covers it.
The thing is, on any Perl recent enough to be Unicode-aware, \d matches digits, whether or not they are ASCII. This may be a problem if you are sanitizing data for numeric conversion, because typically conversion routines expect ASCII digits. There seem to me to be at least two ways to deal with this: restrict your regexp patterns to ASCII, or have the conversion routine deal with the full range of Unicode digits.
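To make the mismatch concrete, here is a minimal sketch. The variable names are mine; U+0663 (ARABIC-INDIC DIGIT THREE) is just one convenient non-ASCII digit.

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit, but not an ASCII one.
my $input = "\x{0663}";

# The string holds a code point above 0xFF, so \d matches under Unicode rules.
my $looks_numeric = $input =~ m/ \A \d+ \z /smx ? 1 : 0;

# Perl's own string-to-number conversion only understands ASCII digits.
# Without the "no warnings" it would also emit an "isn't numeric" warning.
my $value = do { no warnings 'numeric'; $input + 0 };

printf "matched \\d: %d, numeric value: %d\n", $looks_numeric, $value;
```

The pattern accepts the string, yet the conversion yields 0 rather than 3.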
Restrict Patterns to ASCII
If you truly want ASCII digits for your system, there are a number of ways to restrict a regular expression pattern to ASCII.
Use two-level validation, a.k.a. brute force
By this I simply mean explicitly validating anything that matched \d by also matching it against [[:ascii:]] in a second regular expression.
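A minimal sketch of the two-pass check follows. The helper name is_ascii_digits is hypothetical, not something from the referenced article.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the string is one or more ASCII digits.
sub is_ascii_digits {
    my ($str) = @_;
    return 0 unless $str =~ m/ \A \d+ \z /smx;            # digits in any script
    return 0 unless $str =~ m/ \A [[:ascii:]]+ \z /smx;   # and nothing non-ASCII
    return 1;
}

print is_ascii_digits('42')         ? "ok\n" : "rejected\n";
print is_ascii_digits("4\x{0662}")  ? "ok\n" : "rejected\n";
```

The first pass guarantees digits, the second guarantees ASCII, so only characters in both sets survive.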
Use [0-9] instead of \d
Beginning with Perl 5.21.5, perlrecharclass documents that bracketed character classes [A-Z], [a-z], [0-9], and subranges of these match as though they were ASCII, even on non-ASCII platforms. I believe this behavior goes back further, since perl5215delta.pod calls this a documentation change rather than a functional change.
Perl-Critic users should be aware that using [0-9] instead of \d is a violation of core policy Perl::Critic::Policy::RegularExpressions::ProhibitEnumeratedClasses. This seems to me to be an exception that proves the rule, in the original meaning of "prove" (i.e. "test").
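A short sketch of the enumerated-class approach follows. The "no critic" annotation assumes the policy is the one Perl::Critic ships as RegularExpressions::ProhibitEnumeratedClasses; the labels in the hash are only for display.

```perl
use strict;
use warnings;

# [0-9] matches only the ten ASCII digits, whatever rules are in effect.
my $ascii_digits = qr/ \A [0-9]+ \z /smx;    ## no critic (RegularExpressions::ProhibitEnumeratedClasses)

my %candidates = (
    'ASCII 2024'      => '2024',
    'Arabic-Indic 20' => "\x{0662}\x{0660}",
);

for my $label ( sort keys %candidates ) {
    my $verdict = $candidates{$label} =~ $ascii_digits ? 'accepted' : 'rejected';
    print "$label: $verdict\n";
}
```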
Use a Unicode character class
There are enough of these that the problem is sifting through them and finding one that does what you want on your version of Perl. All I can offer here is that \p{IsPosixDigit} works as far back as Perl 5.10.1, but is unknown to Perl 5.8.9. The perluniprops documentation calls this a Perl extension, and documents it as matching [0-9].
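As a sketch, assuming a Perl new enough to know the property:

```perl
use strict;
use warnings;

# \p{IsPosixDigit} is documented as a Perl extension matching [0-9].
my $posix_digits = qr/ \A \p{IsPosixDigit}+ \z /smx;

my $ascii_ok     = '12345'   =~ $posix_digits ? 1 : 0;
my $non_ascii_ok = "\x{0663}" =~ $posix_digits ? 1 : 0;

print "ASCII digits: $ascii_ok, Arabic-Indic digit: $non_ascii_ok\n";
```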
Use the /a modifier
Beginning with Perl 5.13.10, you can use the /a regular expression modifier (or equivalently (?a:...)) to restrict \d, \s, \w, and the POSIX character classes to match only ASCII.
Note that although perlrecharclass says that [[:digit:]] matches [0-9], it also says it is equivalent to \d, and experimentation shows that unless it is qualified in some way it matches non-ASCII digits.
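The difference can be sketched as follows (variable names are mine). The same non-ASCII digit is tried against [[:digit:]] with and without /a, and against \d inside (?a:...).

```perl
use strict;
use warnings;

my $digit = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# Unqualified, the POSIX class follows Unicode rules for this string.
my $posix_plain = $digit =~ m/ \A [[:digit:]] \z /smx  ? 1 : 0;

# With /a the same class is restricted to ASCII.
my $posix_ascii = $digit =~ m/ \A [[:digit:]] \z /smxa ? 1 : 0;

# (?a:...) applies the restriction to just part of a pattern.
my $d_ascii = $digit =~ m/ \A (?a:\d) \z /smx ? 1 : 0;

print "plain: $posix_plain, /a: $posix_ascii, (?a:\\d): $d_ascii\n";
```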
Use extended bracketed character classes
Beginning with Perl 5.17.8, you can use extended bracketed character classes to restrict your match to ASCII, replacing \d with something like (?[ \d & [[:ascii:]] ]).
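A sketch of the intersection in use. Older Perls treated this construct as experimental, so the example silences that warning category; I am assuming newer Perls simply accept the "no warnings" line.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';    # construct was experimental on older Perls

# Intersection of "any digit" with "any ASCII character".
my $ascii_digits = qr/ \A (?[ \d & [[:ascii:]] ])+ \z /smx;

my $ascii_ok     = '2024'    =~ $ascii_digits ? 1 : 0;
my $non_ascii_ok = "\x{0663}" =~ $ascii_digits ? 1 : 0;

print "ASCII digits: $ascii_ok, Arabic-Indic digit: $non_ascii_ok\n";
```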
Use custom character properties
Beginning with Perl 5.8, you can define your own character properties. These are documented in perlunicode.
For this specific use, it suffices to define a subroutine named, say, IsASCIIDigit:
sub IsASCIIDigit { return "30 39\n"; }
The name of the subroutine (qualified with package name if necessary) can then be used in a regular expression: qr/ \p{IsASCIIDigit} /smx.
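Put together as a runnable sketch, with the subroutine and the pattern in the same package so no qualification is needed:

```perl
use strict;
use warnings;

# User-defined property: one range per line, as hexadecimal code points.
# 30..39 covers the ASCII digits 0 through 9.
sub IsASCIIDigit { return "30 39\n"; }

my $ascii_digits = qr/ \A \p{IsASCIIDigit}+ \z /smx;

my $ascii_ok     = '0123456789' =~ $ascii_digits ? 1 : 0;
my $non_ascii_ok = "\x{0663}"    =~ $ascii_digits ? 1 : 0;

print "ASCII digits: $ascii_ok, Arabic-Indic digit: $non_ascii_ok\n";
```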
Use DeMorgan's laws
If you hold your tongue right you can actually put together an old-style bracketed character class that matches only ASCII digits, using DeMorgan's Laws, to wit: [^\D[:^ascii:]]. The deal here is that we negate a bracketed character class that contains all non-digits and all non-ASCII characters, leaving us with just the ASCII digits.
This should work back to Perl 5.6, though that version is not recommended for Unicode.
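A sketch of the negated class at work, on a modern Perl:

```perl
use strict;
use warnings;

# NOT (non-digit OR non-ASCII) is the same as (digit AND ASCII).
my $ascii_digits = qr/ \A [^\D[:^ascii:]]+ \z /smx;

my $ascii_ok     = '2024'    =~ $ascii_digits ? 1 : 0;
my $non_ascii_ok = "\x{0663}" =~ $ascii_digits ? 1 : 0;
my $letter_ok    = 'x'       =~ $ascii_digits ? 1 : 0;

print "ASCII digits: $ascii_ok, Arabic-Indic digit: $non_ascii_ok, letter: $letter_ok\n";
```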
Just Convert Unicode
If you don't want to restrict your digits to ASCII, there are ways of dealing with non-ASCII numbers -- or at least non-ASCII integers.
According to Module::Corelist, Unicode::UCD has been in core since Perl 5.7.3. This module exports, among other things, num(), which is advertised as converting strings of Unicode digits to numbers. I confess to having no actual experience with this, but the docs say it requires all converted characters to be from the same script. It converts as many characters as it can, but trailing unconverted characters are not an error. For full validation you would have to either enclose your \d in a (*script_run:...) (which requires Perl 5.27.9 or above) or make use of the second argument to retrieve the number of characters actually converted and compare it to the length of the original string.
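A minimal sketch of the happy path follows. It only exercises a string made entirely of digits from one script, plus a string with no digits at all. It deliberately avoids partially valid input, whose handling should be checked against the Unicode::UCD documentation for your Perl version. The helper name to_number is hypothetical.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical helper: validate with \d, then let num() do the conversion.
sub to_number {
    my ($str) = @_;
    return undef unless $str =~ m/ \A \d+ \z /smx;
    return num($str);    # may itself return undef if it refuses the string
}

my $arabic_indic = to_number("\x{0663}\x{0664}");    # digits three and four
my $ascii        = to_number('34');
my $not_a_number = to_number('abc');

printf "Arabic-Indic: %d, ASCII: %d, letters: %s\n",
    $arabic_indic, $ascii, defined $not_a_number ? 'converted' : 'rejected';
```

Callers should always check the result for definedness before using it.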
Or ...
Perl being Perl, I'm sure I have missed other ways of doing it.
Or restrict your data to (binary) bytes instead of (unicode) characters?
TIMTOWTDI in all its splendor :D
Just to add to the mix, when researching my response to a letter triggered by this blog post, I found the following as the last paragraph in the perlrecharclass section on POSIX Character Classes: "It is proposed to change this behavior in a future release of Perl so that whether or not Unicode rules are in effect would not change the behavior: Outside of locale, the POSIX classes would behave like their ASCII-range counterparts. If you wish to comment on this proposal, send email to "perl5-porters@perl.org"."