Untrusted Numeric Input

By Tom Wyant on January 15, 2019 2:26 AM

David Farrell's Perl.com article "Validating untrusted input: numbers" got me thinking, specifically about the role of \d in sanitizing input. I am not going to talk here about looks_like_number(), because the referenced article covers it.

The thing is, on any Perl recent enough to be Unicode-aware, \d matches digits, whether or not they are ASCII. This may be a problem if you are sanitizing data for numeric conversion, because typically conversion routines expect ASCII digits. There seem to me to be at least two ways to deal with this: restrict your regular expression patterns to ASCII, or have the conversion routine deal with the full range of Unicode digits.
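
A minimal sketch of the problem, assuming a Unicode-aware Perl. The variable names are made up for illustration; the string holds two Arabic-Indic digits, written as escapes so the source file stays ASCII.

```perl
use strict;
use warnings;

# U+0664 and U+0662 are ARABIC-INDIC DIGIT FOUR and ARABIC-INDIC DIGIT TWO.
my $input = "\x{0664}\x{0662}";

# \d accepts them, because both characters are Unicode decimal digits.
my $passes_d = ( $input =~ /\A\d+\z/ ) ? 1 : 0;

# Ordinary numeric conversion does not understand them. The 'numeric'
# warning is silenced here only to keep the demonstration quiet.
my $value = do { no warnings 'numeric'; $input + 0 };

print "passes \\d+: $passes_d; numeric value: $value\n";
```

The string passes the \d+ check, yet numifies to 0 rather than 42, which is exactly the gap the rest of this post is about.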

Restrict Patterns to ASCII

If you truly want ASCII digits for your system, there are a number of ways to restrict a regular expression pattern to ASCII.

Use two-level validation, a.k.a. brute force

By this I simply mean explicitly validating anything that matched \d by also matching it against the POSIX class [:ascii:] (written [[:ascii:]] in a pattern) in a second regular expression.
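
One way this might look, as a sketch only. The helper name is_ascii_digits is hypothetical, and the anchors \A and \z are an assumption about wanting the whole string validated.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the string is all digits AND all ASCII.
sub is_ascii_digits {
    my ($string) = @_;
    return 0 unless $string =~ /\A\d+\z/;            # pass 1: digits of any script
    return 0 unless $string =~ /\A[[:ascii:]]+\z/;   # pass 2: ASCII characters only
    return 1;
}

for my $candidate ( "2019", "\x{0662}\x{0660}\x{0661}\x{0669}" ) {
    # Show code points rather than the characters themselves, so no
    # output encoding layer is needed.
    my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $candidate;
    printf "%-28s %s\n", $shown,
        is_ascii_digits($candidate) ? 'accepted' : 'rejected';
}
```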

Use `[0-9]` instead of `\d`

Beginning with Perl 5.21.5, perlrecharclass documents that bracketed character classes [A-Z], [a-z], [0-9], and subranges of these match as though they were ASCII, even on non-ASCII platforms. I believe this behavior goes back further, since perl5215delta.pod calls this a documentation change rather than a functional change.

Perl-Critic users should be aware that using [0-9] instead of \d is a violation of core policy Perl::Critic::Policy::RegularExpressions::ProhibitEnumeratedCharacterClasses. This seems to me to be an exception that proves the rule, in the original meaning of "prove" (i.e. "test").
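
A short sketch of the [0-9] approach. The variable names are illustrative, and the "## no critic" annotation is only relevant if you run Perl::Critic with this policy enabled.

```perl
use strict;
use warnings;

## no critic (RegularExpressions::ProhibitEnumeratedCharacterClasses)
my $ascii_integer = qr/\A[0-9]+\z/;
## use critic

my $ascii  = "2019";
my $arabic = "\x{0662}\x{0660}\x{0661}\x{0669}";   # ARABIC-INDIC 2, 0, 1, 9

print "ASCII digits match\n"               if $ascii  =~ $ascii_integer;
print "Arabic-Indic digits do not match\n" if $arabic !~ $ascii_integer;
```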

Use a Unicode character class

There are enough of these that the problem is sifting through them and finding one that does what you want on your version of Perl. All I can offer here is that \p{IsPosixDigit} works as far back as Perl 5.10.1, but is unknown to Perl 5.8.9. The perluniprops documentation calls this a Perl extension, and documents it as matching [0-9].
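
A sketch of the property in use, assuming a Perl that knows \p{IsPosixDigit}. The variable names are illustrative.

```perl
use strict;
use warnings;

# \p{IsPosixDigit} is documented in perluniprops as matching [0-9].
my $posix_integer = qr/\A\p{IsPosixDigit}+\z/;

my $ascii  = "2019";
my $arabic = "\x{0662}\x{0660}\x{0661}\x{0669}";   # ARABIC-INDIC 2, 0, 1, 9

print "ASCII digits match\n"               if $ascii  =~ $posix_integer;
print "Arabic-Indic digits do not match\n" if $arabic !~ $posix_integer;
```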

Use the `/a` modifier

Beginning with Perl 5.13.10, you can use the /a regular expression modifier (or equivalently (?a:...)) to restrict \d, \s, \w, and the POSIX character classes, to match only ASCII.

Note that although perlrecharclass says that [[:digit:]] matches [0-9], it also says it is equivalent to \d, and experimentation shows that unless qualified some way it matches non-ASCII digits.
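
The following sketch contrasts the unqualified classes with their /a counterparts. It assumes a Perl new enough to have /a; the variable names are illustrative.

```perl
use strict;
use warnings;

my $arabic = "\x{0664}\x{0662}";   # ARABIC-INDIC DIGIT FOUR, TWO

# Without /a, both \d and [[:digit:]] accept the non-ASCII digits.
my $plain_d      = $arabic =~ /\A\d+\z/           ? 1 : 0;
my $plain_posix  = $arabic =~ /\A[[:digit:]]+\z/  ? 1 : 0;

# With /a, or the inline (?a:...) form, both are restricted to ASCII.
my $ascii_d      = $arabic =~ /\A\d+\z/a          ? 1 : 0;
my $ascii_posix  = $arabic =~ /\A[[:digit:]]+\z/a ? 1 : 0;
my $inline_ascii = $arabic =~ /\A(?a:\d+)\z/      ? 1 : 0;

# ASCII digits still match under /a.
my $ascii_ok     = "42"    =~ /\A\d+\z/a          ? 1 : 0;

print "plain: $plain_d $plain_posix; /a: $ascii_d $ascii_posix $inline_ascii; ASCII under /a: $ascii_ok\n";
```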

Use extended bracketed character classes

Beginning with Perl 5.17.8, you can use extended bracketed character classes to restrict your match to ASCII, replacing \d with something like (?[ \d & [[:ascii:]] ]).
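
A sketch of that pattern in a complete script. The "no warnings" line is there because some Perl versions treat this construct as experimental and warn about it; the variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # quiet on Perls that still warn

# Intersection of "any digit" and "any ASCII character".
my $ascii_integer = qr/\A(?[ \d & [[:ascii:]] ])+\z/;

my $ascii  = "2019";
my $arabic = "\x{0662}\x{0660}\x{0661}\x{0669}";   # ARABIC-INDIC 2, 0, 1, 9

print "ASCII digits match\n"               if $ascii  =~ $ascii_integer;
print "Arabic-Indic digits do not match\n" if $arabic !~ $ascii_integer;
```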

Use custom character properties

Beginning with Perl 5.8, you can define your own character properties. These are documented in perlunicode.

For this specific use, it suffices to define a subroutine named, say, IsASCIIDigit:

sub IsASCIIDigit {
    # One range per line, as hexadecimal code points: U+0030 through U+0039.
    return "30 39\n";
}

The name of the subroutine (qualified with package name if necessary) can then be used in a regular expression: qr/ \p{IsASCIIDigit} /smx.
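
Put together as a runnable sketch, with the subroutine defined in package main before the pattern is compiled. The variable names are illustrative.

```perl
use strict;
use warnings;

# User-defined property: a hex code point range covering ASCII 0 through 9.
sub IsASCIIDigit {
    return "30 39\n";
}

my $ascii_integer = qr/ \A \p{IsASCIIDigit}+ \z /smx;

my $ascii  = "2019";
my $arabic = "\x{0662}\x{0660}\x{0661}\x{0669}";   # ARABIC-INDIC 2, 0, 1, 9

print "ASCII digits match\n"               if $ascii  =~ $ascii_integer;
print "Arabic-Indic digits do not match\n" if $arabic !~ $ascii_integer;
```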

Use De Morgan's laws

If you hold your tongue right you can actually put together an old-style bracketed character class that matches only ASCII digits, using De Morgan's laws, to wit: [^\D[:^ascii:]]. The deal here is that we negate a bracketed character class that contains all non-digits and all non-ASCII characters, leaving us with just the ASCII digits.

This should work back to Perl 5.6, though that version is not recommended for Unicode.
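
A sketch of the negated class at work, tried on a modern Perl rather than 5.6. The variable names are illustrative.

```perl
use strict;
use warnings;

# not (non-digit or non-ASCII)  ==  digit and ASCII
my $ascii_integer = qr/\A[^\D[:^ascii:]]+\z/;

my $ascii  = "2019";
my $arabic = "\x{0662}\x{0660}\x{0661}\x{0669}";   # ARABIC-INDIC 2, 0, 1, 9
my $alpha  = "20a9";

print "ASCII digits match\n"               if $ascii  =~ $ascii_integer;
print "Arabic-Indic digits do not match\n" if $arabic !~ $ascii_integer;
print "ASCII letters do not match\n"       if $alpha  !~ $ascii_integer;
```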

Just Convert Unicode

If you don't want to restrict your digits to ASCII, there are ways of dealing with non-ASCII numbers -- or at least non-ASCII integers.

According to Module::CoreList, Unicode::UCD has been in core since Perl 5.7.3. This module exports, among other things, num(), which is advertised as converting strings of Unicode digits to numbers. I confess to having no actual experience with this, but the docs say that a multi-character string must consist entirely of decimal digits from the same script, and that num() returns undef otherwise.

If you also want to know how much of the string was usable, the docs describe an optional second argument, a reference to a scalar, which receives the length of the valid initial substring; you can compare that to the length of the original string. Alternatively, you can reject mixed-script input in the pattern itself by enclosing your \d in a (*script_run:...), which requires Perl 5.27.9 or above.
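
A cautious sketch of conversion with num(), assuming a Perl whose Unicode::UCD exports it. The wrapper name unicode_digits_to_number is hypothetical. It checks \d+ up front and then treats an undef from num() as rejection, so it does not rely on any particular handling of trailing non-digits.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical wrapper: returns the integer value, or undef if the string
# is not made up entirely of digits that num() is willing to convert.
sub unicode_digits_to_number {
    my ($string) = @_;
    return unless $string =~ /\A\d+\z/;   # digits of any script, nothing else
    return num($string);                  # undef if the digits mix scripts
}

my $number = unicode_digits_to_number("\x{0664}\x{0662}");   # Arabic-Indic 4, 2
print( defined($number) ? "converted to $number\n" : "rejected\n" );

my $mixed = unicode_digits_to_number("4\x{0662}");           # ASCII 4, Arabic-Indic 2
print( defined($mixed) ? "converted to $mixed\n" : "rejected\n" );
```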

Or ...

Perl being Perl, I'm sure I have missed other ways of doing it.

3 comments

jrw32982 | January 18, 2019 1:05 AM

Or restrict your data to (binary) bytes instead of (unicode) characters?

Olivier Duclos | January 21, 2019 8:58 AM

TIMTOWTDI in all its splendor :D

Tom Wyant | January 21, 2019 3:39 PM

Just to add to the mix, when researching my response to a letter triggered by this blog post, I found the following as the last paragraph in the perlrecharclass section on POSIX Character Classes:

It is proposed to change this behavior in a future release of Perl so that whether or not Unicode rules are in effect would not change the behavior: Outside of locale, the POSIX classes would behave like their ASCII-range counterparts. If you wish to comment on this proposal, send email to "perl5-porters@perl.org".
