\d does not validate numbers

By Ben Bullock on May 6, 2017 1:18 PM

http://stackoverflow.com/questions/43814055/easy-to-check-if-user-input-is-a-number-in-perl

points us to this Perl FAQ:

http://perldoc.perl.org/perlfaq4.html#How-do-I-determine-whether-a-scalar-is-a-number%2fwhole%2finteger%2ffloat%3f

Unfortunately, the regular expression part of the above FAQ page is wrong. \d doesn't validate numbers, unless you have already verified that your input contains only ASCII characters.

What \d does is check whether a character is regarded as a decimal digit in Unicode. For example, \d will happily match things like U+07C2: '߂' NKO DIGIT TWO, or U+096F: '९' DEVANAGARI DIGIT NINE, and 360 other characters which are not valid as numerals. If you need to use a regular expression to validate whether something is a number, use [0-9] to match digits, not \d.
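A minimal sketch of the difference, assuming the input arrives as a decoded character string. The helper names all_backslash_d and all_ascii_digits are made up for this illustration, and the sample strings are written as \x{...} escapes so that the source stays ASCII:

```perl
use strict;
use warnings;

# Return 1 if the whole string matches \d+ (Unicode-aware), else 0.
sub all_backslash_d { return $_[0] =~ /\A\d+\z/ ? 1 : 0 }

# Return 1 if the whole string consists only of ASCII digits 0-9, else 0.
sub all_ascii_digits { return $_[0] =~ /\A[0-9]+\z/ ? 1 : 0 }

# Sample inputs (hypothetical), one ASCII and two non-ASCII.
my @samples = (
    [ 'ASCII 42',              "42"       ],
    [ 'NKO DIGIT TWO',         "\x{07C2}" ],
    [ 'DEVANAGARI DIGIT NINE', "\x{096F}" ],
);

for my $sample (@samples) {
    my ($name, $text) = @$sample;
    printf "%-22s \\d+: %d  [0-9]+: %d\n",
        $name, all_backslash_d($text), all_ascii_digits($text);
}
```

Both checks accept "42", but only the \d version accepts the two non-ASCII digits.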

I'm aware of these defects in the use of \d for validating numbers because I used it to validate user input on the following web page:

http://www.sljfaq.org/cgi/numbers.cgi

Before I removed \d everywhere a few years ago, it was not uncommon to unravel bugs resulting from people typing in Devanagari or other non-ASCII numerals which had been validated using \d.
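The kind of bug involved can be sketched as follows. This is an illustration rather than the code from the page above: the helper numeric_value and the sample input are hypothetical. It relies on the fact that Perl's own string-to-number conversion understands only ASCII digits:

```perl
use strict;
use warnings;

# Convert a string to a number the way ordinary Perl arithmetic does.
sub numeric_value {
    my ($text) = @_;
    no warnings 'numeric';    # suppress the "isn't numeric" warning for this demo
    return $text + 0;
}

my $typed = "\x{096F}";       # DEVANAGARI DIGIT NINE (hypothetical user input)

if ($typed =~ /\A\d+\z/) {
    # The \d check passed, but the value used in arithmetic is 0, not 9.
    print "accepted, numeric value: ", numeric_value($typed), "\n";
}
```

The input passes validation and then silently becomes 0 in any calculation.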

(This post was edited on 4 December 2018 to remove links to some old pages which are now gone from my web site.)

7 comments

Aristotle | May 7, 2017 12:45 AM

That’s what /a is for.

Tom Wyant | May 7, 2017 1:08 AM

Note that /a requires Perl 5.13.10 or higher.
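For illustration, a small sketch of the /a modifier mentioned in the two comments above. The helper name all_digits_a is hypothetical, and `use 5.014` is assumed here as the first stable release carrying the modifier:

```perl
use 5.014;       # assumed minimum: first stable release with the /a modifier
use strict;
use warnings;

# With /a, \d is restricted to the ASCII digits 0-9.
sub all_digits_a { return $_[0] =~ /\A\d+\z/a ? 1 : 0 }

print all_digits_a("2017"), "\n";        # ASCII digits are accepted
print all_digits_a("\x{096F}"), "\n";    # DEVANAGARI DIGIT NINE is rejected
```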

steffenw | May 8, 2017 4:04 PM

Then [:digit:] is \d, and also not [0-9].
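A quick sketch of this point, with hypothetical helper names. Inside a bracketed character class the POSIX form is written [[:digit:]], and it follows the same rules as \d, including the restriction applied by /a (assumed to need Perl 5.14):

```perl
use 5.014;       # assumed minimum for the /a modifier
use strict;
use warnings;

# [[:digit:]] under the default rules, and under /a.
sub posix_digits       { return $_[0] =~ /\A[[:digit:]]+\z/  ? 1 : 0 }
sub posix_digits_ascii { return $_[0] =~ /\A[[:digit:]]+\z/a ? 1 : 0 }

my $nine = "\x{096F}";    # DEVANAGARI DIGIT NINE

print posix_digits($nine), "\n";          # matches, like \d
print posix_digits_ascii($nine), "\n";    # does not match under /a
```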

Yuki Kimoto | May 8, 2017 11:57 PM

I'm using [0-9].

\d also matches Unicode digits.

Ben Bullock replied to comment from Yuki Kimoto | May 9, 2017 11:58 AM

Yes, I switched to using [0-9] almost everywhere. I think it's simpler.

Karl Williamson | May 11, 2017 4:22 AM

DEVANAGARI DIGIT NINE, for example, is used by millions of people millions, perhaps billions, of times a day as an essential component of their numbers. I don't know if you are being careless with your terminology, or wrongly arrogant about the place in the universe of [0-9].

Unicode::UCD::num(), since Perl 5.14, can be used to make sure that a string of digits are all from the same script, so are not spoofing attempts, returning the numeric value the string represents, or undef if it is illegal.
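A minimal sketch of that approach, assuming a Perl recent enough to ship Unicode::UCD with num(). The wrapper safe_number is a hypothetical name introduced only for this example:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Return the numeric value of a digit string, or undef if num() rejects it
# (for example, because the digits come from different scripts).
sub safe_number {
    my ($text) = @_;
    my $value = num($text);
    return $value;
}

my $devanagari = "\x{0967}\x{096F}";   # DEVANAGARI DIGIT ONE, DEVANAGARI DIGIT NINE
my $mixed      = "1\x{096F}";          # ASCII 1 followed by DEVANAGARI DIGIT NINE

for my $text ($devanagari, $mixed) {
    my $value = safe_number($text);
    print defined $value ? "value: $value\n" : "rejected\n";
}
```

The all-Devanagari string yields 19, while the mixed-script string is rejected.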

Ben Bullock replied to comment from Aristotle | May 11, 2017 8:57 AM

> That’s what /a is for.

As a followup to this article, I am thinking about making another blog post showing how \d is used to match numbers in actual CPAN modules. It's used for number validation in more than a thousand modules. For example, here are the matches for /\\d\./:

http://grep.cpan.me/?q=%5C%5Cd%5C.

and here are the matches for /\\d\+\./:

http://grep.cpan.me/?q=%5C%5Cd%5C%2B%5C..*%2F%5Bb-z%5D*a

Noting your comment, I tried searching for CPAN modules which use the /a flag to restrict \d so that it only matches ASCII digits. I haven't spent very long, but so far I haven't found any.

The following searches bring up a few false positives, but no actual uses of the flag:

http://grep.cpan.me/?q=%5C%5Cd%5C%2B%5C.%5B%5E%2F%5Cn%5D*%2F%5Bb-z%5D*a

http://grep.cpan.me/?q=%5C%5Cd%5C.%5B%5E%2F%5Cn%5D*%2F%5Bb-z%5D*a

Thanks for any assistance with this.