Untrusted Numeric Input -- /[0-9]/

My blog entry of a couple weeks ago, Untrusted Numeric Input, dealt mainly with the problem of ensuring that supposedly-numeric input actually consisted only of ASCII digits. One of the ways to do this was to use the bracketed character class [0-9] instead of \d. This was documented as being portable as of Perl 5.21.5, and I made the statement that "I believe this behavior goes back further ..." This was clearly just hopeful hand-waving, and not very helpful.
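A minimal sketch of the difference between the two, for readers who have not seen the earlier entry. The sub names and sample strings are hypothetical, chosen only for illustration; the second sample consists of Arabic-Indic digits, which \d accepts under Unicode rules.

```perl
use strict;
use warnings;

# True if the whole string matches \d (which may include non-ASCII digits).
sub is_digits_d {
    my ($string) = @_;
    return $string =~ /\A\d+\z/ ? 1 : 0;
}

# True if the whole string consists only of the characters in [0-9].
sub is_digits_class {
    my ($string) = @_;
    return $string =~ /\A[0-9]+\z/ ? 1 : 0;
}

my %sample = (
    'ASCII digits'        => '42',
    'Arabic-Indic digits' => "\x{0664}\x{0662}",    # DIGIT FOUR, DIGIT TWO
);

for my $label ( sort keys %sample ) {
    printf "%-20s \\d: %d  [0-9]: %d\n", $label,
        is_digits_d( $sample{$label} ),
        is_digits_class( $sample{$label} );
}
```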

This blog post documents my efforts to try to quantify the versions of Perl under which [0-9] is portable. For those disinclined to read further, my conclusion is that [A-Z], [a-z], [0-9], and their sub-ranges are portable among character sets as far back as Perl 5.8.0.

Since I do not have a non-ASCII environment at my disposal for testing, the next best way to investigate was to read the Perl code. I make no claim to an exact understanding of any part of the Perl source, but in this case I believe I was able to come to some useful conclusions.

The relevant source module appears to be regcomp.c, which compiles regular expressions. I dipped into this module in three versions: blead as of January 21 2018 (commit 129da27a6cbce7395f91b30779d708adc609e29c, about line 17546), perl-5.10.0 (about line 8098), and perl-5.8.0 (about line 4096). I did not go any farther back because earlier versions lack support for character sets other than the native one.

Each of these versions of Perl had code that was conditional on the symbol EBCDIC being defined. For Unix-like systems this is defined (or not) by the Configure script. The assumption here is that if the encoding is not some variant or extension of ASCII, it must be EBCDIC. This is false in general, but is probably true if we restrict ourselves to the encodings used by operating systems that will actually run Perl.

Now my memory is that EBCDIC is not a single encoding, but a family of encodings that differ in subtle ways from each other. Google turned up a bunch of EBCDIC code tables, of which I have to consider IBM's authoritative if any of them are. I did not compare them in detail, but all shared the following features:

  • The upper-case alphabetic code points were monotonic (that is, the values of the code points increased as you went forward through the alphabet) but discontiguous, with a gap between "I" and "J", and another between "R" and "S". The gaps contained non-alphabetics. The gap between "R" and "S" is used by Configure to detect EBCDIC.
  • The lower-case alphabetic code points had the same features as the upper-case code points (i.e. monotonic and discontiguous at the same points), and did not overlap them.
  • The numeric code points were consecutive -- that is, both monotonic and contiguous.
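These three properties can be checked from an ASCII machine. The following is only a sketch: it assumes the Encode module's EBCDIC tables (the cp37 and cp1047 encoding names) are an adequate stand-in for the published code tables, and the sub names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the code points of the given characters in the named encoding.
sub ebcdic_ords {
    my ( $encoding, @chars ) = @_;
    return map { ord( encode( $encoding, $_ ) ) } @chars;
}

# True if every code point is greater than the one before it.
sub is_monotonic {
    my @ords = @_;
    for my $i ( 1 .. $#ords ) {
        return 0 if $ords[$i] <= $ords[ $i - 1 ];
    }
    return 1;
}

# Return the characters that are followed by a gap in the encoding.
sub gaps_after {
    my ( $encoding, @chars ) = @_;
    my @ords = ebcdic_ords( $encoding, @chars );
    return map { $chars[$_] }
        grep { $ords[ $_ + 1 ] != $ords[$_] + 1 } 0 .. $#ords - 1;
}

for my $encoding (qw(cp37 cp1047)) {
    printf "%s: upper monotonic=%d gaps after (%s); "
        . "lower monotonic=%d gaps after (%s); digit gaps after (%s)\n",
        $encoding,
        is_monotonic( ebcdic_ords( $encoding, 'A' .. 'Z' ) ),
        join( ',', gaps_after( $encoding, 'A' .. 'Z' ) ),
        is_monotonic( ebcdic_ords( $encoding, 'a' .. 'z' ) ),
        join( ',', gaps_after( $encoding, 'a' .. 'z' ) ),
        join( ',', gaps_after( $encoding, '0' .. '9' ) );
}
```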

Getting back to regcomp.c, I found that in each of the versions I examined, if symbol EBCDIC was defined, there was special-case handling of bracketed character classes [A-Z], [a-z], and sub-ranges thereof. This special-case code iterated over all code points in the range, but included in the compiled character class only those code points that were uppercase (or lowercase) alphabetics as determined by the C language character classification functions/macros. Given this special-case code, it looks to me like Perl's implementation of [A-Z], [a-z], and sub-ranges thereof is portable between ASCII and EBCDIC (and supersets of them) as far back as Perl 5.8.0.
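The idea behind the special-case code can be modeled in Perl on an ASCII machine. This is a sketch of the logic as described above, not a transcription of regcomp.c: it assumes the Encode module's cp1047 table, the sub names are hypothetical, and the ASCII-only checks stand in for the C classification macros.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# ASCII-only stand-ins for the C classification macros (assumes an ASCII host).
sub is_ascii_upper {
    my $ord = ord( $_[0] );
    return $ord >= ord('A') && $ord <= ord('Z');
}

sub is_ascii_lower {
    my $ord = ord( $_[0] );
    return $ord >= ord('a') && $ord <= ord('z');
}

# Walk every code point between the endpoints in the given encoding,
# keeping only those that classify the same way as the endpoints.
sub expand_alpha_range {
    my ( $encoding, $from, $to ) = @_;
    my $keep = is_ascii_upper($from) ? \&is_ascii_upper : \&is_ascii_lower;
    my ( $lo, $hi ) = map { ord( encode( $encoding, $_ ) ) } $from, $to;
    my @members;
    for my $code_point ( $lo .. $hi ) {
        my $octet = chr $code_point;
        my $char  = decode( $encoding, $octet );
        push @members, $char if $keep->($char);
    }
    return @members;
}

print join( '', expand_alpha_range( 'cp1047', 'A', 'Z' ) ), "\n";
print join( '', expand_alpha_range( 'cp1047', 'h', 'k' ) ), "\n";
```

Without the filter, the loop over [A-Z] would also collect whatever non-alphabetic code points sit in the two gaps.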

Without this special-case code, bracketed character classes were considered to consist of all code points between the beginning and the end of the range, inclusive. Interestingly, this means that [0-9] is portable among ASCII, EBCDIC, and any encoding where the code points for the digits are consecutive, even without special-case code.
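To make the contrast concrete, this sketch counts how many code points an unfiltered range would cover in each encoding. It assumes the Encode module's ascii and cp1047 tables, and the sub name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of code points between two characters, inclusive, in an encoding.
sub naive_range_size {
    my ( $encoding, $from, $to ) = @_;
    my ( $lo, $hi ) = map { ord( encode( $encoding, $_ ) ) } $from, $to;
    return $hi - $lo + 1;
}

for my $encoding (qw(ascii cp1047)) {
    printf "%-7s [0-9] covers %d code points, [A-Z] covers %d\n",
        $encoding,
        naive_range_size( $encoding, '0', '9' ),
        naive_range_size( $encoding, 'A', 'Z' );
}
```

The digit range covers exactly ten code points in both encodings, while the unfiltered alphabetic range only covers exactly 26 in ASCII.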

I attempted to test my conclusions about the alphabetics by building Perl 5.8.0, editing configure.sh so that EBCDIC was defined. I failed to get this to link, but I was able to build a normal Perl, with regcomp.c edited to define EBCDIC and to print a debug message when the special-case code was encountered. This printed the message when compiling /[A-Z]/, so it looks like the special-case code is used when I think it is.

All of this seems to say that the statement in perlrecharclass (introduced in Perl 5.21.5, saying that [A-Z], [a-z], and [0-9], and their sub-ranges, are portable among character encodings) is actually true as far back as Perl 5.8.0.

About Tom Wyant

I blog about Perl.