Untrusted Numeric Input -- /[0-9]/
My blog entry of a couple weeks ago,
Untrusted Numeric Input,
dealt mainly with the problem of ensuring that supposedly-numeric input
actually consisted only of ASCII digits. One of the ways to do this was
to use the bracketed character class [0-9] instead of \d. This was
documented as being portable as of Perl 5.21.5,
and I made the statement that "I believe this behavior goes back further
..." This was clearly just hopeful hand-waving, and not very
helpful.
This blog post documents my efforts to determine the versions of Perl
under which [0-9] is portable. For those disinclined to read further,
my conclusion is that [A-Z], [a-z], [0-9], and their sub-ranges are
portable among character sets as far back as Perl 5.8.0.
Since I do not have a non-ASCII environment at my disposal for testing, the next best way to investigate was to read the Perl code. I make no claim to an exact understanding of any part of the Perl source, but in this case I believe I was able to come to some useful conclusions.
The relevant source module appears to be regcomp.c, which compiles
regular expressions. I dipped into this module in blead (as of
January 21 2018, commit 129da27a6cbce7395f91b30779d708adc609e29c,
about line 17546), perl-5.10.0 (about line 8098), and perl-5.8.0
(about line 4096). I did not go any farther back because of earlier
versions' lack of support for character sets other than native.
Each of these versions of Perl had code that was conditionalized on symbol EBCDIC being defined. For Unix-like systems this is defined (or not) by the Configure script. The assumption here is that if the encoding is not some variant or extension of ASCII, it must be EBCDIC. This is false in general, but is probably true if we restrict ourselves to the encodings used by operating systems that will actually run Perl.
Now my memory is that EBCDIC is not a single encoding, but a family of them that differ in subtle ways from each other. Google turned up a bunch of EBCDIC code tables, of which I have to consider IBM's authoritative if any of them is. I did not compare them in detail, but all shared the following features:
- The upper-case alphabetic code points were monotonic (that is,
the values of the code points increased as you went forward
through the alphabet) but discontiguous, with a gap between "I"
and "J", and another between "R" and "S". The gaps contained
non-alphabetics. The gap between "R" and "S" is used by
Configure to detect EBCDIC.
- The lower-case alphabetic code points had the same features as the upper-case code points (i.e. monotonic and discontiguous at the same points), and did not overlap them.
- The numeric code points were consecutive -- that is, both monotonic and contiguous.
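The three properties above can be checked mechanically. The following is a minimal sketch, not part of my original investigation: it uses Python only because its standard library ships the EBCDIC codecs cp037 and cp500, and it assumes those two code pages are representative of the tables I looked at. The helper name gaps is made up for this illustration.

```python
import string

def gaps(chars, codec):
    """Return (prev, next) character pairs whose encoded values are not adjacent.

    Hypothetical helper for this post; `codec` is any single-byte Python codec.
    """
    codes = [c.encode(codec)[0] for c in chars]
    # Monotonic: every code point is greater than the one before it.
    assert all(a < b for a, b in zip(codes, codes[1:]))
    return [(p, n)
            for p, n, a, b in zip(chars, chars[1:], codes, codes[1:])
            if b - a != 1]

for codec in ("ascii", "cp037", "cp500"):
    print(codec,
          gaps(string.ascii_uppercase, codec),
          gaps(string.ascii_lowercase, codec),
          gaps(string.digits, codec))
```

For ASCII all three lists come back empty. For the two EBCDIC code pages the letter lists show breaks at I/J and R/S (and i/j, r/s), while the digit list is empty.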
Getting back to regcomp.c, I found that in each of the versions I
examined, if symbol EBCDIC was defined, there was special-case
handling of bracketed character classes [A-Z], [a-z], and sub-ranges
thereof. This special-case code iterated over all code points in the
range, but included in the compiled character class only those code
points that were uppercase (or lowercase) alphabetics as determined by
the C language character classification functions/macros. Given this
special-case code, it looks to me like Perl's implementation of [A-Z],
[a-z], and sub-ranges thereof is portable between ASCII and EBCDIC
(and supersets of them) as far back as Perl 5.8.0.
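To make the idea concrete, here is a rough sketch of that filtering logic in Python. It is emphatically not the code in regcomp.c: the function name compile_alpha_range is invented, the cp037 codec stands in for a real EBCDIC platform, and membership in an explicit set of ASCII letters stands in for the C classification macros.

```python
import string

def compile_alpha_range(first, last, codec, letters):
    """Code points from `first` to `last` (inclusive) that decode to a letter.

    Hypothetical stand-in for the EBCDIC special case: walk every code
    point in the range, but keep only those that are in `letters`.
    """
    lo, hi = first.encode(codec)[0], last.encode(codec)[0]
    return [cp for cp in range(lo, hi + 1)
            if bytes([cp]).decode(codec) in letters]

upper = compile_alpha_range("A", "Z", "cp037", string.ascii_uppercase)
print(len(upper))                    # number of code points kept
print(bytes(upper).decode("cp037"))  # the characters they represent

# A sub-range that straddles the I/J gap keeps only the four letters.
print(bytes(compile_alpha_range("H", "K", "cp037",
                                string.ascii_uppercase)).decode("cp037"))
```

Under these assumptions the filtered class for A through Z contains exactly the 26 upper-case letters, even though the code points between "A" and "Z" in cp037 span more than 26 values.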
Without this special-case code, bracketed character classes were
considered to consist of all code points between the beginning and the
end of the range, inclusive. Interestingly, this means that [0-9] is
portable among ASCII, EBCDIC, and any encoding where the code points
for the digits are consecutive, even without special-case code.
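The difference between digits and letters under a plain inclusive range can be illustrated with a few lines of Python. This is a sketch under the same assumptions as before: the cp037 codec represents EBCDIC, and the name naive_range is made up for this post.

```python
def naive_range(first, last, codec):
    """All characters whose code points lie between `first` and `last`, inclusive.

    Hypothetical model of a range compiled with no special-case handling.
    """
    lo, hi = first.encode(codec)[0], last.encode(codec)[0]
    return bytes(range(lo, hi + 1)).decode(codec)

for codec in ("ascii", "cp037"):
    digits = naive_range("0", "9", codec)
    letters = naive_range("A", "Z", codec)
    print(codec, digits, len(letters))
```

In both encodings the digit range yields exactly the ten digits. The letter range yields 26 characters in ASCII but 41 in cp037, which is why the letters need the special-case code and the digits do not.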
I attempted to test my conclusions about the alphabetics by building
Perl 5.8.0, editing configure.sh so that EBCDIC was defined. I failed
to get this to link, but I was able to build a normal Perl, with
regcomp.c edited to define EBCDIC and to print a debug message when
the special-case code was encountered. This printed the message when
compiling /[A-Z]/, so it looks like the special-case code is used when
I think it is.
All of this seems to say that the statement in perlrecharclass
(introduced in Perl 5.21.5, saying that [A-Z], [a-z], and [0-9], and
their sub-ranges, are portable among character encodings) is actually
true as far back as Perl 5.8.0.