Use STRLEN not int for SvPV

Obscure bugs occur with the following type of code:

 unsigned int len;
 c = SvPV (sv, len);

The bugs typically occur on 64-bit systems. They happen because unsigned int may be a 32-bit integer, but the second argument to SvPV should be a STRLEN, which perl defines in terms of size_t and which is usually a 64-bit type on such systems. SvPV is a macro, and it can end up passing the address of len to a function which expects a pointer to STRLEN. Writing a 64-bit value through a pointer to a 32-bit integer causes some very odd bugs, and may even crash the interpreter. So one always has to do it like this:

 STRLEN len;
 c = SvPV (sv, len);

and never use anything which is not of type STRLEN.
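The size mismatch can be illustrated without the Perl headers. The following is only a sketch: get_pv is a hypothetical stand-in for the function behind SvPV, and size_t plays the role of STRLEN. Neither name comes from the Perl API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the function behind SvPV: it returns the
   string and stores the length through a pointer to size_t. */
const char *
get_pv (const char *s, size_t *len_out)
{
    assert (s != NULL && len_out != NULL);
    *len_out = strlen (s);      /* writes sizeof (size_t) bytes */
    return s;
}

/* Correct caller: the length variable has exactly the type written to. */
size_t
pv_length (const char *s)
{
    size_t len;                 /* plays the role of "STRLEN len;" */
    (void) get_pv (s, &len);
    return len;
}

/* How many bytes get_pv would write past the end of an unsigned int
   if its address were passed instead. This is 0 where the two types
   have the same size, which is why the bug can go unnoticed. */
size_t
overrun_bytes (void)
{
    return sizeof (size_t) > sizeof (unsigned int)
        ? sizeof (size_t) - sizeof (unsigned int)
        : 0;
}
```

On a system where unsigned int is 32 bits and size_t is 64 bits, overrun_bytes returns 4: those four bytes belong to whatever happens to sit next to the variable on the stack.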

I have a collection of more weird and wonderful XS bugs, found through CPAN testers, here:

Despite having known about this for years, I just found another instance in my own module, thanks to the warning messages from clang, in Text::Fuzzy:

I've just now updated it:

Perhaps it would be worth making some kind of automated checker to go through XS code and make sure the second argument to SvPV is always a STRLEN.

Always use `const char *` to refer to the return value from SvPV

Always use const char * to refer to the return value from SvPV.

Yesterday I got a bug report from a user via Github about Text::Fuzzy.

The report said that in some cases, when the user computed an edit distance with Unicode strings, the user's input value, $string in the following, seemed to be overwritten and corrupted:

$tf->distance ($string);

I couldn't reproduce the user's bug using the script he supplied, but just in case, I went through the code and tried to find anywhere that a string might be overwritten, by adding const in front of every char * pointer which was used to store a Perl string.

This led me to this line where the value corresponding to $string in the above is read using SvPV, and this line where the value pointed to is overwritten by the code. This is a special case which only executes when the user matches a byte string against a character string.

As a fix for the bug, I changed the code to use allocated memory after the test for Unicode, and added a field, allocated, to tf->b, which is set to true or false so that the allocated memory can be freed. In a later commit I also added a test that the bug was fixed.

However, it would have been better if I had never assigned the return value from SvPV to a char * but had always used a const char *.
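The general pattern can be sketched in plain C as follows. This is not the code from Text::Fuzzy: pv_holder_t and its functions are hypothetical names made up for the illustration, and the string passed in stands for whatever SvPV returned. The read-only view is a const char *, so any attempt to write through it is rejected by the compiler, and a writable buffer exists only as an allocated copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a string borrowed from Perl. */
typedef struct {
    const char *text;   /* read-only view of the current string */
    char *copy;         /* our own buffer, valid only if allocated */
    size_t length;
    int allocated;      /* true if copy must be freed */
} pv_holder_t;

void
pv_holder_set (pv_holder_t *h, const char *s, size_t len)
{
    assert (h != NULL && s != NULL);
    h->text = s;
    h->copy = NULL;
    h->length = len;
    h->allocated = 0;
}

/* Return a buffer which is safe to modify, copying the borrowed
   string on the first call. Returns NULL if allocation fails. */
char *
pv_holder_writable (pv_holder_t *h)
{
    if (! h->allocated) {
        char *buf = malloc (h->length + 1);
        if (! buf) {
            return NULL;
        }
        memcpy (buf, h->text, h->length);
        buf[h->length] = '\0';
        h->copy = buf;
        h->text = buf;
        h->allocated = 1;
    }
    return h->copy;
}

void
pv_holder_free (pv_holder_t *h)
{
    if (h->allocated) {
        free (h->copy);
        h->copy = NULL;
        h->allocated = 0;
    }
    h->text = NULL;
}
```

With this layout, a line like h->text[0] = 'X' does not compile, so the only way to modify the string is through pv_holder_writable, which never touches the caller's memory.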

According to Ken Thompson,

Const only confuses library interfaces with the hope of catching some rare errors

(source) but I'm not sure I agree with him.

The not-so-great escape

Escaping HTML is the process of converting a user's input into something which can be displayed back to the user in a web browser, for example in a comment section on a blog, or on a wiki editable by users.

Given user input such as <script>, an HTML escaper must output &lt;script&gt; for it to be displayed correctly. The browser then shows this as the text <script> rather than treating it as an actual HTML script tag.
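A minimal sketch of such an escaper in C follows. The function name html_escape is made up for this illustration. It handles only <, > and &, and it escapes every & unconditionally, which is an assumption about policy rather than a recommendation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated, escaped copy of "in", or NULL if the
   allocation fails. The caller must free the result. */
char *
html_escape (const char *in)
{
    size_t n = 0;
    const char *p;
    char *out;
    char *q;

    assert (in != NULL);
    /* First pass: work out how long the output will be. */
    for (p = in; *p; p++) {
        switch (*p) {
        case '<':
        case '>':
            n += 4;             /* "&lt;" or "&gt;" */
            break;
        case '&':
            n += 5;             /* "&amp;" */
            break;
        default:
            n += 1;
        }
    }
    out = malloc (n + 1);
    if (! out) {
        return NULL;
    }
    /* Second pass: copy, replacing the special characters. */
    q = out;
    for (p = in; *p; p++) {
        switch (*p) {
        case '<':
            memcpy (q, "&lt;", 4);
            q += 4;
            break;
        case '>':
            memcpy (q, "&gt;", 4);
            q += 4;
            break;
        case '&':
            memcpy (q, "&amp;", 5);
            q += 5;
            break;
        default:
            *q++ = *p;
        }
    }
    *q = '\0';
    return out;
}
```

Calling html_escape ("<script>") gives back the string &lt;script&gt;, which is what the browser needs in order to show the original characters as text.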


But supposing the user inputs &lt;script&gt;, what should be done with it?

\d does not validate numbers

points us to this Perl FAQ:

Unfortunately, the regular expression part of the above FAQ page is wrong. \d doesn't validate numbers, unless you have already verified that your input contains only ASCII characters.
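For comparison, here is what an ASCII-only check looks like in C, the language of the XS examples above. This is only a sketch, and is_ascii_digits is a made-up name. It accepts nothing but the bytes '0' to '9', so a non-ASCII digit encoded in UTF-8 is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the "len" bytes at "s" are all ASCII digits and there
   is at least one of them, 0 otherwise. */
int
is_ascii_digits (const char *s, size_t len)
{
    size_t i;

    assert (s != NULL);
    if (len == 0) {
        return 0;
    }
    for (i = 0; i < len; i++) {
        /* Bytes of multi-byte UTF-8 sequences are outside this
           range whether char is signed or unsigned. */
        if (s[i] < '0' || s[i] > '9') {
            return 0;
        }
    }
    return 1;
}
```

For example, the two bytes 0xDF 0x82, which are the UTF-8 encoding of U+07C2, fail this test, whereas the string 2024 passes.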

What \d does is test whether a character is regarded as a digit in Unicode. For example, \d will happily match things like U+07C2: '߂' NKO …