Be generous, but not very

This is another blog post based on my experiences of validating user inputs. Here is the previous one, where I argue against the use of "eval".

"Be generous in what you accept" says the adage. I want to argue that generosity works sometimes, but it's all too easy to be generous to a fault, and cheerfully responding to nonsense input is a losing battle. The problem with being too generous is that you end up correcting or even rejecting the slightly poor inputs in a flawed quest to make sense of complete drivel.

Let's take the case of processing English numbers in words as an example, because I have a web site which does this. It's quite easy to make a number parser which turns well-formed English words like "twenty-one" into numerals like "21", and a lot of people started out in programming doing simple exercises like that, in BASIC or something. The next step is converting somewhat broken but basically understandable inputs like "twentyone" or "fourty two" or "a milion" into numerals. These are clearly numbers all right, just typed by someone whose spelling or punctuation is a bit shaky. So because we're using Perl, we shake a few regular expressions at the problem, like matching fou?rty or \b(b|m|tr)ill?ions?\b or something.
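
To make that concrete, here is a minimal sketch of the kind of fix-up pass I mean. It is not the code the site actually runs; the subroutine name normalize_number_words and the third substitution (splitting run-together words like "twentyone") are assumptions made up for this illustration. Only the first two patterns come from the paragraph above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy up a few common, fairly unambiguous slips
# before handing the text to the real number parser.
sub normalize_number_words {
    my ($text) = @_;
    $text = lc $text;
    # "fourty" -> "forty"
    $text =~ s/\bfou?rty/forty/g;
    # "milion", "billions", "trilion" -> "million", "billion", "trillion"
    $text =~ s/\b(b|m|tr)ill?ions?\b/$1illion/g;
    # "twentyone" -> "twenty one": split a tens word from letters run onto it
    $text =~ s/\b(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)(?=[a-z])/$1 /g;
    return $text;
}

print normalize_number_words($_), "\n"
    for 'twentyone', 'fourty two', 'a milion';
```

Each rule here only touches things which can hardly be anything other than a misspelt number word, which is the level of generosity that seems to be safe.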

So far so good, but what happens when we descend into the maelstrom of trying to make sense out of absolutely any kind of input? Here are some genuine examples of rejected inputs to the above-mentioned number converter, randomly chosen:

what
thank you very much master
3'000,000
ichiro yamada
Kaku gidai ni mōke rareta mojisū no rūru ni shitagatte bunshō sakusei o okonatte kudasai
8 mili
1000 cherry trees
down
201; 202; 203; 204
diamond
thank
X767
one times ten to the fourty-eighth
ga be fuo n su
September 28, 2008
create account
3-Nen-me
forth
deb
6\30\2007

The problem you get when you try to be "generous" to these kinds of inputs is that you end up wrecking the mechanisms you made to fix up the moderately bad stuff. Here's an example: because it's a Western to Japanese converter, one common thing I get is people adding "en" or "yen" to the end of numbers. What happens when you try to allow for this? The problem is that "en" collides with "ten", so if I allow "a million en", then the input "twentyten" gets misinterpreted as "twentyt en", and the user who typed it gets fobbed off with an error message. This is the big problem with Lingua::EN::Numericalize, which turns "america" into "americ1" in its desperate quest to convert everything into numbers, and it's why I had to stop using it.
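
The "en" collision is easy to reproduce. The following is only a sketch with made-up subroutine names, not the converter's real code. The first routine strips a trailing "en" or "yen" wherever it finds one; the second, which is just one possible compromise and an assumption of mine, strips it only when it stands as a separate word or directly follows a digit.

```perl
use strict;
use warnings;

# Too generous: strips "en" or "yen" from the end of anything at all.
sub strip_yen_greedy {
    my ($text) = @_;
    $text =~ s/\s*y?en$//;
    return $text;
}

# More cautious: strips "en" or "yen" only after whitespace or a digit.
sub strip_yen_cautious {
    my ($text) = @_;
    $text =~ s/(?:\s+|(?<=\d))y?en$//;
    return $text;
}

for my $input ('a million en', '500yen', 'twentyten') {
    printf "%-14s greedy: %-10s cautious: %s\n", $input,
        strip_yen_greedy($input), strip_yen_cautious($input);
}
```

The greedy version turns "twentyten" into "twentyt", which is exactly the wreckage described above. The cautious version leaves it alone, at the price of no longer helping someone who types "a millionen".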

Here are some 100% genuine inputs which I do accept, again randomly chosen:

thousend
zero , one
thirty-one
One thousand two hundred thirty four
573.4 million
one thousand amd sixty eight
FORTY SEVEN
1 billion, 51 million 855 thousand
Ninety-nine millions, nine hundred and ninety-nine thousands, nine hundred and ninety-nine
two thousand and eighteen
One thousand one hundred eleven
Eighty five thousand four hundred fifty five point eight
1.89 million
397.2 billion
12.5 million
50million300thiusand
seven bilion seven hundred thirteen million two thousand eightynine
one billion nine millions and thirty three
One hundred and twenty-three
5.6 billion

I also encountered this problem with the CPAN module JSON::Repair. I initially made the module hoping that it would be able to turn all kinds of things, like HJSON, into pure JSON, but I found that if I made it try too hard to repair one kind of thing, that would end up breaking another part of the repairs. The only solution was to restrict it to a modest number of repairs of fairly unambiguous stuff. There is a little bit of documentation about that here.

There is probably some kind of information-theoretic statement of this conclusion in terms of error-correcting codes and redundancy and measures of spaces over spaces and whatnot, but I'm not completely sure what it is. But it's interesting that over-generous attempts to accept user inputs end up throwing the baby out with the bathwater, because it isn't something you find out until you have a big collection of user inputs, both valid and invalid, to test your program with.

Don't use something or another

Seems like a lot of people are keen on telling us not to use CGI.pm, but rather use something else. These discussions seem to verge on religious fervour, with each side finding small problems with CGI.pm or its alternatives, and then telling us that these small problems are actually the end of the world.

I don't use CGI.pm, I haven't used it for at least ten years, and I'm not about to defend it, but since we're all telling people not to use something, I thought I would chip in with something which I don't think you should use.

Since about 2006 I've been running a web site which offers to convert Japanese numbers into other kinds of numbers, and vice versa. For most of those years until relatively recently I was using Lingua::JA::Numbers by Dan Kogai. Dan Kogai's module converts the numbers by changing the Japanese numbers into digits, then sending the digits into an "eval" statement to compute the numerical value:

https://metacpan.org/source/DANKOGAI/Lingua-JA-Numbers-0.05/lib/Lingua/JA/Numbers.pm#L375

I'd like to argue that the "eval" statement is impossible to use correctly even for this limited case, based on about twelve years of nearly-endless bugs.

The first problem is that to make sure that this eval statement works correctly, one has to validate the input sufficiently. The second problem is that, for whatever reason, people go to a web site which promises to convert Japanese numbers into Western numbers, and they type in their names, or addresses, or other random things. I recently computed the statistics for the site, and about twenty percent of the inputs over the last eight years (I don't have logs for the first four years) were just random characters or nonsense inputs. So before trying to convert these numbers, I had to first of all validate that they were numbers, and not someone's name or random ASCII or something.

Although this validation sounds like a relatively simple task on the face of it, no matter what code I wrote to validate the numbers, someone would input some random thing which passed all of my validation tests, and break it. The final straw was some nonsense input which actually looked like "decimal point" "ten to the power twenty", and caused yet more errors.

Finally I came to this conclusion: I just don't think it's possible to validate the input fully before sending it to an eval statement without actually doing the entire computation, which makes the eval statement completely redundant. So I suggest that if you're making a module and you think "eval" might be a good trick to do something, you might want to think again.
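
To illustrate what "doing the entire computation" looks like, here is a minimal sketch which works out the value of a kanji number with plain arithmetic and no "eval". It is not the code of Lingua::JA::Numbers or of my site; the name kanji_to_number and the small tables of characters are assumptions for this example, and it deliberately handles only a handful of numerals.

```perl
use strict;
use warnings;
use utf8;

my %digit = ('一' => 1, '二' => 2, '三' => 3, '四' => 4, '五' => 5,
             '六' => 6, '七' => 7, '八' => 8, '九' => 9);
my %small = ('十' => 10, '百' => 100, '千' => 1000);
my %big   = ('万' => 10**4, '億' => 10**8);

# Hypothetical routine: returns the value of the number, or undef if
# the text is empty or contains any character outside the tables above.
sub kanji_to_number {
    my ($text) = @_;
    return unless defined $text && length $text;
    my ($total, $group, $pending) = (0, 0, 0);
    for my $char (split //, $text) {
        if (exists $digit{$char}) {
            $pending = $digit{$char};
        }
        elsif (exists $small{$char}) {
            # A bare 十, 百 or 千 counts as one ten, hundred or thousand.
            $group += ($pending || 1) * $small{$char};
            $pending = 0;
        }
        elsif (exists $big{$char}) {
            $total += ($group + $pending) * $big{$char};
            ($group, $pending) = (0, 0);
        }
        else {
            # Names, addresses and other non-numbers end up here.
            return;
        }
    }
    return $total + $group + $pending;
}
```

With this, kanji_to_number('二千十八') gives 2018, and a name typed in by mistake gives undef. The sketch does not check that the characters come in a sensible order, so something like 十十 still gets a value; catching that means adding more checks to this same loop, which is the point: the validation and the computation are the same work.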

Testing coverage of SvTRUE

Here is the coverage of Faster.xs in the module Gzip::Faster:

http://cpancover.com/latest/Gzip-Faster-0.21/Faster-xs--branch.html

It suggests that most of these statements aren't tested. But actually they are, more or less.

What seems to be happening is that SvTRUE is a macro with about five or six different tests:

https://github.com/Perl/perl5/blob/blead/sv.h#L1761

and so to get "coverage" of all the SvTRUEs here I'd need to send a string, an integer, a floating-point number, and so on. But would that actually tell me anything about Gzip::Faster? In fact it wouldn't. It would just be testing whether SvTRUE from Perl's core was working correctly or not.

Improving coverage with metacpan

I've been reviewing the coverage of the tests of modules using metacpan.org.

It is pretty handy for finding stuff which is not tested.

I tried it on this module:

https://metacpan.org/release/Directory-Diff

I noticed that a lot of the code had no tests, and there were also some completely unused subroutines.

coverage-before.png

I removed some of the unused subroutines and wrote some tests for the remaining things, and was able to improve the coverage:

coverage-after.png

This seems quite handy.

I'm fully covered

There seems to be a new feature of showing the coverage in metacpan. Fortunately I have already achieved 100% coverage here:

https://metacpan.org/release/Acme-Include-Data

Some other places are not so fortunate.

http://cpancover.com/latest/JSON-Repair-0.07/blib-lib-JSON-Repair-pm--subroutine.html

I only have 33% coverage in my POD.

The thing is though that I deliberately didn't document these private routines.

I looked in vain through what documentation I could find for this coverage system

https://metacpan.org/pod/Pod::Coverage

to find out how to tell it that these routines were never meant to be documented, but couldn't find anything except for a regular expression which ignores routines with a leading underscore.

Can anyone tell me how to tell the coverage meter to not measure these routines?