Be generous, but not very
This is another blog post based on my experiences of validating user inputs. Here is the previous one, where I argue against the use of "eval".
"Be generous in what you accept" says the adage. I want to argue that generosity works sometimes, but it's all too easy to be generous to a fault, and cheerfully responding to nonsense input is a losing battle. The problem with being too generous is that you end up wrongly correcting or even rejecting the slightly poor inputs in a flawed quest to make sense of complete drivel.
Let's take the case of processing English numbers in words as an example, because I have a web site which does this. It's quite easy to make a number parser which parses well-formed English words like "twenty-one" into numerals, "21", and a lot of people started out in programming doing simple exercises like that, in BASIC or something. The next step is converting somewhat broken but basically understandable inputs like "twentyone" or "fourty two" or "a milion" into numerals. These are clearly numbers alright, so it's just a case of someone whose spelling or punctuation is a bit shaky. So because we're using Perl, we shake a few regular expressions at the problem, \b(b|m|tr)ill?ions?\b or something.
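As a rough sketch of that kind of clean-up, here is roughly what "shaking a few regular expressions at the problem" might look like. This is not the code behind the web site: the function name tidy_number_words and the exact set of rules are made up for illustration, and only the million/billion/trillion pattern comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy up mildly broken number words before parsing.
sub tidy_number_words {
    my ($text) = @_;
    $text = lc $text;
    # "milion", "millions", "bilion" and so on become the canonical singular.
    $text =~ s/\b(b|m|tr)ill?ions?\b/$1illion/g;
    # A common misspelling of "forty".
    $text =~ s/\bfourty\b/forty/g;
    # Split run-together tens and units: "twentyone" becomes "twenty one".
    $text =~ s/\b(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)(one|two|three|four|five|six|seven|eight|nine)\b/$1 $2/g;
    return $text;
}

print tidy_number_words("a milion"), "\n";      # a million
print tidy_number_words("fourty two"), "\n";    # forty two
print tidy_number_words("twentyone"), "\n";     # twenty one
```

Each rule here only touches input that is already unmistakably a number word, which is what keeps this stage fairly safe.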
So far so good, but what happens when we descend into the maelstrom of trying to make sense out of absolutely any kind of input? Here are some genuine examples of rejected inputs to the above-mentioned number converter, randomly chosen:
what thank you very much master 3'000,000 ichiro yamada Kaku gidai ni mōke rareta mojisū no rūru ni shitagatte bunshō sakusei o okonatte kudasai 8 mili 1000 cherry trees down 201; 202; 203; 204 diamond thank X767 one times ten to the fourty-eighth ga be fuo n su September 28, 2008 create account 3-Nen-me forth deb 6\30\2007
The problem you get when you try to be "generous" to these kinds of inputs is that you end up wrecking the mechanisms you made to fix up the moderately bad stuff. Here's an example: because it's a Western to Japanese converter, one common type of thing I get is people adding "en" or "yen" to the end of numbers. What happens when you try to allow for this? The problem is that "en" collides with "ten", so if I allow "a million en", then the user who types "twentyten" gets that misinterpreted as "twentyt en", and then gets muffed with an error message. This is the big problem with Lingua::EN::Numericalize, where it turns "america" into "americ1" in its desperate quest to convert everything into numbers, and it's why I had to stop using that.
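The "en" collision is easy to reproduce. The following is only a sketch under assumed rules, not the converter's real code: both function names are hypothetical, and the first rule is deliberately over-generous to show how it damages an input that ought to have been recoverable.

```perl
use strict;
use warnings;

# Hypothetical over-generous rule: strip "en" or "yen" from the end of the
# input, whether or not it stands as a separate word.
sub strip_yen_greedy {
    my ($text) = @_;
    $text =~ s/\s*y?en$//i;
    return $text;
}

# Hypothetical stricter rule: only strip "en" or "yen" when whitespace
# separates it from what comes before.
sub strip_yen_word {
    my ($text) = @_;
    $text =~ s/\s+y?en$//i;
    return $text;
}

print strip_yen_greedy("a million en"), "\n";   # a million
print strip_yen_greedy("twentyten"), "\n";      # twentyt
print strip_yen_word("a million en"), "\n";     # a million
print strip_yen_word("twentyten"), "\n";        # twentyten
```

The stricter rule gives up on inputs like "1000en" with no space, which is the trade-off: a little less generosity in exchange for not mangling "twentyten".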
Here are some 100% genuine inputs which I do accept, again randomly chosen:
thousend zero , one thirty-one One thousand two hundred thirty four 573.4 million one thousand amd sixty eight FORTY SEVEN 1 billion, 51 million 855 thousand Ninety-nine millions, nine hundred and ninety-nine thousands, nine hundred and ninety-nine two thousand and eighteen One thousand one hundred eleven Eighty five thousand four hundred fifty five point eight 1.89 million 397.2 billion 12.5 million 50million300thiusand seven bilion seven hundred thirteen million two thousand eightynine one billion nine millions and thirty three One hundred and twenty-three 5.6 billion
I also encountered this problem with the CPAN module JSON::Repair. I made the module initially with the hope that it would be able to process all kinds of things, like HJSON, into pure JSON, but I found that if I made the repairs for one kind of problem too aggressive, they would end up breaking the repairs for another. The only solution was to set it up to do only a modest number of repairs, to fairly unambiguous stuff. There is a little bit of documentation about that here.
There is probably some kind of information-theoretic statement of this conclusion in terms of error-correcting codes and redundancy and measures of spaces over spaces and whatnot, but I'm not completely sure what it is. But it's interesting that over-generous attempts to accept user inputs end up throwing the baby out with the bathwater, because it isn't something you find out until you have a big collection of user inputs, both valid and invalid, to test your program with.
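One way to use such a collection is to run every candidate rule over the inputs that are already accepted and see which ones it would change. The sketch below assumes a tiny corpus taken from the accepted examples above and a made-up "generous" rule (dropping all punctuation so that something like "3'000,000" would parse); neither the rule nor the function names are from the actual converter.

```perl
use strict;
use warnings;

# A few inputs from the accepted examples; a real corpus would be far larger.
my @known_good = ('1.89 million', 'FORTY SEVEN', '12.5 million');

# Hypothetical "generous" rule: drop all punctuation.
sub drop_punctuation {
    my ($text) = @_;
    $text =~ s/[^\w\s]//g;
    return $text;
}

# Return every known-good input that a candidate rule would alter.
sub regressions {
    my ($rule, @inputs) = @_;
    return grep { $rule->($_) ne $_ } @inputs;
}

for my $input (regressions(\&drop_punctuation, @known_good)) {
    print "$input -> ", drop_punctuation($input), "\n";
}
# 1.89 million -> 189 million
# 12.5 million -> 125 million
```

The rule rescues one piece of junk and silently corrupts two good inputs, which is only visible because the good inputs were kept around to test against.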
You might want to try Text::Levenshtein or similar. Search MetaCPAN for Levenshtein.
I was just thinking the other day that I hadn't seen you for a while here.
Those Levenshtein things are OK for approximate work, but they tend to give lots of false positives. I actually did extensive work on spelling correction for another related web page which converts English words into Japanese forms, but I found that the Levenshtein corrections, even with a maximum distance of one, tend to throw up so many false positives that I'm not sure it's worth the bother. This is partly because of the amount of nonsense that people type in, and partly because it tends to do things like correct "Alisdair" to "Alistair", thus annoying all the "Alisdairs" out there. Similarly with lots of "Cherril" or "Candyce" type names. Since I actually encourage people to type their names in, I think it would get annoying if I kept saying "do you mean 'Candice'?". I didn't put the English correction discussion into this blog post since I thought the post would get too long, but perhaps I should prepare part 2.
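To make the distance-of-one problem concrete, here is a plain dynamic-programming Levenshtein distance. It is a self-contained stand-in written for illustration, not the Text::Levenshtein implementation, and the word pairs are just examples picked from this discussion and from the rejected inputs in the post.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein('Alisdair', 'Alistair'), "\n";   # 1
print levenshtein('Candyce', 'Candice'), "\n";     # 1
print levenshtein('forth', 'forty'), "\n";         # 1
print levenshtein('forth', 'fourth'), "\n";        # 1
```

A legitimate name sits one edit away from a "correction", and the rejected input "forth" sits one edit away from two different number words, so even the tightest threshold cannot tell which fix, if any, was intended.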
Anyway, actually I hand-crafted error correction based on the old inputs. Lots of "amd" for "and", etc., so I just coded that as a hand correction, like s/\bamd\b/and/, and about a thousand different ways to spell "thousand" and "million/billion". Then a few days after that I got "bollion" for "billion" which I'd never seen before.
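A hand-built correction list of this kind can be sketched as a lookup table. Note the swap: the comment above describes individual substitutions like s/\bamd\b/and/, whereas this sketch uses a hash for brevity. The table contents are a small assumed sample based on misspellings that appear in the accepted inputs, and the name hand_correct is hypothetical.

```perl
use strict;
use warnings;

# Assumed sample of hand-collected misspellings; a real list would be much longer.
my %fix = (
    amd      => 'and',
    thousend => 'thousand',
    thiusand => 'thousand',
    bilion   => 'billion',
);

# Replace each whole lower-case word that appears in the table; leave
# everything else alone.
sub hand_correct {
    my ($text) = @_;
    $text =~ s/\b([a-z]+)\b/$fix{$1} \/\/ $1/ge;
    return $text;
}

print hand_correct("one thousand amd sixty eight"), "\n";   # one thousand and sixty eight
print hand_correct("seven bilion"), "\n";                   # seven billion
print hand_correct("one bollion"), "\n";                    # one bollion
```

The last line shows the limit of the approach: a misspelling that has never been seen before passes through uncorrected until someone adds it to the table.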