Be generous, but not very

By Ben Bullock on November 22, 2018 1:56 PM

This is another blog post based on my experiences of validating user inputs. Here is the previous one, where I argue against the use of "eval".

"Be generous in what you accept," says the adage. I want to argue that generosity works sometimes, but it's all too easy to be generous to a fault, and trying to respond cheerfully to nonsense input is a losing battle. The problem with being too generous is that you end up miscorrecting or even rejecting the slightly poor inputs in a flawed quest to make sense of complete drivel.

Let's take the case of processing English numbers in words as an example, because I have a web site which does this. It's quite easy to make a number parser which turns well-formed English words like "twenty-one" into numerals like "21", and a lot of people started out in programming doing simple exercises like that, in BASIC or something. The next step is converting somewhat broken but basically understandable inputs like "twentyone" or "fourty two" or "a milion" into numerals. These are clearly numbers all right, so it's just a case of someone whose spelling or punctuation is a bit shaky. Since we're using Perl, we shake a few regular expressions at the problem, like matching fou?rty or \b(b|m|tr)ill?ions?\b or something.
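As a rough illustration of that kind of light-touch repair, here is a minimal sketch. The function name normalize_number_words and the rule for splitting run-together tens are assumptions made for this example, not the code the web site actually runs; only the two patterns fou?rty and \b(b|m|tr)ill?ions?\b come from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy up mildly broken number words before parsing.
sub normalize_number_words {
    my ($text) = @_;
    $text = lc $text;
    # "fourty" -> "forty"
    $text =~ s/\bfou?rty/forty/g;
    # "milion", "millions", "bilions" -> "million", "billion", "trillion"
    $text =~ s/\b(b|m|tr)ill?ions?\b/${1}illion/g;
    # Assumed extra rule: "twentyone" -> "twenty one"
    $text =~ s/\b(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)(?=[a-z])/$1 /g;
    return $text;
}

for my $input ("twentyone", "fourty two", "a milion") {
    print "$input => ", normalize_number_words($input), "\n";
}
# twentyone => twenty one
# fourty two => forty two
# a milion => a million
```

Each rule only touches text that is already recognisably a number word, which is what keeps this stage fairly safe.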

So far so good, but what happens when we descend into the maelstrom of trying to make sense out of absolutely any kind of input? Here are some genuine examples of rejected inputs to the above-mentioned number converter, randomly chosen:

what
thank you very much master
3'000,000
ichiro yamada
Kaku gidai ni mōke rareta mojisū no rūru ni shitagatte bunshō sakusei o okonatte kudasai
8 mili
1000 cherry trees
down
201; 202; 203; 204
diamond
thank
X767
one times ten to the fourty-eighth
ga be fuo n su
September 28, 2008
create account
3-Nen-me
forth
deb
6\30\2007

The problem you get when you try to be "generous" to these kinds of inputs is that you end up wrecking the mechanisms you made to fix up the moderately bad stuff. Here's an example: because it's a Western-to-Japanese number converter, one common thing I get is people adding "en" or "yen" to the end of numbers. What happens when you try to allow for this? The problem is that "en" collides with "ten", so if I allow "a million en", then the input of the user who types "twentyten" gets misinterpreted as "twentyt en", and that user gets muffed with an error message. This is the big problem with Lingua::EN::Numericalize, which turns "america" into "americ1" in its desperate quest to convert everything into numbers, and it's why I had to stop using it.
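The "en" versus "ten" collision is easy to reproduce. The sketch below is only an illustration: both function names are made up for this example, and the "modest" variant is one possible way to narrow the rule, not necessarily what the converter does.

```perl
use strict;
use warnings;

# Over-generous rule: strip a trailing "en" or "yen" wherever it appears,
# even when it is glued to the preceding letters.
sub strip_yen_greedy {
    my ($text) = @_;
    $text =~ s/\s*(?:yen|en)$//;
    return $text;
}

# More modest rule (an assumption): only strip it when it is a separate word.
sub strip_yen_modest {
    my ($text) = @_;
    $text =~ s/\s+(?:yen|en)$//;
    return $text;
}

print strip_yen_greedy("a million en"), "\n";  # a million
print strip_yen_greedy("twentyten"), "\n";     # twentyt
print strip_yen_modest("a million en"), "\n";  # a million
print strip_yen_modest("twentyten"), "\n";     # twentyten
```

The modest rule gives up on inputs where "en" is attached directly to the number, which is the price of leaving "twentyten" intact.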

Here are some 100% genuine inputs which I do accept, again randomly chosen:

thousend
zero , one
thirty-one
One thousand two hundred thirty four
573.4 million
one thousand amd sixty eight
FORTY SEVEN
1 billion, 51 million 855 thousand
Ninety-nine millions, nine hundred and ninety-nine thousands, nine hundred and ninety-nine
two thousand and eighteen
One thousand one hundred eleven
Eighty five thousand four hundred fifty five point eight
1.89 million
397.2 billion
12.5 million
50million300thiusand
seven bilion seven hundred thirteen million two thousand eightynine
one billion nine millions and thirty three
One hundred and twenty-three
5.6 billion

I also encountered this problem with the CPAN module JSON::Repair. I initially made the module in the hope that it would be able to turn all kinds of things, like HJSON, into pure JSON, but I found that if I made the repairs for one kind of thing too aggressive, they ended up breaking another part of the repairs. The only solution was to limit it to a modest number of repairs to fairly unambiguous stuff. There is a little bit of documentation about that here.

There is probably some kind of information-theoretic statement of this conclusion in terms of error-correcting codes and redundancy and measures of spaces over spaces and whatnot, but I'm not completely sure what it is. But it's interesting that over-generous attempts to accept user inputs end up throwing the baby out with the bathwater, because it isn't something you find out until you have a big collection of user inputs, both valid and invalid, to test your program with.
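One way to make use of such a collection is to keep it as a regression corpus and run every candidate rule change against both halves. The following is a minimal sketch under stated assumptions: check_corpus and the toy acceptor are hypothetical, and the toy acceptor merely checks words against a tiny vocabulary rather than parsing numbers. The sample inputs are taken from the two lists above.

```perl
use strict;
use warnings;

# Hypothetical harness: $accepts is any code ref that returns true
# when the program would accept the input.
sub check_corpus {
    my ($accepts, $should_accept, $should_reject) = @_;
    my @failures;
    for my $input (@$should_accept) {
        push @failures, "wrongly rejected: $input" unless $accepts->($input);
    }
    for my $input (@$should_reject) {
        push @failures, "wrongly accepted: $input" if $accepts->($input);
    }
    return @failures;
}

# Toy stand-in for the real parser: accept only known number words.
my %known = map { $_ => 1 } qw(and one two seven eighteen thirty forty thousand);
my $toy_accepts = sub {
    my ($input) = @_;
    my @words = grep { length } split /[\s,-]+/, lc $input;
    return 0 unless @words;
    for my $word (@words) {
        return 0 unless $known{$word};
    }
    return 1;
};

my @accept = ("thirty-one", "FORTY SEVEN", "two thousand and eighteen");
my @reject = ("diamond", "create account", "forth");

print "$_\n" for check_corpus($toy_accepts, \@accept, \@reject);
```

With the toy acceptor nothing is printed. Swapping in an acceptor that says yes to everything would report all three rejected inputs as wrongly accepted, which is the kind of regression a new "generous" rule can introduce without anyone noticing.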


2 Comments

Ron Savage | November 26, 2018 3:35 PM

Hi Ben

You might want to try Text::Levenshtein or similar. Search MetaCPAN for Levenshtein.

Ben Bullock replied to comment from Ron Savage | November 26, 2018 4:17 PM

Hi Ron,

I was just thinking the other day that I hadn't seen you for a while here.

Those Levenshtein things are OK for approximate work, but they tend to give lots of false positives. I actually did extensive work on spelling correction for another related web page which converts English words into Japanese forms, but I found that the Levenshtein corrections, even with a maximum distance of one, tend to throw up so many false positives that I'm not sure it's worth the bother. This is partly because of the amount of nonsense that people type in, and partly because it tends to do things like correct "Alisdair" to "Alistair", thus annoying all the "Alisdairs" out there. Similarly with lots of "Cherril" or "Candyce" type names. I actually encourage people to type their names in, so I think it could get annoying if I kept saying "do you mean 'Candice'?". I didn't put the English correction discussion into this blog post since I thought the post would get too long, but perhaps I should prepare part 2.
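To see why a maximum distance of one is already too loose, here is a small self-contained sketch of the standard dynamic-programming edit distance. It is written out by hand so that it runs without CPAN; it is not the Text::Levenshtein implementation, and the function name levenshtein is chosen just for this example.

```perl
use strict;
use warnings;

# Plain Levenshtein distance, keeping only the previous row of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                     # match or substitution
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;    # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best; # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein("alisdair", "alistair"), "\n";  # 1
print levenshtein("candyce", "candice"), "\n";    # 1
print levenshtein("forth", "forty"), "\n";        # 1
```

A legitimate name and a plain non-number like "forth" both sit at distance one from something the corrector knows, so the distance alone cannot tell a typo from a deliberate input.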

Anyway, actually I hand-crafted error correction based on the old inputs. Lots of "amd" for "and", etc., so I just coded that as a hand correction, like s/\bamd\b/and/, and about a thousand different ways to spell "thousand" and "million/billion". Then a few days after that I got "bollion" for "billion" which I'd never seen before.
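A hand-built correction table of that sort might look roughly like the sketch below. Only the "amd" rule is quoted from the comment above; the other two entries use misspellings that appear in the accepted inputs listed in the post, and the names @corrections and apply_corrections are hypothetical.

```perl
use strict;
use warnings;

# Each entry is a pattern seen in real input and its replacement.
my @corrections = (
    [ qr/\bamd\b/,      'and'      ],
    [ qr/\bthousend\b/, 'thousand' ],
    [ qr/\bbilion\b/,   'billion'  ],
);

sub apply_corrections {
    my ($text) = @_;
    $text = lc $text;
    for my $correction (@corrections) {
        my ($pattern, $replacement) = @$correction;
        $text =~ s/$pattern/$replacement/g;
    }
    return $text;
}

print apply_corrections("one thousand amd sixty eight"), "\n";
# one thousand and sixty eight
print apply_corrections("one bollion"), "\n";
# one bollion (no rule for this spelling yet)
```

Because every rule names one exact misspelling, a new rule cannot disturb inputs that already work, at the cost of never being finished.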


About Ben Bullock

Perl user since about 2006, I have also released some CPAN modules.

