Unicode and Passwords
As I was doing some reading on Unicode, I had to sign up for a free account with the ft.com site in order to read one of their articles. I normally use strong passwords, but this Web site presented me with the following error message:
Your password must be at least 6 characters long and include letters and numbers only
Ignoring the bad user interface — please tell me before I typed the damned password — it's also suggestive of security issues (ask Bobby for one reason why programmers have such bad password restrictions).
And that got me to thinking about Å, also known as U+212B.
So the first thing you want to do is run this little program:
use charnames ':short';
# stop the "Wide character in print" warnings
binmode STDOUT, ':encoding(UTF-8)';
print "\N{U+212B} \N{U+00C5} \N{U+0041}\N{U+030A}\n";
And that will print Å Å Å.
Note: if your system is so broken it can't view the letters above, they're the upper case A with a combining ring above.
Even though those characters are different code points, Unicode treats them as canonically equivalent, so they must be considered to be the same character. The Unicode::Collate module demonstrates how this is done with the Unicode Collation Algorithm. Perl's built-in cmp operator, by contrast, compares strings code point by code point and so reports them as different.
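A minimal sketch of that difference, borrowing the $a1, $a2 and $a3 names that the notes further down mention (the values assigned to them here are an assumption). It compares the three spellings of Å with the built-in eq and with Unicode::Collate, which by default applies NFD normalization before comparing.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my $a1 = "\x{212B}";            # ANGSTROM SIGN
my $a2 = "\x{00C5}";            # LATIN CAPITAL LETTER A WITH RING ABOVE
my $a3 = "\x{0041}\x{030A}";    # A followed by COMBINING RING ABOVE

my $collator = Unicode::Collate->new;

# The built-in operators compare code points, so the spellings differ.
my $builtin_same = ($a1 eq $a2 && $a2 eq $a3) ? 1 : 0;

# Unicode::Collate normalizes first, so all three compare as equal.
my $collate_same =
    ($collator->eq($a1, $a2) && $collator->eq($a2, $a3)) ? 1 : 0;

print "built-in eq:      ", ($builtin_same ? "same" : "different"), "\n";
print "Unicode::Collate: ", ($collate_same ? "same" : "different"), "\n";
```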
This raises an interesting (to me) question. What should passwords do? If I allow Unicode in passwords, should I allow U+212B when the original password had U+00C5? Is this, in theory, restricting the number of possible combinations someone can type or is the password space so huge here that it really doesn't matter? Are there other security implications here that I should be aware of?
A couple of comments on my code above for those not familiar with it. The utf8::all module was there because originally I was printing out those Unicode values and I didn't want the "Wide Character in Print" warnings. However, you must load that before Test::More because Test::Builder will dup your filehandles and you want to set their encoding layer prior to that duplication.
Alternatively, for those who hate the utf8::all module (or for those who prefer less magic), you can use this:
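A minimal sketch of that alternative, assuming it amounts to an explicit binmode call on each of the two handles:

```perl
# Put a validating UTF-8 encoding layer on both standard output handles.
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';
```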
That tells Perl that your STDERR and STDOUT are both going to be expecting UTF-8 data. Thus, if you print any of $a1, $a2 or $a3 to either STDERR or STDOUT, you won't get those warnings.
One final note: don't use binmode STDOUT, ':utf8';. You see that a lot in example code, but it's wrong. It merely marks the data as UTF-8 without checking that it is valid. See this PerlMonks post for a proof of concept exploit. Whenever you see ':utf8', it's probably a bug and you should change it to ':encoding(UTF-8)'.
If your password comes from a browser, you also have the interesting problem of detecting the encoding of the raw data you're receiving.
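One hedged way to approach that, assuming your form is served and submitted as UTF-8 (something you would need to confirm for your own application): decode the raw bytes strictly and refuse anything that is not well-formed UTF-8, rather than guessing. The decode_password name is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw request bytes into a character string,
# or return undef if the bytes are not well-formed UTF-8.
sub decode_password {
    my ($raw_bytes) = @_;
    my $chars = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $raw_bytes intact.
        Encode::decode('UTF-8', $raw_bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;    # undef when decoding failed
}

my $ok  = decode_password("\xC3\x85");    # UTF-8 bytes for U+00C5
my $bad = decode_password("\xFF\xFE");    # not valid UTF-8
```

Rejecting bad input outright is a design choice for this sketch. An application that accepts several encodings would need a different strategy.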
For your specific purpose, I'd choose a normal form (NFC probably, so stuff like the Dutch ij is not converted to i+j) and convert every Unicode input to that.
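As a rough illustration of that suggestion, the core Unicode::Normalize module can fold all three spellings of Å into one, while NFC leaves the ij ligature (U+0133) alone. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my @spellings = ("\x{212B}", "\x{00C5}", "\x{0041}\x{030A}");

# After NFC, every spelling is the single code point U+00C5.
my @normalized = map { NFC($_) } @spellings;
my $all_same   = (grep { $_ ne "\x{00C5}" } @normalized) ? 0 : 1;

# NFC keeps the Dutch ij ligature; the compatibility form NFKC splits it.
my $ij             = "\x{0133}";
my $nfc_keeps_ij   = NFC($ij)  eq $ij  ? 1 : 0;
my $nfkc_splits_ij = NFKC($ij) eq "ij" ? 1 : 0;

printf "all spellings equal after NFC: %s\n", $all_same ? "yes" : "no";
printf "NFC keeps ij: %s, NFKC splits ij: %s\n",
    $nfc_keeps_ij   ? "yes" : "no",
    $nfkc_splits_ij ? "yes" : "no";
```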
It's worth knowing that binmode($fh, ':encoding(UTF-8)') blows away $@. However, not all encoding layers for binmode() blow away $@. You can hold on to your $@ by tossing a local $@; before your call to binmode().
I second what Rafaël says. Normalize your input and you'll have fewer problems dealing with Unicode. It's not just specific to passwords, but applies to other things like searching too. But if you normalize correctly then those characters will be collapsed into the same thing.
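On the $@ point raised above, here is a minimal sketch of the local $@ guard. Whether binmode() actually clobbers $@ may depend on your Perl version and the layer used, so treat this as a defensive habit rather than a fix for a guaranteed problem. The error text is invented for the example.

```perl
use strict;
use warnings;

# Pretend an earlier eval failed and we still want to inspect its error.
eval { die "original failure\n" };

{
    # Localize $@ so anything binmode() does to it is undone at scope exit.
    local $@;
    binmode STDOUT, ':encoding(UTF-8)';
}

my $preserved = $@;    # restored to the value it had before the block
```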
NFKD or NFKC can be useful if you want very lenient comparisons, but I'd stick with NFD or NFC for password comparisons so as not to turn a strong password into a weak one. If passwords weren't normalized at all, it could be a problem for a user travelling in another country who can't log in because the foreign keyboard mapping produces a canonically equivalent character at a different code point.
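To make that trade-off concrete, here is a small sketch (the same_password helper and the sample passwords are hypothetical). Comparing under NFC treats canonically equivalent input as the same password, while NFKC also folds characters that are only compatibility-equivalent, such as a superscript two, and so shrinks the password space.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: compare two passwords after applying a normalizer.
sub same_password {
    my ($normalize, $stored, $typed) = @_;
    return $normalize->($stored) eq $normalize->($typed) ? 1 : 0;
}

# Canonically equivalent: precomposed letters versus base + combining mark.
my $canonical = same_password(\&NFC,
    "\x{00C5}ngstr\x{00F6}m", "A\x{030A}ngstro\x{0308}m");

# Only compatibility-equivalent: SUPERSCRIPT TWO versus the digit 2.
my $nfc_folds_super  = same_password(\&NFC,  "x\x{00B2}", "x2");
my $nfkc_folds_super = same_password(\&NFKC, "x\x{00B2}", "x2");
```

In a real system the normalization step would run before hashing, both when the password is set and when it is checked.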
One thing to consider is, what are your costs? Has anyone done research to find out whether allowing Unicode passwords results in an increase of customer service calls, because people are having problems? You may think passwords become more secure because of the increased key space, but what if someone picks a password with a "smart-quote" (U+2019) when creating an account using his PC (not really realizing he's using "smart-quotes")? Then, later, while travelling, he tries to log in to your service using a mobile device, but the keyboard on that device has regular quotes (U+0027) handy, and "smart-quotes" hidden behind layers of menus. Now, your customer gets told his password is incorrect. You may lose sales this way, or even lose the customer. Or he may keep a customer service agent occupied for 15 minutes. I'm not going to express an opinion on whether that's a price you should be willing to pay, but it is something one has to consider.
Actually, ':utf8' is only a problem when used for input. For output it doesn't really matter (if your output is invalid UTF-8 already, you've got bigger problems to worry about).