Unicode and Passwords

As I was doing some reading on Unicode, I had to sign up for a free account with the ft.com site in order to read one of their articles. I normally use strong passwords, but this Web site presented me with the following error message:

Your password must be at least 6 characters long and include letters and numbers only

Ignoring the bad user interface (please tell me the rules before I type the damned password), it's also suggestive of security issues (ask Bobby for one reason why programmers have such bad password restrictions).

And that got me to thinking about Å, also known as U+212B.

So the first thing you want to do is run this little program:

use charnames ':short';

# stop the "Wide character in print" warnings
binmode STDOUT, ':encoding(UTF-8)';

print "\N{U+212B} \N{U+00C5} \N{U+0041}\N{U+030A}\n";

And that will print Å Å Å.

Note: if your system is so broken it can't display the letters above, each one is an upper case A with a ring above.

Even though those are different code point sequences, Unicode treats them as canonically equivalent, so they must be considered to be the same character. The Unicode::Collate module demonstrates how this is done with the Unicode Collation Algorithm, whereas Perl's built-in cmp operator compares code points and so reports them as different.
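To make the difference concrete, here is a minimal sketch comparing the three spellings with plain eq and with a default Unicode::Collate object. The variable names are mine, and I'm assuming a collator with no options (which normalizes its input before comparing) is good enough for the demonstration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $angstrom = "\N{U+212B}";            # ANGSTROM SIGN
my $a_ring   = "\N{U+00C5}";            # LATIN CAPITAL LETTER A WITH RING ABOVE
my $combined = "\N{U+0041}\N{U+030A}";  # A followed by COMBINING RING ABOVE

# The built-in eq compares code point by code point, so these differ.
my $plain_equal = ( $angstrom eq $a_ring ) ? 1 : 0;

# A default collator treats all three spellings as equal.
my $collator  = Unicode::Collate->new;
my $uca_equal = ( $collator->eq( $angstrom, $a_ring )
               && $collator->eq( $a_ring, $combined ) ) ? 1 : 0;

print "eq: $plain_equal, collator: $uca_equal\n";
```

On my reading this prints "eq: 0, collator: 1".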

This raises an interesting (to me) question. What should passwords do? If I allow Unicode in passwords, should I allow U+212B when the original password had U+00C5? Is this, in theory, restricting the number of possible combinations someone can type or is the password space so huge here that it really doesn't matter? Are there other security implications here that I should be aware of?


If your password comes from a browser, you also have the interesting problem of detecting the encoding of the raw data you're receiving.
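You can't reliably guess an encoding from raw bytes, but you can at least refuse bytes that aren't what you expect. A rough sketch, assuming the form is supposed to submit UTF-8; decode_password_bytes is a made-up helper name, not something from a module:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded characters, or undef
# when the bytes are not valid UTF-8.
sub decode_password_bytes {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops it from modifying $bytes in place.
    my $chars = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $chars;
}

my $ok  = decode_password_bytes("\xC3\x85ngstrom1");  # UTF-8 bytes for U+00C5
my $bad = decode_password_bytes("\xC5ngstrom1");      # Latin-1 byte for the same letter

print defined $ok  ? "first input decoded\n"  : "first input rejected\n";
print defined $bad ? "second input decoded\n" : "second input rejected\n";
```

What you do with a rejected input (retry as Latin-1, show an error) is a policy decision this sketch doesn't make.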

For your specific purpose, I'd choose a normal form (NFC probably, so stuff like the Dutch ij is not converted to i+j) and convert every Unicode input to that.
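As a small illustration of that suggestion, here's a sketch using Unicode::Normalize (variable names are mine). It runs the three spellings of Å through NFC and also shows why NFC rather than NFKC keeps the Dutch ij (U+0133) intact.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my @spellings = ( "\N{U+212B}", "\N{U+00C5}", "\N{U+0041}\N{U+030A}" );

# After NFC, all three spellings collapse to one string (U+00C5).
my %distinct = map { NFC($_) => 1 } @spellings;
my $count    = keys %distinct;

# NFC leaves the ij ligature alone; NFKC splits it into "i" and "j".
my $ij_nfc  = NFC("\N{U+0133}");
my $ij_nfkc = NFKC("\N{U+0133}");

printf "distinct NFC forms: %d\n", $count;
printf "NFC length: %d, NFKC length: %d\n", length $ij_nfc, length $ij_nfkc;
```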

It's worth knowing that binmode($fh, ':encoding(UTF-8)') blows away $@. However, not all encoding layers for binmode() blow away $@.

foreach my $layer (':raw', ':utf8', ':encoding(UTF-8)', ':encoding(utf8)') {
  # set $@ to a known value, then see whether binmode() leaves it alone
  eval { die $layer . "\n" };
  binmode STDOUT, $layer;
  if ($@) {
    chomp $@;
    print "kept \$\@: $@ for $layer\n";
  } else {
    print "lost \$\@ for $layer\n";
  }
}

For me, that prints:

kept $@: :raw for :raw
kept $@: :utf8 for :utf8
lost $@ for :encoding(UTF-8)
lost $@ for :encoding(utf8)

You can hold on to your $@ by tossing a local $@; before your call to binmode().
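A minimal sketch of that trick, with a made-up error message. The bare block limits how long the localized $@ lives, so the outer value comes back as soon as the block exits, whether or not your perl clobbers $@ in binmode().

```perl
use strict;
use warnings;

eval { die "important error\n" };

{
    # local gives this block its own $@;
    # the outer value is restored when the block exits.
    local $@;
    binmode STDOUT, ':encoding(UTF-8)';
}

my $error_after = $@;
print "still have: $error_after";
```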

I second what Rafaël says. Normalize your input and you'll have fewer problems dealing with Unicode. This isn't specific to passwords; it applies to other things like searching too. If you normalize correctly, those characters will be collapsed into the same thing.

NFKD or NFKC can be useful if you want very lenient comparisons, but I'd stick with NFD or NFC for password comparisons so as not to turn a strong password into a weak one. If normalization forms weren't used on passwords, it could potentially be a problem for a user travelling in another country who can't log in because the foreign keyboard mapping uses the same canonical character at a different code point.
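Here's a quick sketch of how the compatibility forms can weaken a password. The two passwords are invented for the example: one ends in SUPERSCRIPT TWO (U+00B2), the other in a plain digit 2.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $with_superscript = "pass\N{U+00B2}";  # ends in SUPERSCRIPT TWO
my $with_digit       = "pass2";

# Canonical normalization keeps the two passwords distinct...
my $same_under_nfc  = ( NFC($with_superscript)  eq NFC($with_digit) )  ? 1 : 0;

# ...but compatibility normalization folds the superscript into "2".
my $same_under_nfkc = ( NFKC($with_superscript) eq NFKC($with_digit) ) ? 1 : 0;

print "same under NFC: $same_under_nfc, same under NFKC: $same_under_nfkc\n";
```

So with NFKC, someone typing the ordinary digit would get in with a password that was meant to contain the rarer character.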

One thing to consider is, what are your costs? Has anyone done research to find out whether allowing Unicode passwords results in an increase of customer service calls, because people are having problems? You may think passwords become more secure because of the increased key space, but what if someone picks a password containing a "smart quote" (U+2019) when creating an account on his PC, without realizing he's using smart quotes? Then, later, while travelling, he tries to log in to your service using a mobile device, but the keyboard on that device has regular quotes (U+0027) handy, and smart quotes hidden behind layers of menus. Now your customer gets told his password is incorrect. You may lose sales this way, or even lose the customer. Or he may keep a customer service agent occupied for 15 minutes. I'm not going to express an opinion on whether that's a price you should be willing to pay, but it is something one has to consider.
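Note that normalization won't rescue you in the smart-quote case: U+2019 has no decomposition, so none of the normalization forms turn it into U+0027. A small sketch with an invented password:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $typed_on_pc     = "don\N{U+2019}t panic";  # RIGHT SINGLE QUOTATION MARK
my $typed_on_mobile = "don't panic";           # plain APOSTROPHE, U+0027

# Neither canonical nor compatibility normalization touches U+2019,
# so the two passwords stay different either way.
my $match_nfc  = ( NFC($typed_on_pc)  eq NFC($typed_on_mobile) )  ? 1 : 0;
my $match_nfkc = ( NFKC($typed_on_pc) eq NFKC($typed_on_mobile) ) ? 1 : 0;

print "match under NFC: $match_nfc, match under NFKC: $match_nfkc\n";
```

If you wanted those two to match, you'd need your own folding rule on top of normalization, and that's exactly the kind of cost/benefit call described above.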

One final note: don't use binmode STDOUT, ':utf8';. You see that a lot in example code, but it's wrong. It merely sets the layer as utf8 but doesn't validate it. See this perlmonks post for a proof of concept exploit. Whenever you see ':utf8', it's probably a bug and you should change it to ':encoding(UTF-8)'.

Actually, it's only a problem when used for input. For output it doesn't really matter (if your output is invalid UTF-8 already, you've got bigger problems to worry about).

About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/