Testing distributions for potentially malicious Unicode
I was inspired by Daniel Stenberg's recent article Detecting malicious Unicode to write Test::MixedScripts, which tests Perl source code and other text files for unexpected Unicode scripts.
Why should you care about this?
There are Unicode characters in different scripts (alphabets) that look similar and are easily confused.
A malicious person could replace a domain name or other important token with one that looks correct, but is associated with a host or other resource that they control.
Consider the two domain names, "оnе.example.com" and "one.example.com". They look indistinguishable in many fonts, but the first one has Cyrillic letters.
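To see the difference for yourself, here is a minimal sketch using only core Perl. It is not part of Test::MixedScripts, and the helper name describe_non_ascii is made up for this illustration. The Cyrillic letters are written as escapes so that they cannot be mistaken for their Latin lookalikes.

```perl
use strict;
use warnings;
use charnames ();    # provides charnames::viacode

# U+043E and U+0435 are the Cyrillic lookalikes of Latin "o" and "e"
my $spoofed = "\x{043E}n\x{0435}.example.com";
my $genuine = "one.example.com";

# Hypothetical helper: describe each non-ASCII character by code point and name
sub describe_non_ascii {
    my ($text) = @_;
    return map  { sprintf( "U+%04X %s", ord($_), charnames::viacode( ord($_) ) ) }
           grep { ord($_) > 127 } split //, $text;
}

print "The strings are identical\n" if $spoofed eq $genuine;    # not printed
print "$_\n" for describe_non_ascii($spoofed);
```

The string comparison fails even though the two names look the same on screen, and the loop prints the names of the two Cyrillic letters hiding in the first one.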
Confusing Unicode might be added to your codebase through a malicious patch submission or pull request. Or it could be added as text from an email or web page that you copied and pasted into your code.
The module is easy to use, and defaults to testing for Latin and Common characters:
use Test2::V0;
use Test::MixedScripts v0.3.0 qw/ all_perl_files_scripts_ok /;
all_perl_files_scripts_ok();
done_testing;
If you ran this test against a file with the URLs in the above example, you would see an error such as
Unexpected Cyrillic character on line 11 character 32
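If you are curious what a check like this involves, the following is a rough sketch and not the module's actual implementation. It assumes that looking up each character's script with the core Unicode::UCD module is close enough for illustration. The function name find_unexpected_scripts, the 1-based character count and the one-report-per-line behaviour are all my own assumptions.

```perl
use strict;
use warnings;
use Unicode::UCD qw( charscript );

# Hypothetical helper: report the first character on each line whose
# script is not in the allowed list (Common and Latin by default)
sub find_unexpected_scripts {
    my ( $text, @allowed ) = @_;
    @allowed = qw( Common Latin ) unless @allowed;
    my %ok = map { $_ => 1 } @allowed;

    my @problems;
    my $line_no = 0;
    for my $line ( split /\n/, $text ) {
        $line_no++;
        my $col = 0;
        for my $char ( split //, $line ) {
            $col++;
            my $script = charscript( ord $char ) // 'Unknown';
            next if $ok{$script};
            push @problems,
              "Unexpected $script character on line $line_no character $col";
            last;    # one report per line is enough for this sketch
        }
    }
    return @problems;
}

# Line 1 is plain ASCII; line 2 hides Cyrillic letters behind escapes
my $source = qq{my \$a = "one.example.com";\nmy \$b = "\x{043E}n\x{0435}.example.com";\n};
print "$_\n" for find_unexpected_scripts($source);
```

Passing extra script names, for example find_unexpected_scripts( $source, qw( Common Cyrillic Latin ) ), widens the allowed set in this sketch. Note that it would also flag combining marks, whose script is Inherited, unless you add that to the list.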
If your code has Cyrillic characters then you can add line- or region-specific notations, for example
my $host = "оnе.example.com"; ## Test::MixedScripts Common,Cyrillic,Latin
or you could use this for the entire codebase
all_perl_files_scripts_ok( { scripts => [qw/ Common Cyrillic Latin /] } );
You can also test specific non-Perl files:
file_scripts_ok( "Makefile" );
file_scripts_ok( "bin/service.sh" );
file_scripts_ok( "assets/script.js" );
file_scripts_ok( "assets/style.css" );
file_scripts_ok( "templates/index.tmpl" );
There's also a Dist::Zilla::Plugin::Test::MixedScripts to generate an author test for Dist::Zilla-managed distributions.
This is a new project, so there are likely bugs. But please give it a try, especially if you work on modules with mixed scripts in the codebase.