The not-so-great escape

By Ben Bullock on July 4, 2017 5:40 PM

Escaping HTML is the process of converting a user's input into something which can be displayed back to the user in a web browser, for example in the comment section of a blog or on a wiki editable by users.

Given user input such as <script>, an HTML escaper must output
&lt;script&gt; for the input to display correctly. The browser then
shows this as the text <script> rather than treating it as an actual HTML script tag:

s/</&lt;/g;
s/>/&gt;/g;

But suppose the user types the literal text &lt;script&gt;. What should be done with it?

If the &lt;script&gt; is not altered by the HTML escaper, then when it is displayed back to the user, the browser converts it into <script>, which is not what the user typed. Even worse, if the user tries to edit the comment again, the HTML tag may get removed from the text.
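The round trip can be simulated in a few lines of Perl. This is only a sketch: escape_lt_gt and browser_decode are hypothetical helper names, and browser_decode is a crude stand-in for what a browser does with three entities, not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: the incomplete escaper, which handles only < and >.
sub escape_lt_gt {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: a crude stand-in for the browser, which turns
# three entities back into characters when it displays the page.
sub browser_decode {
    my ($html) = @_;
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&amp;/&/g;    # decoded last, so "&amp;lt;" becomes "&lt;"
    return $html;
}

my $typed = '&lt;script&gt;';                      # what the user typed
my $shown = browser_decode(escape_lt_gt($typed));  # what the user sees
print "typed: $typed\nshown: $shown\n";
```

The escaper leaves the typed text untouched because it contains no < or >, so the simulated browser decodes it and the user sees something different from what was typed.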

The solution to this problem is to also convert the ampersand, &, into an HTML entity, like this:

s/&/&amp;/g;

This has to be done before the conversions of < and >.

s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;

If we do it after, < is first converted into &lt;, and then the ampersand of that new entity is converted again, giving &amp;lt;.

s/</&lt;/g;
s/>/&gt;/g;
s/&/&amp;/g;
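To see the difference between the two orderings side by side, here is a small sketch that wraps each one in a function. The names escape_amp_first and escape_amp_last are made up for this illustration.

```perl
use strict;
use warnings;

# Ampersand first: each special character is escaped exactly once.
sub escape_amp_first {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Ampersand last: the & of the freshly made &lt; and &gt; is escaped again.
sub escape_amp_last {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/&/&amp;/g;
    return $text;
}

for my $input ('<script>', '&lt;script&gt;') {
    print "input:      $input\n";
    print "  amp first: ", escape_amp_first($input), "\n";
    print "  amp last:  ", escape_amp_last($input), "\n";
}
```

With the ampersand handled last, both inputs come out as the same escaped string, so the escaper can no longer tell a real tag from typed entity text.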

The bug of not escaping ampersands to &amp; for display occurs in CPAN modules like HTML::Scrubber, and also in the CPAN ratings service.

5 comments

Toby Inkster | July 5, 2017 10:47 AM | Reply

" should probably be escaped too, in case the text is going to appear in an attribute value.

But just use HTML::Entities and forget about manually doing everything with regexes.

Ben Bullock replied to comment from Toby Inkster | July 5, 2017 12:46 PM | Reply

> But just use HTML::Entities and forget about manually doing everything with regexes.

It's probably better to use a module which does the job automatically, but what are you going to do when the module you use is wrong? I've written this article in response to finding bugs in various CPAN modules and in CPAN ratings.

For example, here is the bug in HTML::Scrubber:

https://metacpan.org/source/NIGELM/HTML-Scrubber-0.17/lib/HTML/Scrubber.pm#L301

Exactly the same bug occurs in CPAN ratings. I don't know what HTML entity replacement is used there. I reported the bug here:

https://github.com/perlorg/perlweb/issues/213

I originally found these bugs when writing a list of reviews of HTML cleanup modules, found here:

https://www.lemoda.net/perl/html-cleanup-modules/index.html

Aristotle replied to comment from Toby Inkster | July 6, 2017 11:43 PM | Reply

> [The double-quote] should probably be escaped too, in case the text is going to appear in an attribute value.

And so should the single quote, for the same reason. So there are 5 characters in total that you need to escape.

For widest compatibility with all the various *ML languages, it is best to use decimal numeric entities.

And lastly, the ordering problems mentioned by Ben become irrelevant if you do all of the escapes in a single pass.

All put together:

s/([<>&'"])/'&#'.ord($1).';'/ge
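As a rough sketch of how that one-liner might be used, it can be wrapped in a function and applied to text destined for an attribute value. The function name escape_html and the sample string are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one-liner above.
sub escape_html {
    my ($text) = @_;
    # One pass over the string: every match is replaced by its decimal
    # character reference, and the replacement text is never rescanned,
    # so there is no ordering between the five characters to get wrong.
    $text =~ s/([<>&'"])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $value = q{"Tom & Jerry's" <script>};
print '<input value="', escape_html($value), '">', "\n";
```

Because both kinds of quote are escaped, the value cannot close the attribute early, whichever quote character the surrounding markup uses.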

Ben Bullock | July 7, 2017 7:29 AM | Reply

A couple of examples of modules which do the escaping correctly are

Mojo::Util

https://metacpan.org/source/SRI/Mojolicious-7.34/lib/Mojo/Util.pm#L314

and

HTML::Entities

https://metacpan.org/source/GAAS/HTML-Parser-3.72/lib/HTML/Entities.pm#L462

as mentioned by Toby Inkster above.

Ron Savage | July 7, 2017 9:12 AM | Reply

Hi Ben

About Ben Bullock

Perl user since about 2006, I have also released some CPAN modules.
