The not-so-great escape
Escaping HTML is the process of converting a user's input into something which can be displayed back to the user in a web browser, for example in a comment section on a blog or on a wiki editable by users.
Given user input such as <script>, to display that correctly, an HTML
escaper must output &lt;script&gt;. The browser then displays this as the text
<script> rather than treating it as an actual HTML script tag:
s/</&lt;/g;
s/>/&gt;/g;
But supposing the user inputs &lt;script&gt;, what should be done with it?
If the &lt;script&gt; is not altered by the HTML escaper, then when it is displayed back to the user, the browser converts it into <script>, which is not what was intended. Even worse, if the user tries to edit the comment again, the resulting HTML tag may get removed from the text.
The solution to this problem is to also convert the ampersand, &, into an HTML entity, like this:
s/&/&amp;/g;
This has to be done before the conversions of < and >:
s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
If we do it after, we get < converted into &lt;, then into &amp;lt;.
s/</&lt;/g;
s/>/&gt;/g;
s/&/&amp;/g;
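The effect of the ordering can be checked with a short script. This is only a sketch: the function names escape_amp_first and escape_amp_last are made up for illustration and do not come from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the ampersand first, then the angle brackets.
sub escape_amp_first {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: escape the ampersand last. The ampersands that
# were just created by the first two substitutions get escaped again.
sub escape_amp_last {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/&/&amp;/g;
    return $text;
}

print escape_amp_first('<script>'), "\n";   # prints &lt;script&gt;
print escape_amp_last('<script>'), "\n";    # prints &amp;lt;script&amp;gt;
```

With the ampersand first, input that already contains &lt;script&gt; comes out as &amp;lt;script&amp;gt;, so the browser shows the user exactly what they typed.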
The bug in which ampersands are not escaped to &amp; for display occurs in CPAN modules like HTML::Scrubber, and also in the CPAN ratings service.
"
should probably be escaped too, in case the text is going to appear in an attribute value. But just use HTML::Entities and forget about manually doing everything with regexes.
> But just use HTML::Entities and forget about manually doing everything with regexes.
It's probably better to use a module which does the job automatically, but what are you going to do when the module you use is wrong? I've written this article in response to finding bugs in various CPAN modules and in CPAN ratings.
For example, here is the bug in HTML::Scrubber:
https://metacpan.org/source/NIGELM/HTML-Scrubber-0.17/lib/HTML/Scrubber.pm#L301
Exactly the same bug occurs in CPAN ratings. I don't know what HTML entity replacement is used there. I reported the bug here:
https://github.com/perlorg/perlweb/issues/213
I originally found these bugs when writing a list of reviews of HTML cleanup modules, found here:
https://www.lemoda.net/perl/html-cleanup-modules/index.html
And so should the single quote, for the same reason. So there are 5 characters in total that you need to escape.
For widest compatibility with all the various *ML languages, it is best to use decimal numeric entities.
And lastly, the ordering problems mentioned by Ben become irrelevant if you do all of the escapes in a single pass.
All put together:
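A minimal sketch of the three points above (five characters, decimal numeric entities, one pass) might look like the following. The function name escape_html and the %entity lookup table are assumptions made for illustration, not code taken from any module.

```perl
use strict;
use warnings;

# Map each of the five special characters to a decimal numeric entity.
my %entity = (
    '&' => '&#38;',
    '<' => '&#60;',
    '>' => '&#62;',
    '"' => '&#34;',
    "'" => '&#39;',
);

# escape_html is a hypothetical name used for illustration.
sub escape_html {
    my ($text) = @_;
    # One substitution pass: each input character is examined once, so an
    # ampersand produced by a replacement is never escaped a second time.
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

print escape_html(q{<a href="x">Tom & Jerry's</a>}), "\n";
# prints &#60;a href=&#34;x&#34;&#62;Tom &#38; Jerry&#39;s&#60;/a&#62;
```

Because the replacement text is never rescanned, the order of the entries in the table does not matter.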
A couple of examples of modules which do the escaping correctly are
Mojo::Util
https://metacpan.org/source/SRI/Mojolicious-7.34/lib/Mojo/Util.pm#L314
and
HTML::Entities
https://metacpan.org/source/GAAS/HTML-Parser-3.72/lib/HTML/Entities.pm#L462
as mentioned by Toby Inkster above.
Hi Ben
See also HTML::Entities::Interpolate.