The not-so-great escape

Escaping HTML is the process of converting a user's input into something which can be displayed back to the user in a web browser, for example in a comment section on a blog or on a wiki editable by users.

Given user input such as <script>, an HTML escaper must output
&lt;script&gt; for it to be displayed correctly. The browser then
renders this as the literal text <script> rather than treating it as
an actual HTML script tag.
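As a rough sketch of this first step, the two substitutions might look like the following. The sub name escape_angle_brackets is made up for illustration, and this version deliberately handles only < and >, so it still has the problem described next.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: escapes only the angle brackets.
# It does NOT handle the ampersand, so it is incomplete on purpose.
sub escape_angle_brackets {
    my ($text) = @_;
    $text =~ s/</&lt;/g;    # every < becomes &lt;
    $text =~ s/>/&gt;/g;    # every > becomes &gt;
    return $text;
}

print escape_angle_brackets('<script>'), "\n";    # prints &lt;script&gt;
```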


But supposing the user inputs &lt;script&gt;, what should be done with it?

If the HTML escaper does not alter the &lt;script&gt;, then the browser converts it back into <script> when it is displayed to the user, which is not what was intended. Even worse, if the user tries to edit the comment again, the HTML tag may get removed from the text.

The solution to this problem is to also convert the ampersand, &, into the HTML entity &amp;.
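The ampersand conversion on its own could be sketched like this. The sub name escape_ampersand is an assumption for the example, not something from a particular module.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: escapes only the ampersand.
sub escape_ampersand {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # every & becomes &amp;
    return $text;
}

# The user typed the entity itself, so it has to survive the round trip.
print escape_ampersand('&lt;script&gt;'), "\n";    # prints &amp;lt;script&amp;gt;
```

The browser turns each &amp; back into &, so the user sees &lt;script&gt; again, exactly as typed.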


This has to be done before the conversions of < and >.
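A minimal sketch of the safe ordering, using a hypothetical sub called escape_html, with the ampersand substitution placed first:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: ampersand first, then the brackets.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, while every & is the user's own
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_html('<script>'), "\n";          # prints &lt;script&gt;
print escape_html('&lt;script&gt;'), "\n";    # prints &amp;lt;script&amp;gt;
```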


If we do it after, we get < converted into &lt;, then into &amp;lt;.
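To make the failure concrete, here is a sketch with the substitutions in the wrong order. The sub name escape_html_wrong_order is invented for the example, and the code is shown only to demonstrate the bug.

```perl
use strict;
use warnings;

# Deliberately buggy, for illustration only: the ampersand is escaped last,
# so it also rewrites the & in the entities the earlier steps just produced.
sub escape_html_wrong_order {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/&/&amp;/g;    # too late: hits the & of &lt; and &gt; as well
    return $text;
}

print escape_html_wrong_order('<script>'), "\n";    # prints &amp;lt;script&amp;gt;
```

A browser would display that output as &lt;script&gt; instead of the <script> the user typed.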


This bug, where ampersands are not escaped to &amp; for display, occurs in CPAN modules like HTML::Scrubber, and also in the CPAN ratings service.


" should probably be escaped too, in case the text is going to appear in an attribute value.

But just use HTML::Entities and forget about manually doing everything with regexes.
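For example, a short sketch using encode_entities from HTML::Entities. This assumes the module is installed from CPAN, since it is not part of core Perl; the input string is made up.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);    # CPAN module, not in core Perl

# With no second argument, encode_entities escapes <, >, &, the quote
# characters, control characters and high-bit characters.
my $input   = '<a href="x">Tom & Jerry</a>';
my $escaped = encode_entities($input);

print "$escaped\n";    # prints &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```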

[The double-quote] should probably be escaped too, in case the text is going to appear in an attribute value.

And so should the single quote, for the same reason. So there are 5 characters in total that you need to escape: &, <, >, " and '.

For widest compatibility with all the various *ML languages, it is best to use decimal numeric entities.

And lastly, the ordering problems mentioned by Ben become irrelevant if you do all of the escapes in a single pass.

All put together:
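One way this could look, as a sketch rather than the commenter's exact code: a single substitution over all five characters, producing decimal numeric entities. The sub name escape_single_pass is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: one pass over the string, so each
# character is examined once and the ordering of rules cannot matter.
sub escape_single_pass {
    my ($text) = @_;
    # /e evaluates the replacement as code; ord gives the decimal code point.
    $text =~ s/([&<>"'])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print escape_single_pass('<script>'), "\n";    # prints &#60;script&#62;
print escape_single_pass('&lt;'), "\n";        # prints &#38;lt;
```

Because the replacement text is never rescanned, the & in a freshly written &#60; is not escaped a second time.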


About Ben Bullock

I blog about Perl.