Picking a better Markdown library for bad input

I was using Text::Markdown to process some malformed Markdown input when I noticed it generating broken HTML.

I started with the (bad) Markdown input " 1. z\n >" and got back this HTML: <p><ol>\n<li>z</p>\n\n<blockquote>\n <p></li>\n </ol></p>\n</blockquote>.

(See the incorrectly nested HTML tags, <p><ol><li></p>? The </p> arrives while the <ol> and <li> are still open.)
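If you want to spot this kind of mis-nesting without reading the tags by eye, a small stack-based check is enough. Here is a minimal sketch in Python (chosen only because its standard library ships an HTML tokenizer; nothing here comes from the Perl modules discussed in this post). The names NestingChecker and nesting_errors are my own, and the check is deliberately strict: it treats HTML like XML, so it also flags end tags that real HTML lets you omit, and it knows nothing about which elements may contain which.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Record closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.errors = []     # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        innermost = self.open_tags[-1] if self.open_tags else "nothing"
        self.errors.append(f"</{tag}> found while innermost open tag is {innermost}")
        # Recover: if the tag is open further down the stack, unwind to it.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def nesting_errors(html):
    """Return a list of nesting problems; an empty list means strictly nested."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    unclosed = [f"<{tag}> never closed" for tag in checker.open_tags]
    return checker.errors + unclosed


if __name__ == "__main__":
    # The HTML I got back from Text::Markdown, copied from above.
    bad = "<p><ol>\n<li>z</p>\n\n<blockquote>\n <p></li>\n </ol></p>\n</blockquote>"
    for problem in nesting_errors(bad):
        print(problem)
```

On the output quoted above, this reports the stray </p>, </li>, and </ol>. It is not a substitute for a real HTML validator, just a quick sanity check.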

So I tried feeding this bad Markdown to four different Perl Markdown libraries: Text::Markdown, Text::MultiMarkdown, Text::Markdown::Discount, and Markdent, to see which one would give me valid HTML.

The results?

  • Text::Markdown — invalid HTML <p><ol>\n<li>z</p>\n\n<blockquote>\n <p></li>\n </ol></p>\n</blockquote>\n

  • Text::MultiMarkdown — invalid HTML <p><ol>\n<li>z</p>\n\n<blockquote>\n <p></li>\n </ol></p>\n</blockquote>\n

  • Text::Markdown::Discount — valid HTML! <ol>\n<li> z\n\n<blockquote></blockquote></li>\n</ol>\n\n

  • Markdent — valid HTML, but doesn't generate a simple HTML fragment <!DOCTYPE html>\n<html><head><title></title></head><body><ol><li>z\n &gt;\n</li></ol></body></html>

The solution? Switch from Text::Markdown to Text::Markdown::Discount.

3 Comments

I'm not so sure that Markdown is invalid. Are you? Markdown is notoriously non-specific about all sorts of things.

I think Markdent does the right thing here. As far as it producing a document, it has two classes to produce HTML, Markdent::Handler::HTMLStream::Document and Markdent::Handler::HTMLStream::Fragment. I'm guessing you used the former when you wanted the latter.

(I am not including less-than and greater-than symbols in this comment.)

p is a block-level element and ol is a block-level element. A p element can contain only inline content, not another block, so p ol /p is invalid HTML. If you are interested, please see the following discussion:

https://metacpan.org/pod/HTML::Valid::Tagset#Issues-with-HTML::Tagset

I switched to using Pandoc for all my Markdown processing. Haven't seen a problem yet.


About Anirvan

I blog about Perl.