October 2013 Archives

What about HTML to Markdown?

Neil Bowers recently released a survey of Markdown-to-HTML formatters. I thought it was an interesting coincidence, because I had just written a CPAN library that goes the opposite way, from HTML to Markdown.

For various and sundry reasons I wanted to move my blog from a WordPress installation to a static blog where the post content is represented as Markdown. To my complete astonishment, there were no CPAN modules to convert HTML to Markdown, so I decided to write one based on HTML::Format.

In general, I was surprised by the lack of tools (in any language) to convert WordPress exports into Markdown, but now we have something for Perl. I was pleasantly surprised by how quick and straightforward it was to implement the converter. If you need to convert HTML into format X, give HTML::Format some serious consideration as the base platform for that work.
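To give a feel for why a converter like this can be short, here is a minimal sketch of the general shape: walk the HTML and emit target-format text as each element opens and closes. It is written in Python with only the standard library, and it is not the HTML::Format API or the code of my module. The names `MarkdownSketch` and `html_to_markdown` are hypothetical, and it only covers headings, paragraphs, bold, italics, links, and list items.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter covering a handful of tags (illustrative only)."""

    # Inline tags map to the Markdown marker that wraps their content.
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = {"h%d" % n for n in range(1, 7)}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # output fragments, joined at the end
        self.href = None   # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADINGS:
            # <h2> becomes "## ", <h3> becomes "### ", and so on.
            self.out.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")
        elif tag == "li":
            self.out.append("- ")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self.out.append("](%s)" % (self.href or ""))
            self.href = None
        elif tag == "p" or tag in self.HEADINGS:
            # Block elements end with a blank line.
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n")

    def handle_data(self, data):
        # Text is passed through unchanged; no Markdown escaping in this sketch.
        self.out.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to Markdown using the sketch above."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


if __name__ == "__main__":
    print(html_to_markdown("<h2>Title</h2><p>Some <b>bold</b> text.</p>"))
```

A real converter also has to deal with nesting, escaping, images, code blocks, and malformed markup, which is where building on an existing formatter framework pays off.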

Over the weekend, my new module was merged and released to CPAN by the HTML::Format maintainer. The driver script for the WordPress to Markdown conversion is here. I may revise my driver script to put post metadata into TOML, but I haven't done that yet, mostly because the static blog engine is still under construction and the exact post format requirements are still unstable.

I used a fairly good-sized corpus of posts as tests and had good results, but more tests are always welcome.

About Mark Allen

Singer, dad, nerd, not necessarily in that order. @bytemeorg