XML::LibXML parse_html_string iframe games

Say you have HTML containing certain "empty" tags, and you want to manipulate it with something like:


use XML::LibXML;

my $html = '<p><iframe src="..."></iframe></p>';
my $doc = XML::LibXML->new->parse_html_string($html);
# do stuff with $doc
print $doc->toString();

You would end up with:

<p><iframe src="..."/></p>

Namely, the "empty" iframe tag is going to get output as a single, self-closing tag.

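But that's not valid HTML. Not even valid HTML5. iframe is not a void element, so a browser parsing the page as HTML ignores the trailing slash and treats <iframe .../> as an opening tag that is never closed.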

XML::LibXML has the $XML::LibXML::setTagCompression setting, but this is no good here, because it expands every empty element, and void elements such as:

<hr/>, <br/> and <img src="..." alt="..."/>

cannot be written as:

<hr></hr>, <br></br> and <img src="..." alt="..."></img>

at least, not if you want valid HTML.
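
To see the problem with tag compression in practice, here is a minimal sketch. It assumes the switch is the package global $XML::LibXML::setTagCompression described in the XML::LibXML documentation for toString(), and the sample markup is made up for illustration. Exact output may vary with your libxml2 version.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup (illustrative only): one non-void element, one void element.
my $html = '<p><iframe src="..."></iframe><br></p>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# Switch tag compression handling on for this one serialisation only.
my $out = do {
    local $XML::LibXML::setTagCompression = 1;
    $doc->toString();
};

# Every empty element is now written as an open/close pair,
# so the iframe is fixed but the br is broken.
print $out;
```

The iframe should now get its closing tag, but the br is expected to come out as <br></br>, which trades one invalid construct for another.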

I hit upon the idea of appending a single space:


for ($doc->findnodes('//iframe')) {
    $_->appendChild(XML::LibXML::Text->new(' '))
        if !$_->hasChildNodes;
}

which works, because an extra space inside a previously "empty" iframe, script or canvas tag is harmless. However, this approach causes problems when you come across "empty" <textarea> tags, where the space becomes part of the field's value.

<textarea></textarea>

is not the same as:

<textarea> </textarea>

The solution that seems to work is:


for ($doc->findnodes('//iframe')) {
    $_->appendChild(XML::LibXML::Text->new(''))
        if !$_->hasChildNodes;
}

Namely, append a child node that represents the empty string. It is enough to convince toString() that the node has a child, so it does not try to "compress" the node down to a self-closing tag.

When it comes to emitting that child via toString(), it's an empty text string, so nothing is output.
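
Putting the pieces together, a self-contained sketch might look like the following. The helper name keep_close_tags and the XPath list are my own, hypothetical choices for illustration and not part of XML::LibXML; only findnodes, hasChildNodes, appendChild and XML::LibXML::Text->new come from the library.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: give every childless element matched by $xpath
# an empty text node, so toString() writes an explicit closing tag.
sub keep_close_tags {
    my ($doc, $xpath) = @_;
    for my $node ($doc->findnodes($xpath)) {
        $node->appendChild(XML::LibXML::Text->new(''))
            unless $node->hasChildNodes;
    }
    return $doc;
}

# Sample markup (illustrative only): two non-void elements and one void one.
my $html = '<p><iframe src="..."></iframe><textarea></textarea><br></p>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# Only list elements that must not be self-closed; leave br, hr, img alone.
keep_close_tags($doc, '//iframe | //textarea');

my $out = $doc->toString();
print $out;
```

With this, the iframe and textarea should be written with explicit closing tags and nothing between them, while the br stays self-closing. Extend the XPath to script, canvas or whatever else you need.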

PS: Yes, I know there are other ways to parse HTML, and that HTML != XML.

2 Comments

See HTML::HTML5::Writer. (Then HTML::HTML5::Parser while you’re at it.)

You pass HTML to a XML library - why do you expect it to treat it other than specified?
Maybe have a look at XML::LibXSLT and ...

About minty

I blog about Perl.