XML::LibXML parse_html_string iframe games
Given HTML with certain "empty" tags that you want to manipulate via something like:
use XML::LibXML;

my $html = '<p><iframe src="..."></iframe></p>';
my $doc = XML::LibXML->new->parse_html_string($html);
# do stuff with $doc
print $doc->toString();
You would end up with (leaving out the XML declaration, DOCTYPE and html/body wrapper that the parser adds):
<p><iframe src="..."/></p>
Namely, the "empty" iframe element is serialized as a single, self-closing tag.
But that's not valid HTML. Not even valid HTML5: iframe is not a void element, so it needs an explicit closing tag.
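XML::LibXML::Parser documents the setTagCompression flag (the global $XML::LibXML::setTagCompression), but this is no good here, because it expands every empty element in the document, and: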
<hr/>, <br/> and <img src="..." alt="..."/>
cannot be written as:
<hr></hr>, <br></br> and <img src="..." alt="..."></img>
at least, not if you want valid HTML.
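To show why the flag is too blunt, here is a minimal sketch that serializes the same document twice, once with the default and once with $XML::LibXML::setTagCompression set. The input HTML and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up input: one void element (br) and one non-void element (iframe).
my $html = '<p>a<br>b<iframe src="x"></iframe></p>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# Default serialization: childless elements are written as self-closing tags.
my $compressed = $doc->toString();

# With the global flag set, childless elements get an explicit closing tag.
my $expanded = do {
    local $XML::LibXML::setTagCompression = 1;
    $doc->toString();
};

print $compressed, "\n", $expanded, "\n";
```

The first string should contain both <br/> and <iframe src="x"/>, the second both <br></br> and <iframe src="x"></iframe>. It is all or nothing, which is exactly the problem.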
I hit upon the idea of appending a single space:
# $doc is the document parsed above
for ($doc->findnodes('//iframe')) {
    $_->appendChild(XML::LibXML::Text->new(' '))
        if !$_->hasChildNodes;
}
which works, because an extra space inside a previously "empty" iframe, script or canvas tag is harmless. However, this approach causes problems when you come across "empty" <textarea> tags:
<textarea></textarea>
is not the same as:
<textarea> </textarea>
because the space becomes part of the textarea's default value.
The solution that seems to work is:
# $doc is the document parsed above
for ($doc->findnodes('//iframe')) {
    $_->appendChild(XML::LibXML::Text->new(''))
        if !$_->hasChildNodes;
}
Namely, append a child text node that holds the empty string. That is enough to convince XML::LibXML's toString() that the element has a child, so it does not try to "compress" the element down to a self-closing tag.
When it comes to emitting that child via toString(), it's an empty text node, so nothing is output.
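Putting it together, here is a minimal end-to-end sketch of the empty-text-node trick applied to more than one element name. The input HTML, the @keep_open list and the variable names are my own illustration, not anything XML::LibXML provides.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up input: two elements that need a closing tag, plus a void <br>.
my $html = '<p><iframe src="x"></iframe><textarea></textarea><br></p>';
my $doc  = XML::LibXML->new->parse_html_string($html);

# Hypothetical list of elements that must never be self-closed.
my @keep_open = qw(iframe textarea);
my $xpath     = join ' | ', map { "//$_" } @keep_open;

for my $node ($doc->findnodes($xpath)) {
    next if $node->hasChildNodes;
    # An empty text node counts as a child but serializes to nothing.
    $node->appendChild(XML::LibXML::Text->new(''));
}

my $out = $doc->toString();
print $out;
```

With this, the iframe and textarea should come out with explicit closing tags and no added whitespace, while the br is still written as <br/>.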
P.S. Yes, I know there are other ways to parse HTML and that HTML != XML.
See HTML::HTML5::Writer. (Then HTML::HTML5::Parser while you’re at it.)
You pass HTML to an XML library. Why do you expect it to behave other than as specified?
Maybe have a look at XML::LibXSLT and ...