HTML-Tree 5: Now with weakref support
HTML-Tree has long been a source of memory leaks for programmers who weren’t very careful with it. Because it uses circular references, Perl’s reference-counting garbage collector can’t clean it up if you forget to call $tree->delete when you’re done.
Perl added weak references (a.k.a. “weakrefs”) to resolve this problem, but HTML-Tree has never taken advantage of them. Until now.
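To make the leak concrete, here is a minimal sketch of the mechanism. It uses a toy Node class rather than HTML-Tree itself: the class, its weak option, and the $destroyed counter are all hypothetical, and only weaken from the core Scalar::Util module is real.

```perl
use strict;
use warnings;

# Toy stand-in for a tree node (hypothetical; not HTML::Element's real code).
package Node {
    use Scalar::Util qw(weaken);

    our $destroyed = 0;    # counts nodes that Perl has actually freed

    sub new { my ($class) = @_; return bless { children => [] }, $class }

    sub add_child {
        my ($self, $child, %opt) = @_;
        push @{ $self->{children} }, $child;       # parent -> child (strong)
        $child->{parent} = $self;                  # child -> parent: a cycle
        weaken($child->{parent}) if $opt{weak};    # weak link breaks the cycle
        return $child;
    }

    sub DESTROY { $destroyed++ }
}

# Build a two-node tree, let it go out of scope, and report how many
# of its nodes were freed at that point.
sub freed_after_scope {
    my (%opt) = @_;
    $Node::destroyed = 0;
    {
        my $root = Node->new;
        $root->add_child(Node->new, %opt);
    }
    return $Node::destroyed;
}

my $strong = freed_after_scope(weak => 0);
my $weak   = freed_after_scope(weak => 1);
print "strong parent links: $strong of 2 nodes freed\n";
print "weak parent links:   $weak of 2 nodes freed\n";
```

With strong parent links, neither node is freed when the tree goes out of scope, because each still holds a counted reference to the other. With weak parent links, both are freed immediately.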
HTML-Tree 5.00 (just released to CPAN) uses weak references by default. This means that when a tree goes out of scope, it gets deleted whether you called $tree->delete or not. This should eliminate memory leaks caused by HTML-Tree.
Unfortunately, it can also break code that was working. Even though that code probably leaked memory, that’s not a big problem with a short-running script. The one real-world example I’ve found so far is pQuery’s dom.t. In pQuery 0.08, it does:
my @elems = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>')
    ->childNodes;
my $comment = $elems[1];
is $comment->parentNode->tagName, 'DIV', 'Comment has parentNode';
Notice that it’s not saving the result of the fromHTML call; only the child nodes. Since children now have only a weak reference to their parent, the root node is deleted immediately, and the comment is left with no parent.
This can be fixed by saving a reference to the root node:
my @elems = (my $r = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>'))
    ->childNodes;
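The failure and the fix can both be reproduced without pQuery. The following is a sketch only: Node is a hypothetical class with weakened parent links standing in for HTML-Tree 5's behavior, and build_tree is a made-up helper that returns a root with a single child.

```perl
use strict;
use warnings;

# Hypothetical node class whose parent links are weak, as in HTML-Tree 5.
package Node {
    use Scalar::Util qw(weaken);

    sub new {
        my ($class, $tag) = @_;
        return bless { tag => $tag, children => [] }, $class;
    }

    sub add_child {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;    # parent holds child strongly
        $child->{parent} = $self;
        weaken($child->{parent});               # child holds parent weakly
        return $self;
    }

    sub children { @{ $_[0]{children} } }
    sub parent   { $_[0]{parent} }
    sub tag      { $_[0]{tag} }
}

# Made-up helper: returns a root node with one comment child.
sub build_tree { Node->new('div')->add_child(Node->new('comment')) }

# Broken: the root is a temporary that is freed once the statement ends,
# so the child's weak parent link becomes undef.
my ($orphan) = build_tree()->children;
my $orphan_parent = defined $orphan->parent ? $orphan->parent->tag : 'undef';

# Fixed: $root keeps the whole tree alive while we use the child.
my ($child) = (my $root = build_tree())->children;
my $child_parent = defined $child->parent ? $child->parent->tag : 'undef';

print "without saved root: parent is $orphan_parent\n";
print "with saved root:    parent is $child_parent\n";
```

The only difference between the two cases is the lexical $root, which must stay in scope for as long as any code needs to walk upward from a child.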
As a quick fix for broken code (and to determine whether it’s the weak references that are causing the breakage), you can say:
use HTML::Element -noweak;
This (globally) disables HTML-Tree’s use of weak references. But this is just a temporary measure. You need to fix your code, because the -noweak option will be going away eventually.
If you want to ensure that weak references are enabled, you can say:
use HTML::Element 5 -weak;
(It is necessary to include the version number, because previous versions of HTML-Tree simply ignored the import list.)
The next major change I’m planning for HTML-Tree is to make parse_file use IO::HTML by default. Right now, it opens files in binary mode, which means that it doesn’t do the right thing when the file isn’t ISO-8859-1. IO::HTML uses the HTML5 encoding sniffing algorithm to open files using the right encoding. But you don’t have to wait for HTML-Tree 6; you can start using IO::HTML today. Just use IO::HTML and then use $tree->parse_file(html_file($filename)). (It also works with