June 2012 Archives

HTML-Tree 5: Now with weakref support

HTML-Tree has long been a source of memory leaks for programmers who weren’t very careful with it. Because it uses circular references, Perl’s reference-counting garbage collector can’t clean it up if you forget to call $tree->delete when you’re done. Perl added weak references (a.k.a. “weakrefs”) to resolve this problem, but HTML-Tree has never taken advantage of them. Until now.

HTML-Tree 5.00 (just released to CPAN) uses weak references by default. This means that when a tree goes out of scope, it gets deleted whether you called $tree->delete or not. This should eliminate memory leaks caused by HTML-Tree.
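To make the mechanism concrete, here is a minimal sketch using only core Perl. DemoNode is a hypothetical stand-in class, not HTML::Element's actual implementation; the only assumption it shares with HTML-Tree is that parents hold their children and children point back at their parent. It counts how many nodes are freed when a tree goes out of scope, with and without weak parent links.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

{
    # Hypothetical minimal node class (illustration only).
    package DemoNode;

    our $destroyed = 0;    # how many nodes have been freed so far

    sub new {
        my ($class, %args) = @_;
        return bless {
            children => [],
            parent   => undef,
            weak     => $args{weak},
        }, $class;
    }

    sub add_child {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;    # parent -> child (strong)
        $child->{parent} = $self;               # child -> parent (circular)
        Scalar::Util::weaken($child->{parent}) if $self->{weak};
        return $child;
    }

    sub DESTROY { $destroyed++ }
}

# Build a two-node tree and let it go out of scope without any cleanup call.
sub build_and_drop {
    my ($weak) = @_;
    my $root = DemoNode->new(weak => $weak);
    $root->add_child(DemoNode->new(weak => $weak));
    return;
}

$DemoNode::destroyed = 0;
build_and_drop(0);
my $freed_strong = $DemoNode::destroyed;

$DemoNode::destroyed = 0;
build_and_drop(1);
my $freed_weak = $DemoNode::destroyed;

print "strong parent links: $freed_strong nodes freed\n";
print "weak parent links:   $freed_weak nodes freed\n";
```

With strong parent links the cycle keeps both nodes alive (0 freed) until the program exits; with weak parent links both nodes are freed as soon as the last outside reference to the root disappears.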

Unfortunately, it can also break code that used to work. Such code probably leaked memory, but a leak is not a big problem in a short-running script. The one real-world example I’ve found so far is pQuery’s dom.t. In pQuery 0.08, it does:

my @elems = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>')
                       ->childNodes;
my $comment = $elems[1];
is $comment->parentNode->tagName, 'DIV', 'Comment has parentNode';

Notice that it saves only the child nodes, not the result of the fromHTML call. Since children now have only a weak reference to their parent, the root node is deleted as soon as the statement finishes, and $comment->parentNode is undef.
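The same effect can be reproduced without pQuery or HTML-Tree. The sketch below uses plain hashes and a hypothetical make_tree helper (neither comes from either library) to show a weak parent link turning into undef when only the children are kept, and staying valid when the root is held in a variable.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical helper: returns a root hash whose children hold
# only weak references back to it.
sub make_tree {
    my $root = { tag => 'div', children => [] };
    for my $text ('xxx', 'zzz') {
        my $child = { text => $text, parent => $root };
        weaken($child->{parent});
        push @{ $root->{children} }, $child;
    }
    return $root;
}

# Keep only the children. The root is a temporary value, so it is
# freed at the end of this statement and the weak links become undef.
my @orphans = @{ make_tree()->{children} };
my $orphan_has_parent = defined($orphans[0]{parent}) ? 1 : 0;

# Keep the root alive in $keep, so the children can still reach it.
my @kids = @{ (my $keep = make_tree())->{children} };
my $kid_has_parent = defined($kids[0]{parent}) ? 1 : 0;

print "root discarded: parent defined? $orphan_has_parent\n";
print "root kept:      parent defined? $kid_has_parent\n";
```

The first case reports 0 and the second reports 1, which is the difference between a missing parent and a usable one.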

This can be fixed by saving a reference to the root node:

my @elems = (my $r = pQuery::DOM
                     ->fromHTML('<div>xxx<!-- yyy -->zzz</div>'))
                     ->childNodes;

As a quick fix for broken code (and to determine whether it’s the weak references that are causing the breakage), you can say:

use HTML::Element -noweak;

This (globally) disables HTML-Tree’s use of weak references. But this is just a temporary measure. You need to fix your code, because the -noweak option will be going away eventually.

If you want to ensure that weak references are enabled, you can say:

use HTML::Element 5 -weak;

(It is necessary to include the version number, because previous versions of HTML-Tree simply ignored the import list.)

The next major change I’m planning for HTML-Tree is to make parse_file use IO::HTML by default. Right now, it opens files in binary mode, which means that it doesn’t do the right thing when the file isn’t ISO-8859-1. IO::HTML uses the HTML5 encoding sniffing algorithm to open files using the right encoding. But you don’t have to wait for HTML-Tree 6; you can start using IO::HTML today. Just use IO::HTML and then use $tree->parse_file(html_file($filename)). (It also works with new_from_file.)

About Christopher J. Madsen

I blog about Perl.