HTML-Tree 5: Now with weakref support

HTML-Tree has long been a source of memory leaks for programmers who weren’t very careful with it. Because it uses circular references, Perl’s reference-counting garbage collector can’t clean it up if you forget to call $tree->delete when you’re done. Perl added weak references (a.k.a. “weakrefs”) to resolve this problem, but HTML-Tree has never taken advantage of them. Until now.

HTML-Tree 5.00 (just released to CPAN) uses weak references by default. This means that when a tree goes out of scope, it gets deleted whether you called $tree->delete or not. This should eliminate memory leaks caused by HTML-Tree.
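To make the mechanism concrete, here is a minimal sketch in plain Perl using the core Scalar::Util module. It is not HTML-Tree's actual code: the Node class, its add_child method, and the build_tree helper are invented for illustration. It counts how many nodes Perl frees when a small tree goes out of scope, first with an ordinary (strong) parent link and then with a weakened one.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-in for a tree node; not HTML::Element's real implementation.
package Node {
    our $destroyed = 0;    # counts how many nodes Perl has freed

    sub new {
        my ($class, %args) = @_;
        my $self = { children => [], parent => undef, %args };
        return bless $self, $class;
    }

    sub add_child {
        my ($self, $child, %opt) = @_;
        push @{ $self->{children} }, $child;    # parent -> child (strong)
        $child->{parent} = $self;               # child -> parent (closes the cycle)
        weaken($child->{parent}) if $opt{weak}; # optionally break the cycle
        return $child;
    }

    sub DESTROY { $destroyed++ }
}

package main;

# Build a two-node tree and let it fall out of scope without any cleanup call.
sub build_tree {
    my (%opt) = @_;
    my $root = Node->new(name => 'root');
    $root->add_child(Node->new(name => 'kid'), %opt);
    return;
}

build_tree(weak => 0);
my $after_strong = $Node::destroyed;    # 0: the cycle keeps both nodes alive

build_tree(weak => 1);
my $after_weak = $Node::destroyed;      # 2: both nodes of the second tree are freed

print "freed with strong parent links: $after_strong\n";
print "freed with weak parent links:   $after_weak\n";
```

With the strong link, nothing is freed until the program exits; with the weakened link, the whole tree is reclaimed as soon as the last outside reference to the root disappears, which is the behaviour HTML-Tree 5 now adopts.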

Unfortunately, it can also break code that was working. Even though that code probably leaked memory, that’s not a big problem with a short-running script. The one real-world example I’ve found so far is pQuery’s dom.t. In pQuery 0.08, it does:

my @elems = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>')
                       ->childNodes;
my $comment = $elems[1];
is $comment->parentNode->tagName, 'DIV', 'Comment has parentNode';

Notice that it’s not saving the result of the fromHTML call; only the child nodes. Since children now have only a weak reference to their parent, the root node is deleted immediately, and $comment->parentNode is undef.

This can be fixed by saving a reference to the root node:

my @elems = (my $r = pQuery::DOM
                     ->fromHTML('<div>xxx<!-- yyy -->zzz</div>'))
                     ->childNodes;
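The same effect can be reproduced without pQuery or HTML-Tree. The sketch below is an assumption-laden miniature: the MiniNode class and its from_pair, child_nodes, and parent_node methods are hypothetical and only mimic the shape of the calls above. It shows the parent link vanishing when the root is a temporary, and surviving when the root is kept in a variable.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical miniature DOM; the names only mimic the pQuery calls above.
package MiniNode {
    sub new {
        my ($class, $name) = @_;
        my $self = { name => $name, kids => [], parent => undef };
        return bless $self, $class;
    }

    # Build a root with one child whose parent link is held weakly.
    sub from_pair {
        my ($class, $root_name, $kid_name) = @_;
        my $root = $class->new($root_name);
        my $kid  = $class->new($kid_name);
        push @{ $root->{kids} }, $kid;
        $kid->{parent} = $root;
        weaken($kid->{parent});
        return $root;
    }

    sub child_nodes { @{ $_[0]{kids} } }
    sub parent_node { $_[0]{parent} }
    sub name        { $_[0]{name} }
}

package main;

# The root is a temporary: it is freed when the statement ends,
# so the child's weak parent link is cleared.
my @orphans = MiniNode->from_pair('DIV', 'comment')->child_nodes;
my $lost    = $orphans[0]->parent_node;

# The root is kept alive in $root, so the child's weak link stays valid.
my @kids  = (my $root = MiniNode->from_pair('DIV', 'comment'))->child_nodes;
my $found = $kids[0]->parent_node;

print defined $lost ? "parent kept\n" : "parent gone\n";
print $found->name, "\n";
```

The only difference between the two statements is whether something still holds a strong reference to the root after the statement finishes.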

As a quick fix for broken code (and to determine whether it’s the weak references that are causing the breakage), you can say:

use HTML::Element -noweak;

This (globally) disables HTML-Tree’s use of weak references. But this is just a temporary measure. You need to fix your code, because the -noweak option will be going away eventually.

If you want to ensure that weak references are enabled, you can say:

use HTML::Element 5 -weak;

(It is necessary to include the version number, because previous versions of HTML-Tree simply ignored the import list.)

The next major change I’m planning for HTML-Tree is to make parse_file use IO::HTML by default. Right now, it opens files in binary mode, which means that it doesn’t do the right thing when the file isn’t ISO-8859-1. IO::HTML uses the HTML5 encoding sniffing algorithm to open files using the right encoding. But you don’t have to wait for HTML-Tree 6; you can start using IO::HTML today. Just load IO::HTML and then call $tree->parse_file(html_file($filename)). (It also works with new_from_file.)

5 Comments

This feels onerous.

You were making the user manage memory manually before, but in a way that could be ignored at the cost of leaking. Now you are forcing the user to do the memory management no matter whether he cares about leaks, and making him do it in a much more tedious way: where previously she had to manage the freeing of what she used, she is now having to manage the keeping alive of every single thing she isn’t using. Now the user has to be highly aware of the lifecycles of the objects whose memory he is managing, and to spell out that knowledge in his program, even if it is monkey code in trivial cases and a big effort (and sometimes an impossibility) to get right in non-trivial ones.

All I can think is: if I want to program in C I know where I can find it.

You could instead return a proxy object with a DESTROY that calls delete on the proxied object, or some other trick along those lines, to avoid circularity within objects directly exposed to the user, so that GC will mop those up properly, and they in turn can then take care of the matter behind the user’s back. That way the lifecycles take care of themselves, as befits a dynamic language.

I am this opinionated about the matter because I have myself faced the same problem in XML::Builder, and (therein lies my dilemma) not yet solved it to my satisfaction. But to punt on it by taking the worst possible way out is, to me, a job abandoned rather than done (at all, let alone well)…

I *cleanly* solved a problem identical to this about 2 years ago. I have been trying to get someone to factor this technique into a standalone module with a sane "entanglement API", but to no avail.
The technique is based on the fact that an object destruction in perl can be aborted from within a DESTROY.
All the relevant code can be found here. The main part of the implementation is the two DESTROY methods here and here.
The code has evolved since that time; you can see the current state in master (the same files, the same methods).

Hope this helps you de-break everything :)

You are a class act, ribasushi.

It is great to see support for weakrefs in HTML::Tree.

But why didn't you make 'no weakrefs' the default and weakrefs the explicit option?

You can still do it in the next release to limit the damage, but the poison is already out there: an author whose published code depends on HTML::Tree can no longer be sure whether weakrefs will be used wherever that code will be deployed.

Of course any release of any module may turn out to break existing calling code, but why do it intentionally?

About Christopher J. Madsen

I blog about Perl.