Red-Black Trees in Perl 6 Explained

I've started digging around with Perl 6 again and I'm quite pleasantly surprised. As you can see, Perl 6 is just about feature-complete. The main obstacle to using it is probably performance. When I previously checked it was 20 times slower than Perl 5, now it's only 4 times slower and catching up rapidly (update: this information was from a conversation at FOSDEM and appears to not be correct. Perl 6 is still much slower. I will try to track down more information). Jonathan Worthington's port to the JVM is going well and Nick Clark mentions that initial tests on one (admittedly small) benchmark show it significantly faster than Perl 5!. With Perl 5 already being the fastest widely used dynamic language, this might be the key to Perl 6 being ready for prime time.

Currently Jonathan hopes to allow NQP (Not Quite Perl) to cross-compile itself to the JVM within a month and hopefully have Rakudo on the JVM in a "good state" before the summer conference season. He stresses that this is a an estimate, not a guarantee. Also, it appears to be more of an exploration of how well the JVM will cope with this sort of language (my words, not his).

Given the exciting developments, I decided to start taking a serious look at Perl 6 again. I read the red-black tree implementation in Perl 6 and decided to write up an explanation to both help you see what Perl 6 can do and to refamiliarize myself with the language. Please don't be discouraged by how complicated some of the explanation will sound: red-black tree implementations can be a bit daunting for someone who's never delved into them before. In fact, most of your Perl 6 code will be fairly straight-forward. This example is deliberately chosen because of the diversity of different features it uses to create something very powerful.

Now let's get a brief understanding of what red-black trees are.

As most of you probably know, a binary tree is a tree structure where each node has, at most, two children. They allow you to do very fast lookups of data, but a naïve binary tree has a slight problem. First, here's an unbalanced binary tree:

If you wanted to find the number 70, you'd look at the root node, value 40, see that 70 is greater than 40 and start searching at the root of the righthand tree. Then you find 70. That's two comparisons. But what if you wanted to find 5? You'd have to search all the way down the left hand tree structure for a total of six comparisons. For very huge trees (or with expensive comparisons), unbalanced binary trees can be expensive and potentially not better than a linear search through an array.

Here's the same tree, but balanced (in other words, the "height" of the tree is roughly even across branches).

Now you only need four comparisons to find the number 5. The problem with naïve binary trees is that they don't have a natural mechanism to prevent them from becoming unbalanced. Thus, you can run into serious performance issues with them if you're not careful. Red-Black trees, however, are what are known as self-balancing trees: as you add or delete nodes, they keep themselves more or less balanced. Here's that same tree as a red-black tree:

Those black circles with red Xs are null terminal nodes, a requirement of a red-black tree. For our Perl 6 code below, they're represented by an Any type (more on this later).

For the purposes of this write-up, the exact details of the red-black tree are not important, but you can read about red-black trees on Wikipedia if you're so inclined. What's important for us to know about the red-black tree is that the worst case for memory is O(n) and the worst case for search, insertion and deletion is O(log n). Red-black binary trees are fast and compact. Further, when you insert or delete into a binary tree, so long as you follow the rules correctly, the tree is self-balancing and produces an almost balanced tree (its properties guarantee its O(log n) performance even with a slight imbalance).

The code to insert new nodes into a red-black tree in Perl 6 looks like this (I've rewritten it slightly to make it easier to read and follow what's going on):

enum RedBlack <R B>;

sub MAIN { 
    my $tree = Any; 

    for (1..10).pick(*) -> $node { 
        $tree = add_to_tree($node, $tree); 
        printf "%2d: %s\n", $node, $tree.perl; 
    } 
} 

multi add_to_tree( $node, $tree ) { 
    [B, insert( $node, $tree )[1..3]]; 
} 

multi insert( $node, Any:U ) { [R, Any, $node, Any] } 

multi insert( $node, @tree [RedBlack $color, $left, $pivot, $right] ) { 
    when $node before $pivot { balance $color, insert($node, $left), $pivot, $right             } 
    when $node after  $pivot { balance $color, $left,             $pivot, insert($node, $right) } 
    default                  { @tree } 
} 

multi balance(RedBlack $color, $a, $x, $b) { [$color, $a, $x, $b] } 

multi balance(B, [R, [R,$a,$x,$b], $y, $c ], $pivot, $right ) {
    [ R, [B,$a,$x,$b],        $y, [B,$c,$pivot,$right]] 
} 
multi balance(B, [R, $a, $x, [R,$b,$y,$c] ], $pivot, $right ) { 
    [ R, [B,$a,$x,$b],        $y, [B,$c,$pivot,$right]] 
} 
multi balance(B, $left, $pivot, [R, [R,$b,$y,$c], $z, $d]  ) { 
    [ R, [B,$left,$pivot,$b], $y, [B,$c,$z,$d]] 
} 
multi balance(B, $left, $pivot, [R, $b, $y, [R,$c,$z,$d] ]) { 
    [ R, [B,$left,$pivot,$b], $y, [B,$c,$z,$d]] 
}

And here's the output from one run:

 8: [RedBlack::B, Any, 8, Any]

 6: [RedBlack::B, [RedBlack::R, Any, 6, Any], 8, Any]

 4: [RedBlack::B, [RedBlack::B, Any, 4, Any], 6, [RedBlack::B, Any, 8, Any]]

 7: [RedBlack::B, [RedBlack::B, Any, 4, Any], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, Any]]

 9: [RedBlack::B, [RedBlack::B, Any, 4, Any], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, [RedBlack::R, Any, 9, Any]]]

 3: [RedBlack::B, [RedBlack::B, [RedBlack::R, Any, 3, Any], 4, Any], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, [RedBlack::R, Any, 9, Any]]]

 2: [RedBlack::B, [RedBlack::R, [RedBlack::B, Any, 2, Any], 3, [RedBlack::B, Any, 4, Any]], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, [RedBlack::R, Any, 9, Any]]]

 1: [RedBlack::B, [RedBlack::R, [RedBlack::B, [RedBlack::R, Any, 1, Any], 2, Any], 3, [RedBlack::B, Any, 4, Any]], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, [RedBlack::R, Any, 9, Any]]]

 5: [RedBlack::B, [RedBlack::R, [RedBlack::B, [RedBlack::R, Any, 1, Any], 2, Any], 3, [RedBlack::B, Any, 4, [RedBlack::R, Any, 5, Any]]], 6, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, [RedBlack::R, Any, 9, Any]]]

10: [RedBlack::B, [RedBlack::R, [RedBlack::B, [RedBlack::R, Any, 1, Any], 2, Any], 3, [RedBlack::B, Any, 4, [RedBlack::R, Any, 5, Any]]], 6, [RedBlack::R, [RedBlack::B, [RedBlack::R, Any, 7, Any], 8, Any], 9, [RedBlack::B, Any, 10, Any]]]

As I've mentioned, the exact logic of tree creation is not what is important here, but instead, we'll walk through the program to better understand how it does its magic.

Our first line is an Enum declaration:

enum RedBlack <R B>;

This allows us to create a new type, RedBlack, with two allowed values, R and B (Red and Black). They allow us to color our nodes.

Next, we have our MAIN subroutine. If you're a C hacker, you can guess that Perl 6 will automatically call this subroutine for you:

sub MAIN {
    my $tree = Any;

    for (1..10).pick(*) -> $node {
        $tree = add_to_tree($node, $tree);
        printf "%2d: %s\n", $node, $tree.perl;
    }
}

Note that if you have code outside of subroutines:

say "before";
sub MAIN { say "Main" }
say "after";

That code will run first and then MAIN will be called. The above will print "before", "after" and "Main" on separate lines, in that order.

We use my $tree = Any to assign the Any type object to tree. Any is sort of like the Any type in Moose. In Perl 6, you'll find that you cannot use my $tree = undef any more, so assigning the Any type here is more or less the same statement.

For the curious, here's a Perl 6 REPL session showing what happens when you try to assign undef to a variable:

$ perl6
> my $type = undef;
===SORRY!===
Unsupported use of undef as a value; in Perl 6 please use something more specific:
    Mu (the "most undefined" type object),
    an undefined type object such as Int,
    Nil as an empty list,
    !*.defined as a matcher or method,
    Any:U as a type constraint
    or fail() as a failure return

at <unknown file>:1
------> my $type = undef⏏;
>

Getting back to our red-black tree: the for (1..10).pick(*) -> $node { ... } statement is a for loop which assigns the elements from 1 to 10 to $node, but in random order. See the documentation for the pick() method to understand how it works.

The rest of the MAIN method looks pretty standard. We call our add_to_tree() method in that loop. It looks like this:

multi add_to_tree( $node, $tree ) {
    [B, insert( $node, $tree )[1..3]];
}

The multi keyword isn't actually needed here because we have no other add_to_tree() function, but I left it in there because that was in the original code.

You'll also notice a simple (and easy!) subroutine signature.

There's not much interesting here, other than to note the use of the B item from the RedBlack enum which states that the root must be a black node (as is required by red-black trees).

Note that add_to_tree() calls insert() and this function is declared twice with multi:

multi insert( $node, Any:U ) { [R, Any, $node, Any] }

multi insert( $node, @tree [RedBlack $color, $left, $pivot, $right] ) {
    when $node before $pivot { balance $color, insert($node, $left), $pivot, $right             }
    when $node after  $pivot { balance $color, $left,                $pivot, insert($node, $right) }
    default                  { @tree }
}

For the first number, (8 in our sample run above), we're guaranteed that insert() will be called with 8 and Any, thus ensuring that multi insert($node,Any:U) is called. The ":U" on Any:U is an adverbial saying that this value must be undefined, which it is for the first call. That returns an [R, Any, 8, Any], (Red, left child, value, right child) and the [B, insert( $node, $tree )[1..3]] takes the last three values from that, resulting in the first iteration of our tree being [B, Any, 8, Any].

Next, our loop is called with the number 6 (again, these numbers are random and your results will vary). So add_to_tree(6, $tree) will now be adding to a defined $tree. When we get to our first insert() call, we get this version:

multi insert( $node, @tree [RedBlack $color, $left, $pivot, $right] ) {
    when $node before $pivot { balance $color, insert($node, $left), $pivot, $right             }
    when $node after  $pivot { balance $color, $left,             $pivot, insert($node, $right) }
    default                  { @tree }
}

Looking closer, we see the signature is insert( $node, @tree [RedBlack $color, $left, $pivot, $right] ). This subroutine still takes just two values, a $node and a @tree, but what's that [RedBlack $color, $left, $pivot, $right]? That's a sub-signature. It says that @tree must contain a RedBlack first element, and three more of any type. Pass it an array with less than or more than four elements and ... well ... you can't. Perl won't find a multi-method matching that signature. Pass it an array whose first element isn't R or B and again, Perl won't find a multi-method matching that signature. You'll get an error message saying Can't call 'insert'; none of these signature match, followed by a list of signatures.

Since 6 ($node) is before 8 ($pivot), the following line is triggered in the sub:

when $node before $pivot { balance $color, insert($node, $left), $pivot, $right }

The balance() multi subroutine is used to ensure that the red-black tree stays balanced when you're inserting new nodes, but which one is called? Here are their signatures:

multi balance(B, [R, [R,$a,$x,$b], $y, $c ], $pivot, $right);
multi balance(B, [R, $a, $x, [R,$b,$y,$c] ], $pivot, $right);
multi balance(B, $left, $pivot, [R, [R,$b,$y,$c], $z, $d]);
multi balance(B, $left, $pivot, [R, $b, $y, [R,$c,$z,$d] ]);
multi balance(RedBlack $color, $a, $x, $b);

If we fill in the values for the when code, we get this:

when 6 before 8 { balance B, insert(6, Any), 8, Any }

The insert(6,Any) expands to [ R, Any, 6, Any ], leaving us with this:

when 6 before 8 { balance B, [ R, Any, 6, Any ], 8, Any }

When you walk through the signatures above, you'll see that none of the first four can match. Note how sub-signatures can themselves contain sub-signatures, but in this case, the original argument is not named, but we use variable names to extract the parts of the array reference being passed in.

So if you work through those signatures, you see we have the following subroutine called:

multi balance(RedBlack $color, $a, $x, $b) { [$color, $a, $x, $b] }

And ultimately leaving us with this for our tree with two nodes:

[RedBlack::B, [RedBlack::R, Any, 6, Any], 8, Any]

For the next iteration of the loop, we have the number 4. We again get to this line:

when 4 before 8 { balance B, insert(4, [ R, Any, 6, Any ]), 8, Any }

And that, after we skip some explanation of recursion (to avoid this being even more complicated than it already is), gets us this:

when 4 before 8 { balance B, [R, [R, Any, 4, Any], 6, Any], 8, Any }

And when you compare that to the signatures above, you see it matches this:

multi balance(B, [R, [R,$a,$x,$b], $y, $c ], $pivot, $right) {
    [ R, [B,$a,$x,$b], $y, [B,$c,$pivot,$right] ]; 
}

And if you take a close look, you'll see that our previous pivot, 8, is now in the right-hand tree and 6 has become our new pivot (root node), thus balancing our tree.

I'll skip the rest of the code as it's only relevant if you really want to understand how red-black trees work. I'll just say that this tiny bit of code shows a huge amount of power. By pushing into the core language problems that programmers used to solve themselves, you can let programmers create more declarative descriptions of their code rather than writing huge amounts of procedural spaghetti which are merely designed to work around limitations of their language's expressivity (is that a word?).

In fact, after working through this code and rewriting a bit, you could replace the insert() functions with this:

multi insert( $node, Any:U ) { [R, Any, $node, Any] }
multi insert( $node, [RedBlack $color, $left, $pivot where { $node before $pivot }, $right ] ) {
    balance $color, insert($node, $left), $pivot, $right;
}
multi insert( $node, [RedBlack $color, $left, $pivot where { $node after $pivot }, $right ] ) {
    balance $color, $left, $pivot, insert($node, $right);
}
multi insert( $node, @tree [RedBlack $color, $left, $pivot where { $node eqv $pivot }, $right ] ) {
    return @tree;
}

That eliminates the when checks in our code. Taking advantage of multi-methods can eliminate most uses of if statements (even hidden if statements such as given/when blocks).

Whether or not that's clearer to you remains to be seen.

Moritz has written a bit more about pattern matching and sub-signatures. He, Carl Mäsak and Jonathan Worthington also answered a couple of questions I had about tricky bits of this code. That being said, no one's reviewed this post so any glaring errors are mine.

Your homework: write the code to search this binary tree. It should return true or false, depending on whether or not a given value is in the tree. Remember that any node is represented as:

[ RedBlack $color, $left-sub-tree, $value, $right-sub-tree ]

For searching, color is irrelevant. Sub trees have, of course, the same structure. Terminals are Any. Recursion is your friend.

About Ovid

user-pic Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/