Handling Hierarchy in Databases

There have been a lot of solutions to solve the hierarchy in a relational database. I'm pretty sure that I've tried them all, but ultimately they all fell short for one reason or another. Nested Sets are a beautiful answer, but fall short if you have too much data. Materialized paths are great except that they don't maintain ordered siblings. Many others require complex joins, special triggers, or stored procedures that you can't use in SQLite, or would require you to do something different in each database you support.

Through much trial an error, I solved the problem with a mechanism that is easy to use, fast, and will work with any SQL database. It doesn't require stored procedures, triggers, joins or anything else, so it will work even on the simplest of databases. I call that solution Lineage.

I actually solved this problem years ago for WebGUI, but recently it came up again with Ovid writing his new forum system. So hence the reason I bring it up now, in the hopes that I can save others much pain. I should say, this solution isn't just for forums. It will work for anything that you need to store hierarchically.

Lineage works like this: It's a string of zero padded numbers, that combine together to form an ordered path. A lineage might look like this:

000001000004000059

If you were using lineage in a forum what that might say is that you're looking at the 59th reply to the 4th reply to the first post. But I'm getting ahead of myself. Let me break it down further.

Each set of 6 numbers is called a rank. You join the various ranks together to form a lineage. So in this case we know that 59 is a child of 4, and 1 is the parent of 4. The number of characters you use in the rank doesn't have to be 6, but however many you use, must be consistent through out. If you use 6 characters, then you will have up to 999,999 leaves per branch of your hierarchy. If you know you don't need that many nodes, maybe you'd rather go with 4 characters, in which case the same lineage would be written like this:

000100040059

It reads the same, but now you can have only 9,999 leaves per branch. That's still probably more than enough for a forum system, because that means that each post can have 9,999 replies, and each reply can have 9,999 replies. The only problem with it might be that it would limit you to 9,999 posts in a given forum. Maybe not a bad problem if you have a medium sized community and you archive the forum each year. Anyway, I recommend using 6 characters, so that you have 999,999 leaves per branch, but feel free to structure it to suit your needs.

So now that we know we have a series of ranks that go into making up a lineage, how does that help us? Well it's simple. I can now write queries like this:

select * from forum where lineage like '000001%' order by lineage;

That says, give me the first post and all of it's replies, in order. So in one query, with no joins, and no stored procedures, I can pull back an entire hierarchy.

Let's say you wanted to get all the direct replies to the first post, but not the first post or any of the replies to the replies. You could then write something like this:

select * from forum where lineage like '000001%' and length(lineage) = 12 order by lineage;

The lineage can also tell you all sorts of things. For example, given this lineage:

000007000099000342000008

I can tell you the following things:

1) We're dealing with the 7th post in the forum.

2) It's parent is 000007000099000342

3) There were lots of replies to 000007000099 and really even to 000007, so this is a hot thread.

4) All of the replies to this reply will start with 000007000099000342000008

5) All the siblings to this reply with start with 000007000099000342

6) There are at least 7 siblings to this reply. (Or at least were if none were deleted).


Other Advantages
There are lots of other advantages to using lineage as well. For example:

1) If you want to use lineage, but you want to have millions of leaves per branch, you can use either hex, or base 64, instead of base 10 numbers in the lineage. That gives you far more levels of hierarchy in a far smaller lineage.

2) When a branch is deleted, lineage ordering is maintained, and none of the other branches needs to be updated.

3) Branches can easily be moved into other branches through a simple program to rewrite the lineage. And likewise, children can actually become roots the same way.

4) The lineage is quite human readable, and still very database friendly. This leads to very easy debugging when problems arise.

5) When new branches are added, the existing branches do not need to be modified in any way.


Beyond the Box
Lineage can actually organize more than just a bunch of threads in a forum. You could extend lineage out to each forum, so that the forums could be organized into a message board. ie: the forums themselves have a lineage, so the first rank in a post's lineage is actually which forum it belongs to.

It could then be extended out to either pages or sections in the site that the forums belong to, so that if you had a lineage like this:

000001000007000003000342000008

What you're seeing is from the home page (000001), we are in section the support section (000007), on the developers forum (000003), looking at the 342nd post (000342), and more specifically the 8th reply to that post (000008).


Conclusion
I could go on and on with examples, but this post is already practically a chapter in a long book. So I think I'll leave it there, and let you ask questions if you have any. I hope this helps anybody who's running up against this problem. It's fast, reliable, compatible, and easy to implement/use. In 10 years of developing hierarchical systems that are persisted in relational databases, I've not come across any solution that works better.

4 Comments

A small technical rub, isn't this handling the hierarchy in the data rather than in the database, even though that data is in a database?

Another thought, what your call lineage is not so very different from SNMP. A similar idea is used there, and in fact many many other places in the computing world. You've given each message a 'serial number' or 'message number' so to speak.

I had thought this not a valid answer to the problem since it's not derived from the database itself... oops

Aren't what you describing essentially materialized path?

Leave a comment

About JT Smith

user-pic My little part in the greater Perl world.