Guess the Solution: Materialized Paths and "ORDER BY"

Ovid on 2009-08-14T15:07:26

Imagine that you need to store the following tree structure in your database:

    5---
   / \  \
  62 32  7
 /  /   / \
9  3   29  1
          / \
         48  55

One way this is typically handled is by creating a table linking parents and children:

parent | child
-------+------
5      | 62
5      | 32
5      | 7
62     | 9
... and so on

Because SQL isn't brilliant at recursion (and because most of us aren't as brilliant as Joe Celko), we find ourselves writing several SQL calls to walk up the tree. However, this becomes prohibitively expensive when you have millions of objects in your tree and you need to repeatedly traverse it. Enter materialized paths.

A materialized path is a database denormalization where you flatted each branch of the tree. You join all of the parents, in order, with a separator character which is guaranteed not to be in the id. So the materialized path for 48 might be 5.7.1.48.

node | path
-----+---------
1    | 5.7.1
3    | 5.62.3
5    | 5
7    | 7.32
9    | 5.62.9
29   | 5.7.29
32   | 5.32
48   | 5.7.1.48
55   | 5.7.1.55
62   | 5.62

They can be tricky to maintain, but if you get them right, you can find all parents IDs in a single SQL statement.

Today I ran into a nasty problem where I needed the parents, but I had to guarantee that the order of those parents was preserved at the database level (via ORDER BY). The SQL looks sort of like this:

SELECT *
FROM  tree
WHERE id IN (5,7,1,48)
ORDER BY ...

Um, what goes in that order by? I need the items returned in the same order as the ids in the IN ids. I asked many people and they were stumped. They came up with Oracle specific answers. Some suggested writing the ids to a temp table with sort criteria and joining on that. Others through up their hands. One person suggested a complicated case statement.

Then one of my colleagues, Mark Morgan, said "can't you just ..."

The solution was rather counter-intuitive to me, but once I saw it, I was dumbfounded in its simplicity. Do you see it?

Update: I've posted the solution in the comments, so if you want to think about it a bit first, don't read 'em.


FIELD?

Adrian on 2009-08-14T15:33:03

Too lazy to test it by would something like

ORDER BY FIELD( id, 5,7,1,48)

do the job?

Re:FIELD?

Ovid on 2009-08-14T15:46:29

I think that's MySQL specific. The very simple solution is much easier.

Re:FIELD?

Adrian on 2009-08-14T15:57:25

Ah...

ORDER BY CHAR_LENGTH( path )

?

Re:FIELD?

Ovid on 2009-08-14T16:11:57

That's pretty much it. I responded to Ricardo with the solution and a caveat.

too easy?

rjbs on 2009-08-14T15:43:00

This seems obvious?

    ORDER BY id=5, id=7, id=1, id=48

Re:too easy?

Ovid on 2009-08-14T15:49:47

Is that SQL standard? The solution I've gone with is actually much simpler.

Re:too easy?

rjbs on 2009-08-14T15:52:00

I have never seen a SQL where that didn't work. I am also interested to see what could be much simpler than that. :)

Re:too easy?

Ovid on 2009-08-14T16:03:20

OK, a few people on Twitter have struck out on this and Abigail guessed something similar to what you and Adrian guessed.

The solution is:

ORDER BY length(path)

Turns out it's a deterministic emergent property of materialized paths. There is one caveat I realized, though. If you have multiple trees and you want to see the root nodes, you can do this:

SELECT id FROM tree WHERE id = path;

However, some people omit the ID from the path. So you'd have this (in particular, note the path for the 5 node):

node | path
-----+---------
1    | 5.7
3    | 5.62
5    | NULL
7    | 7
9    | 5.62
29   | 5.7
32   | 5
48   | 5.7.1
55   | 5.7.1
62   | 5

To get the parents, you'd have to do this:

SELECT id FROM tree WHERE id IS NULL;

But that means when sorting paths by length, length(node) sometimes returns NULL and the sort order on that is database dependent (and usually configurable). So you would need to do this to sort the ids correctly:

ORDER BY COALESCE(length(path),0))

Re:too easy?

rjbs on 2009-08-14T16:21:50

Seems problematic:

1.23.45.67

1.2.3.4.5

Perhaps you could count dots.

Re:too easy?

Ovid on 2009-08-14T16:24:31

No, you can guarantee that conflicting paths won't exist as you walk up the tree.

Re:too easy?

Adrian on 2009-08-14T16:25:27

_"Seems problematic:"_

Nah - the problem is selecting all the parents of a child in order. The length of the path is guaranteed to increase.

Re:too easy?

huxtonr on 2009-08-15T08:59:33

The advantage of explicitly counting dots (rather than relying on string length to always increase with the number of dots) is that it makes path-lengths comparable. Not directly applicable to this problem, but not uncommon.

Re:too easy?

jmazon on 2009-08-14T16:41:10

Wouldn't it be enough to just ORDER BY path?

Re:too easy?

Ovid on 2009-08-15T07:09:08

With the caveat above, yes, I think it might be.

Have you looked at nested sets?

melo on 2009-08-15T08:14:48

Depending on the size of the tree, I've found that the nested sets representation is very nice for queries.

The query that you need is given as an example in this page:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

(Although this is a mysql.com site, there is nothing MySQL specific with the solution)

The problem with nested sets is updates: they might change a large number of records - worst case, inserting a new node at the left-most place will update all the rows.

But smaller trees, they work very very well.