Databases: Paging and Hierarchies

gnat on 2003-03-31T12:34:24

I just put the database chapter to bed, and found two interesting things while working on it. First was that there doesn't seem to be a generic module for paging through results. I ended up writing code to do this for the book (you know, displaying a subset of a table with "Viewing Results 26-50" and the appropriate back/forward buttons) and while it was web-specific, I kept thinking "it wouldn't be too hard to generalize this". Before I do so, has anyone else invented this particular wheel first?

And I learned a cute trick for searching hierarchies. You know how to store a tree in a table, right: you give each node an id and store the parent id of the node as well:

CREATE TABLE node ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, parent INT, payload TEXT )

So now you can find the children of node 5 easily: SELECT * FROM node WHERE parent=5. But it's hard to select all the children of node 5. That requires a tree traversal, which involves lots of database queries, which gets jugly fast.

The cute trick is to build another table containing the path of each node (".1.5.12.19." means that this is node 19 whose parent is node 12 whose parent is node 5 whose parent is node 1). Then finding node 5 and its children is as simple as:

SELECT id,path FROM paths WHERE path LIKE "%.5.%"

Suuuper sneaky! I'm really beginning to appreciate how different it is to program in SQL...

--Nat

HTML::Pager?

LTjake on 2003-03-31T12:41:39

I think HTML::Pager fits the bill (atleast somewhat). Can't say I've ever used it, though. Looks like it depends on HTML::Template for its output (although it claims it can still be used without it). HTH.

Data::Page

Dom2 on 2003-03-31T12:49:40

You probably want to investigate Data::Page and it's cousin Data::PageSet.

I wrote a paging module before. I wouldn't wish the edge cases on anybody.

-Dom

Re:Data::Page

Dom2 on 2003-03-31T12:55:34
Oooh, I forgot to mention Class::DBI::Pager, which is wizzy if you're using Class::DBI.
-Dom

Re:Data::Page

gnat on 2003-04-01T00:01:09
Well bollocks. I could have sworn I googled without success for "page database results in perl". So much for being done with Chapter 14! Thanks,
--Nat

Re:Data::Page

brev on 2003-04-09T09:21:35
This is ancient history by now but Tim Bunce supposedly did a review of this, at OSCON in 1999 or 2000?

http://www.carumba.com/talk/perl/multiview.shtml

IMO paging through RDBMS results is always going to be ugly since SQL is a set-oriented language. You *have* to break the model to do that.

Cute Tricks and Premature Optimizations

ziggy on 2003-03-31T14:41:39

The cute trick is to build another table containing the path of each node (".1.5.12.19." means that this is node 19 whose parent is node 12 whose parent is node 5 whose parent is node 1).

Beware that this is a premature optimization.

While this technique works fine with a static tree, it makes tree transformations hideously difficult. Imagine the pain that comes from moving node 12 to be a sibling of node 5, or removing node 5 and consequently promoting their children up a level.

The other technique of using a single integer column to store parent_id is the most flexible, even if the SQL is slightly hairier.

Alternatively, you can select the entire subtree as a two pass operation. First, (recursively) find all of the children of node 5. Next, build a UNION query of all parent nodes that are part of the subtree rooted at node 5. If you build your union query properly, the nodes also come back in-order.

Re:Cute Tricks and Premature Optimizations

Dom2 on 2003-03-31T15:57:55
Heh, if you wanted to be really ugly, you could keep the parent id in the sme table, but have a trigger to rebuild the secondary table with the complete path-like listings. It'd be heavy on the database, but might be affordable depending upon how many updates you're doing.
-Dom

Re:Cute Tricks and Premature Optimizations

gnat on 2003-04-01T00:12:43
It seems a bit premature to call this premature :-) As with every problem, the right data structure depends on your data and how it's accessed. Just as an alphabetized flat file is quick to binary search but slow to insert, whereas an unordered flat file is quick to insert (append) but slow to search, you choose your solution based on what you know about the data. If you were going to be doing a lot of hoists or reparenting operations, then I guess you'd have to test it in the field. This doesn't change the cuteness of the hack, though.
How do you recursively find all the children of node 5 without doing a ton of SELECTs? That's the problem that a path table gets around.
--Nat

Re:Cute Tricks and Premature Optimizations

ziggy on 2003-04-01T00:50:20

How do you recursively find all the children of node 5 without doing a ton of SELECTs? That's the problem that a path table gets around.
I personally don't think that the "ton of SELECTs" is all that horrible. That's probably my personal style though.
It starts out with something like this:
SELECT id FROM nodes WHERE parent IN (5)
...which returns a list of IDs. That list of IDs then goes into a new SQL statement. The IDs are also pushed into a list that contains all IDs returned from your per-level queries:
SELECT id FROM nodes WHERE parent IN (6, 7, 8, 9)
That process is very amenable to a simple while loop:
my @ids = (); my @new_ids = (5); ## starting condition while (@new_ids) { push(@ids, @new_ids); my $set = join(", ", @new_ids); @new_ids = do_query("SELECT .... WHERE id IN ($set)"); } ## @ids contains all children rooted at the subtree at node 5
At the end of that, you've executed N queries to find all nodes that are 1..N levels beneath your root. From there, it's only one more query to obtain all nodes. You could even rewrite the queries to return all the children at each level over N queries instead of doing the N+1 query. That N+1st query can help to automatically order the nodes though.
With this property, I don't know that the work to "compile" the tree structure into a text field is all that much of a benefit. But that's just my gut feeling, and of course I could be wrong about this.

Re:Cute Tricks and Premature Optimizations

darobin on 2003-04-01T09:50:22

I've written code as the one you describe dozens of times and I must say that it certainly was "fast enough". I'd be worried if I had very deep trees though, as that would be when it could become costly.

But then why store a tree in an RDB? Wouldn't XPath be a *lot* more pleasant to access random things in a tree? ;)

Hierarchies

autarch on 2003-03-31T16:32:06

If SQL didn't suck so much, it'd offer built-in methods to explode trees and return the result. But it does suck.

Fabian Pascal talks about hierarchies a fair amount in his book Practical Issues in Database Management. Interesting stuff.

Data::Page with Class::DBI

cwest on 2003-03-31T17:03:21

My current favorite DBI abstraction is Class::DBI. In any case, paging has become just easier with Class::DBI::Pager.

You'd be better off converting hierarchies to sets

jnoble on 2003-03-31T20:03:59

Nat,

I've been doing a bunch of hierarchy work recently in SQL, and the hands-down best way to do the type of typical hierarchy queries that I do in SQL is to store (or at least additionally represent) hiearchies as sets.

See, SQL is a set-operation language. So having hierarchies stored as an adjacency list (e.g. employee table has a 'manager' column) -- although sufficient -- forces you out of SQL and into a procedural language for tree-walking.

Imagine you want to get "this person's management chain" or "everyone in that person's organization" or "all the managers". But you don't want to just get that, you want to mix in all the other cool SQL predicates like "AND pay > 50000 AND state = 'CO'".

Joe Celko describes an excellent approach to this in his _SQL for Smarties_ book. To set this up for the first time, you need to do a recursive pass through the tree -- but you only need to do this once. From then on you can express very powerful hierarchy relationships in SQL very naturally.

Essentially you do a depth-first traversal of the tree, starting at the root, and assigning "left" and "right" numbers along the way from a single monotonically increasing counter. So the root node gets left=1. Then you recurse down to the manager's first subordinate, assign left=2, and recurse down to her subordinates, if any. If at any point a node has no subordinates OR you've already walked them all, assign that node's 'right' to be the next number. That means that the final number will be assigned to the root node's 'right' attribute.

Here's the cool part: Once you're done, here's some examples of what you can do:

(jnoble's left=233 and right=534 for these example)

My managment chain (my boss, bosses' boss, etc. up to the root):

SELECT id FROM employees WHERE left = 233 and right > 534;

While others will have left values less than mine, and right values greater than mine, only my management chain will have both.

My entire organization (me and everyone 'below' me):

SELECT id FROM employees WHERE left >= 233 AND right = 233 AND right = 534 AND state = 'CO';

All company managers (people with other people reporting to them) who are on leave:

SELECT id FROM employees WHERE right = (left+1) AND status = 'Leave';

(Leaf nodes have 'right' values that are one more than their 'left' value -- anyone else must be a manager.)

Combine together with your basic logic statements to grab groups comprised of multiple sub-hierarchies, subtracting out other sub-hierarchies, joining to other tables, etc. etc. It's a very "native" way to work with hierarchies in SQL.

In practice, I keep the adjacency list ('manager' attribute) as my source of record, since that's how HR gives it to me. Plus it's useful to get immediate manager or immediate direct reports. Each time I reload the HR data into my tables, I do a quick traversal in perl to add 'left' and 'right' columns. Then I can do some pretty darn cool hierarchy stuff right in SQL without having to pop into a procedural language each time.

Oh, and: changes within the hierarchy are supported well by this model. See Celko's book for details, among other places. For my purposes, since I'm just a consumer of the hierarchy data that's kept up-to-date on a separate system, I haven't had to bother. But it's nice to know that I could.

Joel Noble

Re:You'd be better off converting hierarchies to s

jnoble on 2003-03-31T20:24:55
Crap, my post got its < and > mangled, even though I told the submission widgit it was plain text.
Here's the decoder guide -- not actual SQL.
My Management Chain: Select where left LESS THAN my_left AND right GREATER THAN my_right
My Vast Empire Select where left GREATER THAN OR EQUAL TO my_left AND right LESS THAN OR EQUAL TO my_right Joel Noble

Finding subtrees

dws on 2003-04-01T01:31:20

If you have a static tree (i.e., no inserts or infrequent, scheduled inserts), there's a trick from one of Joe Celko's SQL books that might apply. (From memory...) Walk the tree using a linear sequence generator that starts at 1. As each node is reached pre-order, store (insert/update) the next id that well be generated. The next id represents the "left" side of the subtree rooted under each non-terminal node. As the node is reached post-order, assign (update) its id from the sequence generator. When you do this, the root id can be found via

SELECT COUNT(*) FROM tree

SELECT MAX(id) FROM tree

and the entire subtree under a node can be selected by

SELECT * FROM tree WHERE id >= ? AND id < ?

where the values you plug in are the "first id in subtree' value cached in the parent node and the node id of the parent.

This scheme works wonderfully when you have static data, or data that you can update (and re-walk) on a scheduled basis, but fails miserably if the table needs to be updated live.

Re:Finding subtrees

gnat on 2003-04-01T01:42:40
You're the second person to recommend a Celko book. Hi ho, hi ho, it's off to Amazon I go ... :-)
Thanks!
--Nat

Materialized Path

treborhudson on 2003-04-02T22:47:22

I just read an article today on something similar to this. Read the section on 'Materialized Path' at this site:
http://www.dbazine.com/tropashko4.html

tree traversal in SQL

brev on 2003-04-09T09:52:25

Nobody has mentioned Oracle's CONNECT BY. Hierarchical queries in one statement!

I doubt Ziggy is aware of this, but the tree hack you describe is roughly how ASPN works. Ugly and inflexible, yes, but it gave us the performance we wanted, without having to shell out for Oracle.

It may have been a premature optimization. I personally was horrified by the idea during the design meetings. But it works very well in practice and it's easy to understand.