RDF: What is it good for?

ziggy on 2004-07-09T03:00:26

I had an interesting conversation at work today. We have a product to develop that's still in the early stages of requirements analysis at the moment, and someone was asking if there's a role for RDF in the system.

First, a little background. We've built a pretty sophisticated publishing system to manage large corpora of legal documents. All the standard bells and whistles -- loading up large batches of XML documents, managing subscriptions, multiple views, cross-linking all over the place.

The data is pretty regular as far as data sets go. It's complex and hierarchical, and there's a whole big mess of it, but there are only a small number of relationships to model. All of the data is sitting in Oracle, and the data model is pretty sound.

There are a few new features that might be easier to develop using RDF and a reasonable triple store. Or topic maps. Or something other than the hairy relational schema we've got in place. We don't know, but it should be an interesting diversion finding out.

So how would RDF fit?

Turns out that it probably wouldn't. RDF is great if you have a big mess of data but don't have a set structure. But we are a publisher, and like most publishers, we have to deal with a large volume of data that always fits into a set structure that's benefited from years of successive refinement.

Once you get past the gnarly syntax behind RDF/XML, RDF and a relational database are effectively equivalent. RDF is based on a solid foundation of graph theory, while relational databases are based on a solid foundation of set theory. In RDF, the focus is on resources and arcs between resource nodes. In a decent relational database model, the "resources" are rows of data in tables, and the arcs are the primary key/foreign key relationships across tables.
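To make the rows-and-keys versus resources-and-arcs comparison concrete, here is a minimal sketch in Python. The tables, column names, and the `doc:` identifier scheme are all made up for illustration; nothing here reflects our actual schema.

```python
# Hypothetical relational data: a table of documents and a table of
# citations whose from_id / to_id columns are foreign keys into documents.
documents = [
    {"id": 1, "title": "Statute A"},
    {"id": 2, "title": "Case B"},
]
citations = [
    {"from_id": 2, "to_id": 1},
]


def doc_uri(doc_id):
    """Mint a made-up resource identifier for a document row."""
    return f"doc:{doc_id}"


def rows_to_triples(documents, citations):
    """Restate the rows as (subject, predicate, object) triples."""
    triples = []
    for row in documents:
        # A column value becomes an arc from the resource to a literal.
        triples.append((doc_uri(row["id"]), "title", row["title"]))
    for row in citations:
        # A foreign-key pair becomes an arc between two resources.
        triples.append((doc_uri(row["from_id"]), "cites", doc_uri(row["to_id"])))
    return triples


triples = rows_to_triples(documents, citations)
for triple in triples:
    print(triple)
```

The same facts come out either way; the difference is whether the set of possible arcs is fixed up front by the table definitions or left open.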

As a result, relational databases are set up for pre-determining all of the relationships resources may have, and storing a whole mess of similar resources in an efficient manner. In RDF, the data model has no fixed schema at all, yet offers the same ability to do ad-hoc queries across the contents.
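As a rough illustration of ad-hoc queries over data with no predetermined structure, here is a toy in-memory triple list with a wildcard pattern matcher. The `match` helper and the sample identifiers are hypothetical; a real triple store would offer a proper query language instead.

```python
# A toy triple store: just a list of (subject, predicate, object) tuples.
# Resources need not share the same set of properties.
triples = [
    ("person:alice", "name", "Alice"),
    ("person:alice", "knows", "person:bob"),
    ("person:bob", "name", "Bob"),
    ("person:bob", "homepage", "http://example.org/bob"),
]


def match(triples, s=None, p=None, o=None):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Whom does alice know?
known = [o for (_, _, o) in match(triples, s="person:alice", p="knows")]

# Everything recorded about bob, whatever properties happen to exist.
about_bob = match(triples, s="person:bob")

print(known)
print(about_bob)
```

Adding a brand-new kind of property means appending another tuple; no schema change is needed, which is the trade-off against the efficient, predetermined layout of a relational table.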

So, for big traditional publishers, with large volumes of regular data, it may turn out that RDF is actually a step backwards from the big relational databases that are currently in place. But for new applications, like FOAF, RDF is the perfect data model to accommodate data coming in from millions of different sources where the structure is both unknown and unknowable in advance.


Foundation's Edge

TorgoX on 2004-07-09T04:03:27

RDF is based on a solid foundation of graph theory, while relational databases are based on a solid foundation of set theory.

Is that like saying that pocket change is based on a solid foundation of Peano's Axioms?

Re:Foundation's Edge

ziggy on 2004-07-09T12:56:44

Not really. We've been creating data stores for about as long as we've had computers. There are a lot of ad hoc data storage mechanisms that have come and gone over the years. Try doing an inner join with an ISAM file in COBOL, or a CSV file for that matter. With relational databases or a good triplestore, it's a primitive operation.
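A small sketch of that contrast, using Python's standard `csv` and `sqlite3` modules. The two CSV snippets and their column names are invented for the example. With flat files the join logic is yours to write and maintain; with a relational engine it is one clause in the query.

```python
import csv
import io
import sqlite3

# Made-up CSV data; the third book points at an author that does not exist.
authors_csv = "id,name\n1,Ann\n2,Bo\n"
books_csv = "author_id,title\n1,Alpha\n2,Beta\n3,Orphan\n"

authors = list(csv.DictReader(io.StringIO(authors_csv)))
books = list(csv.DictReader(io.StringIO(books_csv)))

# Hand-rolled inner join over the CSV rows: build an index, then probe it.
name_by_id = {a["id"]: a["name"] for a in authors}
manual = [
    (name_by_id[b["author_id"]], b["title"])
    for b in books
    if b["author_id"] in name_by_id
]

# The same join as a primitive operation in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE books (author_id TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(a["id"], a["name"]) for a in authors],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(b["author_id"], b["title"]) for b in books],
)
joined = conn.execute(
    "SELECT a.name, b.title FROM authors a "
    "INNER JOIN books b ON b.author_id = a.id "
    "ORDER BY b.title"
).fetchall()
conn.close()

print(manual)
print(joined)
```

Both approaches drop the orphaned book, but only the hand-rolled version makes you remember to do so.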

Also, a good data model focuses on the properties of an entity, and infers the relationships across entities. That comes out of a good grounding in solid theory. Ad-hoc data stores treat relationships between data elements as just more data, so they need to be actively managed (and can be corrupted rather easily). This kind of tedious makework is what led Codd to find a better abstraction 30-odd years ago.

(FWIW, we're seeing the same thing in XML Schema languages. WXS is a bear, but RelaxNG is predicated on some automata theory I don't even pretend to understand. But I know it's easier to model stuff with RelaxNG than it is to do with DTDs, WXS, or a roll-your-own schema language.)