I was so terribly wrong about databases

Ovid on 2005-12-06T07:00:50

Yeah, I've heard of Date and Codd. I've even read some of the former. I have felt moderately confident in my database knowledge. And then a recently published interview with Date has made it clear that I've committed the sin of listening to the majority rather than digging in for the facts. In particular, this passage simultaneously reveals my ignorance and makes me happy:

Object/relational DBMSs: To a first approximation, "object/relational" just means the domains over which relations are defined can be of arbitrary complexity. As a consequence, we can have attributes of relations--or columns of tables, if you prefer--that contain geometric points, or polygons, or X rays, or XML documents, or fingerprints, or arrays, or lists, or relations, or any other kinds of values you can think of. But this idea too was always part of the relational model! The idea that the relational model could handle only rather simple kinds of data (like numbers, and strings, and dates, and times) is a huge misconception, and always was. In fact, the term object/relational, as such, is just a piece of marketing hype ... As far as I'm concerned, an object/relational system done right would simply be a relational system done right, nothing more and nothing less.

There's quite a bit more where he tackles common misconceptions and how they contradict the relational theory. I believe I have another book to buy (though I'm still not certain how I feel about his objection to NULLs, but I'll read again what he has to say).

In other words, it turns out that much of what I've said about the relational databases was me being ignorant and being cocky about that ignorance. I'm more than a little embarrassed by this, but it's better that I just spit it out now.

And in other news, since Andy's already been announcing this, check out news.perlfoundation.org.


Database in Depth

davorg on 2005-12-06T12:56:42

Reading Date's book recently just reminded my of how much database theory I've forgotten since I was at college. He's right, of course, that the current set of popular database systems only implement a small part of Codd's original ideas[1] and that SQL is only a relational language in the very loosest meaning of the word.

But, of course, we all need to get some real work done. And you're going pay the rent a lot easier if you're an expert on SQL that you are if you're an expert on Tutorial D[2] :)

[1] If this kind of stuff is interesting to you then it's worth going and reading Codd's original paper.

[2] Except, perhaps, if you're Chris Date.

NULLs

Theory on 2005-12-06T15:55:45

I haven't got to his criticism of NULLs in the book, yet, but I have my own ideas about why they suck. First of all, a NULL is supposed to mean "unknown." The problem comes when you need to represent "known, but empty." This is fine for character fields, where you can set an empty string (Larry calls it a "null string"), and it not null but has no real value, either. But how do you do that for, say, integers? For dates? You could use 0 for integer, but that's an actual value. You could use '0000-00-00T00:00:00' for the date, but that's an actual date (Happy Birthday, Jesus!). So what do you use for "the value is not unknown, it's intentionally left blank"? SQL gives you no way to do that.

I've been meaning to blog this for a while, but what the hell, might as well just say it here.—theory

Re:NULLs

Ovid on 2005-12-06T16:30:07

I just finished reading how to handle missing info without nulls. It looks really long, but there's a goof up in the PDF and the slides have been duplicated. The actual talk is very short and then there's a step-by-step walkthrough of the D-like query they use to get the answer.

One potentially unsatisfactory approch is to use an extra column to denote the type of null information of another column.

id   name    salary  salary_reason
1    Mary    20,000  null
2    John    null    unemployed
3    Bob     null    unknown

You've now successfully disambiguated the information. I really like the examples put forward in the paper I linked, but this single relation seems much easier to work with. The example in the paper would turn that into four tables and creates "distributed" constraints for them. However, I really do like how the inserts wind up getting handled so the database remains consistent between atomic operations rather than merely guaranteed to be such at the end of a transaction.

Re:NULLs

jplindstrom on 2005-12-06T19:03:41

What's the difference between an "atomic operation" and a "transaction"?

Re:NULLs

Ovid on 2005-12-06T19:43:59

A transaction may contain many atomic operations and becomes, itself, an atomic operation. However, the database itself may be in an inconsistent state during a transaction. The relational theory outlined by Date requires that no atomic operations allow the database to be in an inconsistent state. For example, let's say you have an employee relation (table) and salary values are moved into separate tables defining them as either "unknown" or "unemployed" (for previously null values) or "salary_amount" for known values.

For the "unknown", "unemployed" and "salary_amount" tables, they have a foreign key constraint against the employee id. However, the employee id has a distributed foreign key against the "unknown", "unemployed" and "salary_amount" tables. The way this is traditionally handled is to start a transaction, add an employee and then add an entry to one of the "unknown", "unemployed" or "salary_amount" tables. Then you have to validate that you only have one of those tables with an entry for a given employee. The transaction this is wrapped in requires at least two inserts.

If multiple inserts were allowed in a single statement, you could make this atomic with a single statement (note the comma):

INSERT_ROW INTO employee (1234, 'Bob'),
INSERT_ROW INTO unemployed (1234);

There are a few problems with this approach. First, DBMS generally (not universally) don't allow this type of insert. Second, SQL is a rather limited language and does not allow for easily or efficiently manipulating data in 6NF (whee! I didn't know about 6NF until yesterday). Also, implementing distributed foreign key constraints generally requires writing rules or triggers to manage them as SQL does not provide this functionality. If you read the PDF I link to, it explains it in more detail.

I also confess that I might have gotten some details wrong as I'm just learning about some new concepts. However, so many "problems" I've seen with relational databases are not problems with the databases but arbitrary constraints placed on us by SQL. It appears that my biggest objections (no lists in fields and lack of recursion) have nothing to do with relational data. Had DBMS designers worked to create a query language which truly supports relational data I suspect that databases would be much further advanced today. We wouldn't see these horrible XML or object databases as these concepts are natively supported by the relational module but not by popular DBMS.

The difference between theory and reality...

djberg96 on 2005-12-06T22:52:33

I wouldn't beat yourself up too much. He has a theory. A theory that he hasn't been able to implement, nor find someone else to implement, in 35 years. A theory that no commercial implementation has felt particularly compelled to follow to the letter, and which we've managed to do without.

But, hey, at least we have a (partial) implementation of Tutorial D to replace SQL. That only took 35 years! Woot!

This isn't to say that things couldn't be better. But writing about it and actually *implementing* it are two different things. I mean, I certainly find this interview interesting, but I'm not holding my breath that anything compelling will come of it. Maybe I'm wrong, though.

And on that note, time to check out Alphora.

Re:The difference between theory and reality...

Ovid on 2005-12-06T23:15:37

There's a possible equivocation on the word "theory". Date means theory in the scientific sense, not in the popular sense of "opinion" or "guess". The relational theory is well-grounded in set theory and predicate logic. This is a large body of mathematical theory which has been built upon since the ancient Greeks. Since even the crippled understanding of relational theory which most "experts" rely on is ignored by the masses, it's quite understandable to me that database vendors, following the herd, are unable to appreciate the math behind relational theory. This is similar to what we see every time some amateur comes up with their own crippled "XML variation", for example.

Date argues that we would not second guess an architect who understands how to build a skyscraper. Nor should we second guess aeronautical engineers designing a helicopter. In the case of databases, the people doing the building are creating crippled implementations which usually don't kill anyone. As a result, since the crippled variants "work", people challenge Date to prove that the math of the relational model provides business value. Instead, he argues, either show that the math is wrong or demonstrate how the "ad hoc" implementations which exist today provide a great value beyond "this is what we have and we're going to stick with it". I'm not saying Date is right (I don't have the math background for it but my work with logic programming suggests to me that he's not wrong), but his challenge is valid. Why should we settle for broken stuff that we can patch with bandages and bailing wire when it's long been known that there's a mathematically proven model that is both more correct and potentially easier to work with?