NULLs: yes, no, or don't know?

Ovid on 2005-12-10T20:43:54

So I finally got my grubby little hands on a copy of Database In Depth by C. J. Date. I've just finished Chapter 3 and so far, I've been blown away. My irritations with the shortcomings of databases have been wrong. They're generally shortcomings with SQL and people's (mis)understanding of relational theory.

One bit which fascinates me is Date's arguments against NULLs. It fascinates me because it really hits home regarding some problems I've had but it also irritates me because I've read these arguments before and not paid much attention to them. It's time for me to change that. Now, if I reject his reasoning, I want it to be because I've thought things through, not because I wasn't paying attention.

Part Date's rejection of NULLs stems from how terms are defined: an attribute (column) in a tuple (row) in a relation (table) must contain a name and a type. Since NULLs, by definition, are a not a type, it follows that an attribute must not have a NULL value.

Of course, without an explanation of the theory, many are going to argue that Date is just using wordplay to define NULLs out of existence. To deal with this, Date comes up with the following example:

+-----+--------+  +-----+--------+
| SNO |  City  |  | PNO |  City  |
+-----+--------+  +-----+--------+
| S1  | London |  | P1  | NULL   |
+-----+--------+  +-----+--------+

"SNO" is "Supplier Number" and "PNO" is "Part Number". If these tables are called "supplier" and "part", then fetching all part numbers is trivial:

SELECT PNO
FROM   part;

This should be logically equivalent to the following:

SELECT PNO
FROM   part
WHERE  City = City;

Of course, this query returns no rows because you cannot logically compare "unknown" values. This is fundamentally the simplest example one can give which demonstrates how NULL values can give erroneous results. However, the query is so ludicrous that many would be inclined to dismiss it out of hand. After all, we'd never write anything so useless, right?

Moving along, we need to write a query where we need to know where either the supplier and part cities are different, the part city is not Paris, or both. Here's one way to write that query:

SELECT S.SNO, P.PNO
FROM   supplier S, part P
WHERE  S.City <> P.City      -- first condition
   OR  P.City <> 'Paris';    -- second condition

Because we cannot compare NULL values and the only "part" has an unknown city, we get no rows. But is this correct? Either that unknown city is Paris or it isn't. If it is Paris, the first condition should evaluate as true and thus we'll get a result. If it is not Paris, the second condition should evaluate as true and we should get a result. Thus, logically we should get a result even though the query returns no rows.

The mere existence of a NULL in the database means we may get erroneous results in our queries. However, the more complicated the queries, the more difficult it will be to tell when we'll get erroneous results and, I suspect, the more likely it is when we will get erroneous results if the NULL columns are used in the WHERE clause.

These points really resonate with me. While I've certainly found three-value logic attractive (I don't like perl -le 'print "yes" if (undef) < 4' printing "yes"), it does appear to be problematic in the above example. I'm beginning to think that three-value logic may have a place in procedural or OO code, but perhaps not in declarative code. More than once I've worked on a large systems where NULLs were causing problems in queries that were often very difficult to track down.

What I'm wondering is what the counter-examples are. Such a fundamental issue shouldn't be controversial unless there are clear counter-examples which demonstrate why we do need NULLs. The "that's the way we've always done it" argument doesn't hold much sway with me because I view that akin to "hitting my thumb with a hammer is better than hitting it with an axe" -- just don't hit your damned thumb, will ya?

Note: avoiding the "NULL" problem can be dealt with via the technique presented in How to Handle Missing Information without Using Nulls (pdf) by Hugh Darwen.

Ow ow ow

sigzero on 2005-12-10T22:29:53

My brain is hurting!

Okay, not really. But this has been interesting reading.

Re:Ow ow ow

Ovid on 2005-12-10T23:53:12

What's really frustrating is that much of the problems associated with nulls would go away if database vendors would start supplying DBMS which enforce relational data (SQL doesn't). First, we would need something called "distributed foreign keys" -- mentioned in the PDF I link to. Next, databases would have to be optimized to handle 6NF, something which is rarely seen, and 1NF, something which is not well-understood. Further, queries would need to return relations. In other words, they would need to returns sets of tuples, not bags of rows. Much of the work that's being done to get around limitations with SQL appears to be applying bandages to crippled systems rather than examing the root cause -- ignoring the relational model.

The book is making my head hurt, too. One of the things which shocked me is realizing that the "object-relational impedance mismatch" is a myth. It's actually the "object-dbms impedance mismatch". If a DBMS properly supported the relational algebra, object databases would be natural instead of the kludgy, ad hoc affairs we have today. You really should get that book if you're interested in this.

Re:Ow ow ow

thepler on 2007-05-01T21:34:05

I can't believe I'm only now seeing this post...
Since reading Database In Depth, I've also realized that the object-relational impedance mismatch is a myth. It is a persistent myth, but a myth none the less.
Here's the quote from that book that I always go back to:
All of which goes a long way, incidentally, toward explaining why a true "object/relational" system would be nothing more nor less than a true relational system—which is to say, a system that supports the relational model, with all that such support entails. After all, the whole point of an "object/relational" system is precisely that we can have attribute values in relations that are of arbitrary complexity. Perhaps a better way to say it is this: a proper object/relational system is just a relational system with proper type support—which just means it's a proper relational system, no more and no less.

I take this as meaning that a proper type system is a prerequisite for building a proper relational system.

No Three-Value Logic, Please!

Theory on 2005-12-12T04:17:02

Ovid,

First of all, I don't understand how your example, perl -le 'print "yes" if (undef) < 4, is an example of three-value logic. The parentheses create a list, and a list in scalar context evaluates to the number of items it has, not the value of the first item.

This works well because, unless I'm missing something, there is no three-value logic in Perl. Perl's undef is not the same as SQL's NULL; undef always evaluates to false! IOW, it is a value, just an undefined one.

The easiest way for me to illustrate is to borrow Date's truth tables (T = TRUE, F = FALSE, U = UNKNOWN (NULL):

p | NOT p p q | p AND q p q | p OR q ====+======= =====+========= =====+======== T | F T T | T T T | T U | U T U | U T U | T F | T T F | F T F | T U T | U U T | T U U | U U U | U U F | F U F | U F T | F F T | T F U | F F U | U F F | F F F | F

Anyone who is familiar with SQL NULL will recognize this, as well as the headaches it causes—as demonstrated in the examples you cite. But Perl has no NULL. undef is not unknown, but undefined, and Perl's idea of truth mandates that undef always be false. Thus, Perl's truth tables look like this (T = TRUE, F = FALSE, U = UNDEF:

p | NOT p p q | p AND q p q | p OR q ====+======= =====+========= =====+======== T | F T T | T T T | T U | F T U | F T U | T F | T T F | F T F | T U T | F U T | T U U | F U U | F U F | F U F | F F T | F F T | T F U | F F U | F F F | F F F | F

And really, because undef is false, it's redundant to even have it on the chart. This is because Perl uses two-valued logic, like we tend to use in the real world:

p | NOT p p q | p AND q p q | p OR q ====+======= =====+========= =====+======== T | F T T | T T T | T F | T T F | F T F | T F T | F F T | T F F | F F F | F

This greatly simplifies things. To get back to Date's example, if DBMSs used Perl's undef's notion of truth for NULLs, you would get the proper results back in your query.

—Theory

Right, we need multivalued logics

n1vux on 2005-12-12T18:21:12
A wise man (was it Perlis or Bentley or was he quoting?) once said the only sensible values for constants are 0 and 1, all else are parameters that may indeed change with requirements. (Two:=1+1 was allowed as a special case for buffer-swapping, one in use and one in prep. Perlis said something to the effect of any supposed constant is someone else's variable.)
So Three-valued is probably a bad basis for a logic. Date's point about Nulls are there are many more one out-of-band value possible, and you rarely need only one of them. IEEE Floating Point has more than one out-of-band value (INF, -INF, NAN, does it have an underflow, I forget...).
The proper semantics of combining the OoB value(s) with other values and OoBV's will depend on the reason for, or semantics of, the OoBV's needed.
But this does NOT mean the 2-valued logics will always suffice.