DISTINCTly UnSETtling

Ovid on 2006-01-17T04:13:05

In a recent Perlmonks thread, I managed to tick a lot of programmers off by pointing out some problems with SQL ignoring set theory. I showed how what superficially appear to be a bunch of logically equivalent queries return different results. However, they all return the correct results if the "DISTINCT" keyword is used.

Unfortunately I used an unnormalized schema as an example in order to keep things simple. Many programmers focused on this. When I write things "off the cuff", I have an annoying tendency to think that providing a simplistic example will make the problem clearer. This is often true, but people frequently focus an obvious problem and ignore the larger ones. I sometimes do this myself. On the plus side, if I never posted anything controversial, that would suggest that either I'm following the herd or I'm not thinking enough, so on with the controversy!

The problem is that SQL returns bags of data instead of sets. Rarely do we ever want bags. Some argue that we really are getting sets instead of bags because "under the hood" the query engine is fetching off all of the data and merely masking out the data we don't want. The problem with this logic is that most people merely want to use SQL, not worry about the fiddly bits inside. Do you really want to write more garbage collection code in C? I don't want to have to stop at every query and wonder about whether or not I need to throw a "DISTINCT" in there. So the counter-example I provided was this a fictional story about an accountant who should be fired.

So a salesperson walks up to an accountant and says "Bob, for our customers with excellent credit ratings, what cities do they live in? I need to plan flights through those cities."

"What do you mean by 'excellent credit rating', Alice?"

"For the sake of argument, let's include everyone with a credit rating greater than 700."

"OK, let's see ... there's London. Oh, and there's London. And Paris. And London. London again, Athens, London, Paris, Moscow, London and Paris."

"That's a hell of a flight."

Needless to say, you'd be pretty irritated if your accountant did that. However, that's the equivalent of the following SQL:

SELECT city
FROM   city, customer
WHERE  city.id = customer.city_id
  AND  customer.credit_rating > 700

To make that do what you probably want, you need a "SELECT DISTINCT" in there. That's what SQL should do by default, with an optional "NODISTINCT" or something similar in there.

Ben Tilly offered an interesting rebuttal. He mentioned how he was once a junior programmer bitten by Microsoft Access's default behavior of returning distinct results. He was summing receipts, but Access, having DISTINCT on by default, was discarding duplicate values, not duplicate results. That's silly. It's like saying all guys named "Bob" are the same guy. The Microsoft programmers had the right idea and the wrong implementation. Further, since so many Access databases I've worked on have been poorly designed, I suspect this was a common problem.

For this to work properly, DISTINCT should discard duplicate elements from the same tuple. If you truly have duplicates from different tuples, you have a denormalized database and you need to either normalize it or adjust your query to take this into account. (For example, a customer table might list "CITY" instead of "CITY_ID").

The argument I hear against this is an all too common one. Every piece of data should theoretically be tagged with the data value and the tuple of origin (and also the data type, but the reasons behind this are beyond the scope of this post). This means that queries would be more memory intensive and thus slower.

Does this sound familiar? We usually call this "premature optimization". Database vendors have focused so long on making things fast and worried about making things correct as an afterthought. The original relational theory proposed by Codd and evangelized by Date did not have this problem, but SQL won the query language wars and we seem to be stuck with it.

Can anyone provide a counter-argument? What I mean, specifically, can you argue that the default behavior should be to return true duplicates? Do we really want bags more often than sets? Obviously, the following query might return apparent duplicates if we had a properly implemented "DISTINCT" on by default:

SELECT total FROM order;

However, that apparent duplication would only be there because you really can have multiple orders totalling to $100.00 (yes, that's probably an unnormalized example. Imagine that I referenced an ORDER_ITEM table and summed the line items.)

Frankly, I can't imagine why anyone would want bags of results other than the obvious fact that most databases frequently fail to optimize queries with "DISTINCT" in them. But that's telling me I'm supposed to worry about the implementation details again. I just want to focus getting the results I want, not how I'm getting them. SQL is a declarative language and those languages' greatest strength is that they are result oriented.


DISTINCT or ORDER BY ?

n1vux on 2006-01-17T15:45:35

A wise man said any SQL query without an ORDER BY clause is a stealth bug waiting to emerge when the implicit order dependency explodes into view.

In the customer-visit itinerary, ORDER BY may actually be a better solution than DISTINCT -- the repetition value in the BAG will be needed to help plan how long to lay over in each city, as it measures how many visits are required there. GROUP BY (City) COUNT(Customer) might be better, and would get you the equivalent of DISTINCT.

Re:DISTINCT or ORDER BY ?

autarch on 2006-01-17T16:50:05

If you need to know counts that's what "GROUP BY" is for, so we can do:
SELECT city, COUNT(customer_id) as customer_count
  FROM Customer
GROUP BY city
ORDER BY city
This will have no dupes and gives you the same numerical information as a dupe-containing result, but in a nice sane way, with no need to count dupes programatically in the program that executes that query.

Re:DISTINCT or ORDER BY ?

Ovid on 2006-01-17T17:08:24

You've come up with a rather artificial reason why one might, in this case, want a bag instead of a set, but your example is arbitrary. Who knows why Alice is in a particular city? Maybe she has some weird contractual agreement that she needs to at least be in those cities once a year? The point is the same: if I ask for information, why should I want my query, by default, to repeat itself rather than just give me the information I ask for?

Don't focus on the specifics of this particular problem because it's artificial. However, if you can prove that the specifics of this problem (Alice needs to know how long she should stay) generalize to the majority of queries, than I'll stand corrected. This could be some weird parallel to the Einstein/Bohr debates (on a far lower intellectual level) where Einstein thought Bohr was merely guilty of picking apart the particular examples he used but in the end, Bohr turned out to be right. Still, there would have to be some generalizable preference for repetition for this to work.

DISTINCTly DisORDERed

n1vux on 2006-01-17T19:12:52

I think we're in violent agreement that naive SQL is bad, and we're just arguing over which form of naivety is worst!

This is not just a artificial quibble on the artificial example. I gave a different, more general rule that explains why the artificial example is wrong at an even deeper level. I claim that in general both general rules need to be addressed together.

I agree that SQL having been designed abstractly by mathematicians (with whom I happily self-identify) so that it encourages thinking SETs yet was implemented so it often returns BAGs is a problem of abstraction. Having a SHOW DUPS keyword instead of a DISTINCT keyword, reversing the defaults, would be more practical yet theoretically sound. Furthermore, both SET and BAG are the wrong abstractions: a ORDERED SET is the usually right return type for all queries. And when a BAG is what is desired, it should be a Multiplicity-encoded Ordered Multiset, not an unordered Bag with repetition. (Obviously(?), a totally order is to be preferred over a partial order -- in SQL terms, that means using a long enough composite key.)

To repeat, both SET and BAG are unordered, pure mathematical concepts (dear to my heart), but in practical programming, we need order. Any query returning a true (unordered) SET with DISTINCT but no ORDER BY is just as broken if not more broken as one returning a true (unordered) BAG (without DISTINCT) which you've objected to. Just as BAGginess is fixed with DISTINCT (or COUNT .. GROUP BY ..), DisORDER is cured by ORDER (or GROUP BY).

If in the artificial example the query weilder had used ORDER BY [as all practical SQL queries should be written whether desiring BAG or SET, to avoid getting bitten by hidden order dependencies], the artificially weird result would not have looked so weird, and the multiplicity of cities in the artificial example would have had emergent semantics, and the accountant wouldn't need firing (emotional statement in original artificial example). And the desirability of applying COUNT ... GROUP BY would have suggested itself.

There is more than one way to convert a BAG to a SET with COUNT(... )FROM with ORDER BY ... GROUP BY ... , which is the MultiSet impelmentation of BAG, where the multiplicity is an attribute of the element rather than manifested by repetition, is a semantics preserving subsitute for DISTINCT.

I agree with your basic thesis that all results should be converted to SET form. However, I would prefer COUNT ... GROUP rather than DISTINCT and further require the result be ORDERED.