Find the bug (sql)

Ovid on 2009-12-07T07:43:40

Update: You can ignore the order_date below. It's a red herring and I probably should have left it out, but I had liked the fact that by putting it in the query, I added more complexity, thus making the real bug more difficult to spot.

Assume you're a diligent programmer. You've designed your database carefully. Foreign constraints are correct, you have no null columns and you've kept a nice, simple design. Now your boss wants you to provide a list of all customers who've spent less than £15 on your Web site because you want to offer them a special promotion. Here's the SQL you've written:

SELECT    first_name, 
          last_name, 
          order_date, 
          SUM(price) AS total
FROM      customer
LEFT JOIN orders     ON customer.id = orders.customer_id
LEFT JOIN order_item ON orders.id   = order_item.order_id
GROUP BY  first_name, last_name, order_date
HAVING    total < 15
ORDER     BY order_date ASC;

The two left joins are there because a customer may never have placed an order before. Heck, they may have started an order but not added any items to it. The 'having' statement is required because you generally can't use aggregates in where clauses.

You run the SQL and hand-check the results very carefully. Choosing a random sampling of customers returned, you verify that none of them have spent more than £15 on your site. Nonetheless, you have a bug. What is it? What are the implications of the bug?


Guessing...

jarich on 2009-12-07T08:52:31

I think your boss wants you to give a special to your customers who have given the business less than 15 pound ever. Yet you're grouping by order_date. Thus if your customer has purchased something for 10 pound last week and 10 pounds yesterday I would expect them to show up twice in the output:

John, Smith, 2009-11-30, 10
John, Smith, 2009-12-06, 10

Re:Guessing...

link on 2009-12-07T09:30:08

I think if you want some information about customers it is probably a bad idea not grouping by customer id. Depending on your site you may or may not have 2 customers with the same name but you will almost always have the occasional repeat customer.

Actually , looking at the code I would assume I misunderstood the requirements as I don't see why order_date is there at all.

Re:Guessing...

Ovid on 2009-12-07T09:48:59

I should have removed order_date. It was present in the first version of the SQL I wrote.

sum(null) is null, not zero

mscolly on 2009-12-07T09:42:49

"price" I presume is from the order_items table. If there are no orders and hence no order_items, the "price" field is a null (from the outer joins). sum(null) is null, and null is not less than 15, the comparison is null. This means you will not see the customers with no (completed) orders.

Re:sum(null) is null, not zero

mscolly on 2009-12-07T09:48:03

Even worse, if a client has completed orders, and at least one incomplete order, this gives a null price, and the sum will be null again.

Re:sum(null) is null, not zero

Ovid on 2009-12-07T10:07:08

What do you mean by "incomplete order"? I just added an order for a customer who already had orders, but did not add order items for it. The query results were the same, so I think I may have misunderstood you.

Re:sum(null) is null, not zero

mscolly on 2009-12-07T13:24:49

An "orders" record without corresponding "order_items" record(s) would be an "incomplete" order.

Regarding your example, the result depends on whether there were other "complete" orders for the customer or not, and how your database treats nulls. Since a null means "totally unknown", anything plus null results in null, and comparing null with 15 is also null (neither true nor false). You should be using sum(coalesce(price,0)) for your definition of the field total.

try fixing your query

jsievert on 2009-12-07T21:04:15

SELECT first_name,
                    last_name,
                    order_date,
                    SUM(COALESCE(price, 0)) AS total
FROM customer
LEFT JOIN orders ON customer.id = orders.customer_id
LEFT JOIN order_item ON orders.id = order_item.order_id
GROUP BY first_name, last_name, order_date
HAVING total 15
ORDER BY order_date ASC

there, fixed it for you. any math on a NULL needs to have the ability to deal with a non-NULL value. there is a standard SQL function just for this very issue.

group by customer ID

kbo on 2009-12-08T00:37:05

I believe you should be grouping by customer_id rather than first/last name... otherwise you could be combining results for two John Smiths that order on the same date.

Re:group by customer ID

Ovid on 2009-12-08T06:52:15

Ouch. Two bugs. Thanks :)