MySQL Madness (ma, get my gun!)

Ovid on 2007-01-22T18:40:32

So I was debugging a MySQL problem and quickly hacked out the following query. It took a long time to run before finally giving me an 'OUT OF MEMORY' error:

SELECT distinct(ip)
FROM   foo
WHERE  ip NOT IN (
  SELECT distinct(ip)
  FROM   bar
)

See a problem with that query? The problem I have is one you can't see. A coworker pointed out a typo. There's a bar.id column, but not a bar.ip column. MySQL didn't catch that. It was apparently trying to run the outer query, collect all results and then parsing and filtering on the subquery.


Oracle

djberg96 on 2007-01-22T19:56:56

The folks at Oracle realized that antijoins were often performance bottlenecks and provided hints to specifically to speed them up, e.g. HASH_AJ, and specifying the optimizer type (RULE vs COST) can make a huge difference.

I'm curious what the point of the 'distinct' in the subquery is, btw. I think that would make the explain plan worse in Oracle. But I digress.

For stuff like this in Oracle, I generally opt for the simple minus query. Reasonable performance and easy to understand.

How do the following compare in MySQL?

  • Anti-join with NOT IN
  • Anti-join with NOT EXISTS
  • Anti-join using MINUS
  • Anti-join using outer join.

Re:Oracle

Ovid on 2007-01-22T20:17:10

I couldn't really say how those others compare. I'm not a performance guru. I've traditionally gone more for the normalization aspect and not worried as much about the implementation, trusting the DBAs to be good at their jobs. Now I'm finding I need to focus on the implementation more and more.

As for the 'distinct' clause in the subquery, I really thought there might be duplicate ip addresses in that table and I had no idea if MySQL would optimize those away or force every value to be compared multiple times. I should look at the 'NOT IN' behavior more to see how it handles bags instead of sets.