GAAA!!!
Yesterday I got a "funny" bug report: If the user 'märrie' logs in, she gets the account of user 'marrie'. As the system in question gets it's data from a combination of different sources (SAP, a semi-external single sign on system, some local data) I expected some hellish encoding problems (which would be strange, because we switched the whole system over to utf8 two years ago).
So I start my bug-hunt at the lowest level and connect to the MySQL DB. I do a quick "select username from users where username = 'märrie"' (just for basic sanity checking) and get back:
marrie
märrie
WTF??
To cut a long bug-hunt short: MySQL considers 'ä' and 'a' to be the same character (at least when using default utf8 settings) - not only for sorting, where this makes a little bit sense, but also for selecting, which is totally pointless.
The solution: use a collate of 'utf8_bin':
mysql> SELECT 'ä' = 'a' COLLATE utf8_general_ci;
+------------------------------------+
| 'ä' = 'a' COLLATE utf8_general_ci |
+------------------------------------+
| 1 |
+------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT 'ä' = 'a' COLLATE utf8_bin;
+-----------------------------+
| 'ä' = 'a' COLLATE utf8_bin |
+-----------------------------+
| 0 |
+-----------------------------+
1 row in set (0.00 sec)
/me hates MySQL!
$ cd projects/MahDayJobb/
$ cat sql/schemapatch-00039.sql
ALTER DATABASE CHARACTER SET = 'utf8' COLLATE = 'utf8_unicode_ci';