I'm working on the next generation of the data warehouse described here by Sam Tregar. This time, I'm trying to keep it up-to-date in near real-time. Because of this, I have several new constraints on how I load data: the tables are InnoDB, they have to stay on-line and fully indexed while a load runs, and each load is a mix of new rows and updates to existing rows.

That rules out the classic bulk-loading trick. The ALTER TABLE DISABLE KEYS and ALTER TABLE ENABLE KEYS commands allow the indexes to be rebuilt in bulk after a load, but if we did this to an on-line table, anyone using it would suddenly have no indexes available. Also, InnoDB doesn't have the same bulk index creation optimization, although this is supposed to be coming soon.
So, I had a few ideas for ways to load the data and wanted to see which would give me the best results. I made a quick and dirty benchmark script and tried them out on a relatively small source table (~50K rows). I pre-loaded the destination table with 20K of those rows and then tested ways of copying the full data set in, meaning a combination of new rows and updates. Here are the results.
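If you want to try this at home, a minimal harness looks something like this. The connection details and the time_load helper are just illustrative, not the exact script I used:

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# illustrative connection; AutoCommit is off so commits are explicit
my $dbh = DBI->connect('dbi:mysql:database=test', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 0 });

# run one loading strategy and report how long it took
sub time_load {
    my ($name, $code) = @_;
    my $start = [gettimeofday];
    $code->();
    $dbh->commit;
    printf "%s: %.0f seconds\n", $name, tv_interval($start);
}

# each approach below gets wrapped up like:
# time_load('INSERT...SELECT' => sub { $dbh->do($insert_select_sql) });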
The fastest approach is an INSERT...SELECT with an ON DUPLICATE KEY UPDATE clause. That looks a bit like this:
INSERT INTO foo_test
SELECT * FROM foo
ON DUPLICATE KEY UPDATE
bar=VALUES(bar), baz=VALUES(baz),...
This was pretty fast, coming in at 29 seconds. Some people have trouble with the INSERT...SELECT because it takes a shared lock on the source table while it runs (much like a SELECT...LOCK IN SHARE MODE would). This is apparently fixed in MySQL 5.1 by using row-based replication. It's also not really an issue for us because we're doing this work on a replicated database, so the worst case is that replication falls behind a bit while the statement runs.
Although it won't work for us, I tried REPLACE as well, just to see how it compared. It was quite a bit slower, coming in at 54 seconds, or almost twice as long.
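For the record, the REPLACE version is just the obvious one-liner, using the same table names as before:

# REPLACE deletes and re-inserts any row whose key already exists
$dbh->do('REPLACE INTO foo_test SELECT * FROM foo');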
I considered trying a combination of INSERT IGNORE...SELECT and a bulk UPDATE (using a join), but figured this would do poorly if the SELECT had any real work in it, since it would be running twice.
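For anyone curious, that combination would have looked something like this, assuming id is the primary key and abbreviating the column list (in real life the plain SELECT * would be the expensive SELECT I was worried about running twice):

# step 1: insert only the rows that don't exist in the destination yet
$dbh->do('INSERT IGNORE INTO foo_test SELECT * FROM foo');

# step 2: bulk UPDATE the rows that already existed, using a join
$dbh->do(q{
    UPDATE foo_test, foo
       SET foo_test.bar = foo.bar,
           foo_test.baz = foo.baz
     WHERE foo_test.id = foo.id
});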
The most common workaround for people who have trouble with the INSERT...SELECT locking is to use a temporary file with SELECT INTO OUTFILE and LOAD DATA INFILE. I tried that next. The dump is really fast, taking only 1 second. Loading is complicated by the fact that you can't do updates with LOAD DATA INFILE, so I decided the best thing would be to load the data into a temporary table and then do an INSERT...SELECT from that.
I got that load to go very quickly by making my temp table a MyISAM one and running an ALTER TABLE DISABLE KEYS on it before loading. It loaded in 3 seconds. Then I did the same INSERT...SELECT from the temp table, which took the same 29 seconds (and I never built the indexes because I didn't need them). In total, the temp file only added 4 seconds, or about 14% overhead. This seems like a good solution for people who run into locking issues.
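Putting the pieces together, the temp-file path looks roughly like this. The file path and the foo_load staging table are names I made up for this sketch, and the final statement is the same INSERT...SELECT shown earlier:

# dump the source table to a tab-delimited file on the server
$dbh->do(q{SELECT * INTO OUTFILE '/tmp/foo.txt' FROM foo});

# build a MyISAM staging table with its keys disabled and bulk-load it
$dbh->do(q{CREATE TEMPORARY TABLE foo_load LIKE foo});
$dbh->do(q{ALTER TABLE foo_load ENGINE=MyISAM});
$dbh->do(q{ALTER TABLE foo_load DISABLE KEYS});
$dbh->do(q{LOAD DATA INFILE '/tmp/foo.txt' INTO TABLE foo_load});

# same INSERT...SELECT as before, now reading from the staging table
$dbh->do(q{
    INSERT INTO foo_test
    SELECT * FROM foo_load
    ON DUPLICATE KEY UPDATE bar = VALUES(bar), baz = VALUES(baz)
});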
Then I tested using two database handles, one to SELECT and one to INSERT/UPDATE, pumping the data from one to the other. I was pretty sure I couldn't beat the INSERT...SELECT with this approach, but we have some situations where we need to process every row in Perl during the load, such as geocoding addresses or applying logic that gets too ugly when done in SQL.
I played around with the frequency of commits and with MySQL's multi-row INSERT extension, and got this reasonably fast. It ran in 43 seconds, or a bit less than 50% slower than the INSERT...SELECT.
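Stripped down, the two-handle version looks something like this. $src_dbh and $dst_dbh are the two connections (AutoCommit off on the destination), and the batch size, commit interval, and column names are placeholders for the sketch:

my $rows_per_insert = 100;      # rows per multi-row INSERT (a knob I tuned)
my $rows_per_commit = 10_000;   # rows between commits (another knob)

my $sth = $src_dbh->prepare('SELECT id, bar, baz FROM foo',
                            { mysql_use_result => 1 });
$sth->execute;

my (@batch, $count);
while (my $row = $sth->fetchrow_arrayref) {
    # any per-row Perl work (geocoding, cleanup, etc.) happens here
    push @batch, [@$row];   # copy, since DBI reuses the fetched arrayref

    if (@batch >= $rows_per_insert) {
        insert_batch($dst_dbh, \@batch);
        @batch = ();
    }
    $dst_dbh->commit if ++$count % $rows_per_commit == 0;
}
insert_batch($dst_dbh, \@batch) if @batch;
$dst_dbh->commit;

# one multi-row INSERT...ON DUPLICATE KEY UPDATE per batch
sub insert_batch {
    my ($dbh, $batch) = @_;
    my $values = join ', ', ('(?, ?, ?)') x @$batch;
    $dbh->do("INSERT INTO foo_test (id, bar, baz) VALUES $values
              ON DUPLICATE KEY UPDATE bar = VALUES(bar), baz = VALUES(baz)",
             undef, map { @$_ } @$batch);
}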
Looking at how fast the LOAD DATA INFILE was, I tried a different approach for processing every row: doing a SELECT and writing the rows out with Text::CSV_XS. Then I loaded that file into a temp table with LOAD DATA INFILE and did an INSERT...SELECT from the temp table as before.
This was much better. Dumping the SELECT with Text::CSV_XS only took 3 seconds and, combined with the 4-second load, it only adds 24% overhead and gives me a chance to work on every row in Perl. It's also much simpler to code than the multi-row INSERT stuff.
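In sketch form, using the same two handles and the foo_load staging table from above (the file name and columns are made up, and if the client can't write somewhere the server can read, LOAD DATA LOCAL INFILE is the variation to reach for):

use Text::CSV_XS;

# stream the source rows, munge them in Perl, and write them out as CSV
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
open my $fh, '>', '/tmp/foo.csv' or die "can't write /tmp/foo.csv: $!";

my $sth = $src_dbh->prepare('SELECT id, bar, baz FROM foo',
                            { mysql_use_result => 1 });
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    # per-row Perl work (geocoding, etc.) goes here
    $csv->print($fh, $row);
}
close $fh or die "close failed: $!";

# bulk-load the file into the staging table, describing the CSV format
$dst_dbh->do(q{
    LOAD DATA INFILE '/tmp/foo.csv' INTO TABLE foo_load
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
});

# then the same INSERT...SELECT ON DUPLICATE KEY UPDATE from foo_load as before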
I should point out that working with these large data sets row by row requires the "mysql_use_result" option, which makes the client stream rows from the server as they're fetched instead of pulling the whole result set into memory at once. I activate it for specific statement handles like this:
my $sth = $dbh->prepare($sql, {mysql_use_result => 1});
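One caveat with mysql_use_result: the result set stays open on the server until every row has been fetched (or the handle is finished), and you can't run another query on that connection in the meantime, so keep the fetch loop moving:

$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    process_row($row);   # stand-in for the real per-row work
}
$sth->finish;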
If anyone has additional ideas, I'd be interested in hearing them. For now, I'm happy with the first approach for updates that can all be done in SQL and the last one for updates that require Perl processing on every row.