MySQL bulk loading techniques

perrin on 2007-05-14T22:35:43

I'm working on the next generation of the data warehouse described here by Sam Tregar. This time, I'm trying to keep it up-to-date in near real-time. Because of this, I have several new constraints on how I load data:

  • The tables are all InnoDB. This is necessary because there will be long-running queries on these tables while data is being loaded, and that requires the MVCC support in InnoDB. MyISAM tables would block updates while anyone is reading. Incidentally, contrary to what people often claim on Slashdot, converting these tables to InnoDB improved query performance quite a bit. Only the loading speed suffered compared to MyISAM tables.
  • We can't disable the indexes while loading. Disabling them is a huge help in the current system, where we load into an off-line database: the ALTER TABLE DISABLE KEYS and ALTER TABLE ENABLE KEYS commands allow the indexes to be rebuilt in bulk (see the sketch just after this list). If we did this to an on-line table, though, anyone using it would suddenly have no indexes available. Also, InnoDB doesn't have the same bulk index creation optimization, although this is supposed to be coming soon.
  • The incoming data will be a mix of new rows and updates to existing rows.
  • Some of the loads will be partial data for a table, i.e. not all data loads cover all columns in the target table.
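
For reference, the bulk index trick in the off-line system boils down to something like this sketch (the database, table, and file names here are made up):

use DBI;

# Off-line MyISAM load: turn off per-row index maintenance, bulk load,
# then rebuild all the indexes in one pass.
my $dbh = DBI->connect("DBI:mysql:database=warehouse", "user", "pass",
                       { RaiseError => 1 });
$dbh->do("ALTER TABLE foo DISABLE KEYS");
$dbh->do("LOAD DATA INFILE '/tmp/foo.txt' INTO TABLE foo");
$dbh->do("ALTER TABLE foo ENABLE KEYS");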

So, I had a few ideas of ways to load the data and wanted to see what would give me the best results. I made a quick and dirty benchmark script and tried them out on a relatively small table (~50K rows). I pre-loaded the target table with 20K of those rows and then tested ways of copying the full data set in, meaning a combination of new rows and updates to existing rows. Here are the results.

The fastest approach is an INSERT...SELECT with an ON DUPLICATE KEY UPDATE clause. That looks a bit like this:

INSERT INTO foo_test SELECT * FROM foo ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz),...

This was pretty fast, coming in at 29 seconds. Some people have trouble with the INSERT...SELECT because it takes shared locks on the rows it reads from the source table (as with SELECT...LOCK IN SHARE MODE) while it runs. This is apparently fixed in MySQL 5.1 by using row-based replication. It's also not really an issue for us because we're doing this work on a replicated database, so the worst case is that the replication falls behind a bit while the statement runs.

Although it won't work for us (REPLACE is effectively a DELETE plus a re-INSERT of the whole row, which doesn't mix with our partial loads), I tried REPLACE as well, just to see how it compared. It was quite a bit slower, coming in at 54 seconds, or almost twice as long.
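
For the record, the REPLACE version is just a one-liner against the same tables ($dbh being a connected DBI handle):

# REPLACE deletes any row whose key already exists and inserts the new one.
$dbh->do("REPLACE INTO foo_test SELECT * FROM foo");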

I considered trying a combination of INSERT IGNORE...SELECT and a bulk UPDATE (using a join), but figured this would do poorly if the SELECT had any real work in it, since it would be running twice.
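
For completeness, the combination I had in mind would look something like this untested sketch, assuming id is the shared primary key:

# New rows go in; rows whose keys already exist are silently skipped...
$dbh->do("INSERT IGNORE INTO foo_test SELECT * FROM foo");
# ...then existing rows are refreshed with a multi-table UPDATE, so the
# work of finding the source rows effectively happens a second time.
$dbh->do("UPDATE foo_test JOIN foo ON foo_test.id = foo.id
          SET foo_test.bar = foo.bar, foo_test.baz = foo.baz");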

The most common workaround for people who have trouble with the INSERT...SELECT locking is to use a temporary file with SELECT INTO OUTFILE and LOAD DATA INFILE. I tried that next. The dump is really fast, taking only 1 second. Loading is complicated by the fact that you can't do updates with LOAD DATA INFILE, so I decided the best thing would be to load the data into a temporary table and then do an INSERT...SELECT from that.
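
The dump side is a single statement, roughly like this (the file path is arbitrary, and the file is written on the database server, not the client):

# Dump the source table to a tab-delimited file in one shot.
$dbh->do("SELECT * FROM foo INTO OUTFILE '/tmp/foo_dump.txt'");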

I got that load to go very quickly by making my temp table a MyISAM one and running an ALTER TABLE DISABLE KEYS on it before loading. It loaded in 3 seconds. Then I did the same INSERT...SELECT from the temp table, which took the same 29 seconds (and I never built the indexes because I didn't need them). In total, the temp file approach only added 4 seconds, or 14% overhead. This seems like a good solution for people who run into locking issues.
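
Putting the pieces together, the reload side looks roughly like this sketch (the staging table and file names are placeholders, and CREATE TABLE ... LIKE is just one convenient way to build a matching MyISAM table):

# Staging table with the same structure as the source, but MyISAM and
# with index maintenance turned off for the load.
$dbh->do("CREATE TABLE foo_load LIKE foo");
$dbh->do("ALTER TABLE foo_load ENGINE=MyISAM");
$dbh->do("ALTER TABLE foo_load DISABLE KEYS");
$dbh->do("LOAD DATA INFILE '/tmp/foo_dump.txt' INTO TABLE foo_load");
# Same INSERT...SELECT as before, just reading from the staging table.
$dbh->do("INSERT INTO foo_test SELECT * FROM foo_load
          ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz)");
# No ENABLE KEYS needed -- the staging table just gets thrown away.
$dbh->do("DROP TABLE foo_load");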

Then I tested using two database handles, one to SELECT and one to INSERT/UPDATE, pumping the data from one to the other. I was pretty sure I couldn't beat the INSERT...SELECT with this approach, but we have some situations where we need to process every row in perl during the load, such as geocoding addresses or applying logic that gets too ugly when done in SQL.

I played around with the frequency of commits and with MySQL's multi-row INSERT extension, and got this reasonably fast. It ran in 43 seconds, or a bit less than 50% slower than the INSERT...SELECT.
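
Stripped way down, the two-handle version is shaped something like this sketch (connection details, column names, and batch sizes are placeholders, and the real per-row work is elided):

use strict;
use warnings;
use DBI;

# One handle reads from the replica, the other writes to the warehouse.
my $src = DBI->connect("DBI:mysql:database=warehouse;host=replica", "user", "pass",
                       { RaiseError => 1 });
my $dst = DBI->connect("DBI:mysql:database=warehouse", "user", "pass",
                       { RaiseError => 1, AutoCommit => 0 });

# Read rows from the server as we fetch them instead of buffering them all.
my $sth = $src->prepare("SELECT id, bar, baz FROM foo", { mysql_use_result => 1 });
$sth->execute;

my (@rows, $batches);
while (my @row = $sth->fetchrow_array) {
    # ... per-row work in perl (geocoding, etc.) goes here ...
    push @rows, '(' . join(',', map { $dst->quote($_) } @row) . ')';
    next unless @rows == 1000;            # multi-row INSERT in batches of 1000
    insert_batch($dst, \@rows);
    @rows = ();
    $dst->commit unless ++$batches % 10;  # commit every 10 batches
}
insert_batch($dst, \@rows) if @rows;
$dst->commit;

sub insert_batch {
    my ($dbh, $rows) = @_;
    $dbh->do('INSERT INTO foo_test (id, bar, baz) VALUES ' . join(',', @$rows)
             . ' ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz)');
}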

Looking at how fast the LOAD DATA INFILE was, I tried a different approach for processing every row, doing a SELECT and writing the rows out with Text::CSV_XS. Then I loaded that file into a temp table with LOAD DATA INFILE and did an INSERT...SELECT from the temp table as before.

This was much better. Dumping the SELECT with Text::CSV_XS only took 3 seconds and, combined with the 4-second load, it only adds 24% overhead and gives me a chance to work on every row in perl. It's also much simpler to code than the multi-row INSERT stuff.
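
Here's a trimmed-down sketch of that version (again, the file path, column names, and staging table are placeholders, and plain LOAD DATA INFILE assumes the script runs on the database host):

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

my $dbh = DBI->connect("DBI:mysql:database=warehouse", "user", "pass",
                       { RaiseError => 1 });
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

open my $fh, '>', '/tmp/foo_dump.csv' or die "can't write dump file: $!";

my $sth = $dbh->prepare("SELECT id, bar, baz FROM foo", { mysql_use_result => 1 });
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    # ... per-row work in perl goes here ...
    $csv->print($fh, $row);
}
close $fh or die "can't close dump file: $!";

# Bulk load the CSV into the MyISAM staging table, then merge as before.
$dbh->do(qq{LOAD DATA INFILE '/tmp/foo_dump.csv' INTO TABLE foo_load
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'});
$dbh->do("INSERT INTO foo_test SELECT * FROM foo_load
          ON DUPLICATE KEY UPDATE bar=VALUES(bar), baz=VALUES(baz)");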

I should point out that working with these large data sets row by row requires the "mysql_use_result" option, which makes the client read rows from the server as you fetch them instead of buffering the entire result set in memory first. I activate it for specific statement handles like this:

my $sth = $dbh->prepare($sql, {mysql_use_result => 1});

If anyone has additional ideas, I'd be interested in hearing them. For now, I'm happy with the first approach for updates that can all be done in SQL and the last one for updates that require perl processing on every row.


Great article for future reference

Ron Savage on 2007-05-17T00:48:19

$many x $thanx for the effort.