how large sites scale their databases

perrin on 2006-04-28T19:06:46

I've been following Tim O'Reilly's series on how large sites scale their databases, and I've also been reading this article about topix.net. They seem to fall into two camps:

  1. Using flat-files, typically accompanied by lots of attitude about how much smarter they are for not using an RDBMS and frequent invocations of Google.
  2. Using MySQL, with replication to scale reads, and data partitioning to scale writes (users A-H on this cluster, I-P on that one...)
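The partitioning scheme in camp 2 can be sketched in a few lines. This is a hypothetical illustration of range-based shard routing, not any site's actual code; the shard boundaries and host names are made up.

```python
# Hypothetical user-range partitioning: users A-H on one cluster,
# I-P on another, Q-Z on a third.  Each entry is (low, high, host).
SHARDS = [("A", "H", "db1"), ("I", "P", "db2"), ("Q", "Z", "db3")]

def shard_for(username):
    """Return the database host responsible for this username."""
    first = username[0].upper()
    for lo, hi, host in SHARDS:
        if lo <= first <= hi:
            return host
    raise ValueError("no shard configured for %r" % username)

print(shard_for("alice"))    # db1
print(shard_for("mallory"))  # db2
```

The application looks up the right cluster before every query, which is why resharding (moving the boundaries) is the painful part of this approach.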

Amazingly, Craigslist uses MyISAM tables. I guess it's nearly all reads, but I just didn't think the table-level locking used by MyISAM would hold up to traffic like that. A primary reason I use InnoDB is its row-level locking and multi-version concurrency control (MVCC), which means that readers don't block writers.
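The locking difference can be modeled with plain locks. This is a toy sketch, not real MySQL internals: table-level locking (MyISAM-style) makes a writer on one row block a reader of a different row, while per-row locking (InnoDB-style) lets them proceed concurrently.

```python
import threading

# Toy model of the two locking granularities.
table_lock = threading.Lock()
row_locks = {1: threading.Lock(), 2: threading.Lock()}

# MyISAM-style: a writer updating row 1 holds the whole table,
# so a reader wanting row 2 cannot get in.
table_lock.acquire()
reader_got_in = table_lock.acquire(blocking=False)
print(reader_got_in)        # False: reader blocked by the writer
table_lock.release()

# InnoDB-style: the writer locks only row 1, so a reader of
# row 2 proceeds immediately.  (With MVCC, even a reader of
# row 1 would see the old row version rather than blocking.)
row_locks[1].acquire()
reader_got_in_row = row_locks[2].acquire(blocking=False)
print(reader_got_in_row)    # True: different rows don't contend
```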

Two interesting things here are that none of them use PostgreSQL, despite a few of them being fairly new, and that none of them have tried commercial offerings for database clustering, like the stuff IBM and Oracle sell.

In fact, I've never met anyone who had tried the Oracle or DB2 clustering. Even the people who have the money seem to avoid it. Can anyone offer any personal anecdotes about it? Does it work at all?


Clustering

mrjoltcola on 2006-05-01T14:06:57

In fact, I've never met anyone who had tried the Oracle or DB2 clustering. Even the people who have the money seem to avoid it. Can anyone offer any personal anecdotes about it? Does it work at all?

Sure it works. The reason you don't hear a lot of anecdotes is that most of the people who can afford it aren't out telling the public how they do it.

Oracle "clustering" is probably used way more in internal, critical infrastructure, than in external, disposable content servers.

One of Oracle's advantages is that many of the high-availability features are built in, or relatively seamless, whereas products like Postgres require a lot of work to implement the HA they supposedly support.

Pg for websites

Phred on 2006-05-02T08:53:23

Two interesting things here are that none of them use PostgreSQL, despite a few of them being fairly new

I agree here: PostgreSQL is not popular for scaling large websites. Its strengths are not well suited to that task. It is not nearly as fast as MySQL on reads, and is not as friendly as MySQL for web developers to set up. It is the hidden P in LAMP (although my version of LAMP is Linux Apache Mod_perl Postgresql).

PostgreSQL is best suited for applications which require a higher than 10 to 1 ratio of reads to writes. InnoDB provides a form of MVCC and locking, but imho the locking in PostgreSQL is much more robust and performant (and I'm not here to engage in the details of the differences between the two). I've often seen MySQL and PostgreSQL as potentially complementary systems, and using dbi-link they can function as such - one for reads and one for writes.
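The "one for reads, one for writes" split boils down to routing each statement by its verb. A minimal sketch, assuming hypothetical connection names (dbi-link itself works differently, linking tables across databases inside Postgres):

```python
# Hypothetical read/write splitting: SELECTs go to the read-optimized
# server, everything else to the write-optimized one.  The connection
# names here are placeholders, not real handles.
def pick_backend(sql, read_conn="mysql-read", write_conn="pg-write"):
    """Route a SQL statement to the appropriate backend by its verb."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    return read_conn if verb == "SELECT" else write_conn

print(pick_backend("SELECT * FROM posts"))          # mysql-read
print(pick_backend("INSERT INTO posts VALUES (1)")) # pg-write
```

Real routers also have to pin a session to the write master after it writes, so the user sees their own changes before replication catches up.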

I read through the O'Reilly articles and it seems like most of the situations are pushing read-heavy scenarios. That's the point at which you need to implement the hybrid solution that not a lot of people talk about. Craigslist gets by on MyISAM through a delayed-insert approach (post to Craigslist and you will see for yourself), and I'm guessing that the other scenarios have a delayed-update scheme as well. That's the hard part - real-time updates. It's something still relegated to the domain of web applications. Websites can still get by with delayed updates as the content is mostly static, and in any large deployment you don't want to depend on the database for the availability of your content.
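The delayed-insert idea above can be sketched as a simple write-behind queue. This is a made-up illustration, not Craigslist's code: submissions are accepted immediately but only become visible when the queue is flushed in a batch, trading freshness for reduced write contention.

```python
from collections import deque

# Hypothetical delayed-write queue: posts are acknowledged right away
# but applied to the database in batches (the "delayed insert" idea).
class DelayedWriter:
    def __init__(self):
        self.pending = deque()   # accepted but not yet live
        self.committed = []      # stand-in for the real table

    def submit(self, row):
        """Accept a post immediately; it is not yet visible."""
        self.pending.append(row)

    def flush(self):
        """Apply all pending writes in one batch."""
        while self.pending:
            self.committed.append(self.pending.popleft())

w = DelayedWriter()
w.submit("post 1")
w.submit("post 2")
print(len(w.committed))  # 0 - nothing visible until the flush
w.flush()
print(w.committed)       # ['post 1', 'post 2']
```

This is exactly why a fresh Craigslist post doesn't appear instantly: the site's read path never waits on the write path.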

Fun

acme on 2006-05-02T09:17:23

I did find it very interesting to have technical reports of how these popular companies try to scale. The flat files are slightly more scary than MySQL, which is a known commodity at least. I am surprised that replication and backups are still so hard to pull off - it should be all plug and play already. I thought this was the future!