I've been following Tim O'Reilly's series on how large sites scale their databases, along with this article about topix.net. They seem to fall into two camps:
Amazingly, Craigslist uses MyISAM tables. I guess it's nearly all reads, but I just didn't think the table-level locking used for MyISAM would hold up to traffic like that. A primary reason I use InnoDB is its row-level locking and multi-version concurrency control, which means that readers don't block writers.
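To make the "readers don't block writers" point concrete, here is a toy sketch of the multi-version idea in Python. It is not how InnoDB is implemented; the `VersionedStore` class and its method names are invented for illustration. The only point is that a writer appends a new version instead of overwriting, so a reader that started earlier keeps seeing its own snapshot without waiting on a lock.

```python
class VersionedStore:
    """Toy multi-version store (hypothetical): writers append, readers pin a snapshot."""

    def __init__(self):
        self.versions = {}    # key -> list of (commit_id, value), oldest first
        self.last_commit = 0

    def write(self, key, value):
        # A write never overwrites in place; it appends a newer version.
        self.last_commit += 1
        self.versions.setdefault(key, []).append((self.last_commit, value))

    def snapshot(self):
        # A reader remembers the commit id that was current when it started.
        return self.last_commit

    def read(self, key, snapshot_id):
        # Return the newest version committed at or before the snapshot.
        for commit_id, value in reversed(self.versions.get(key, [])):
            if commit_id <= snapshot_id:
                return value
        return None


store = VersionedStore()
store.write("listing:1", "draft")
snap = store.snapshot()                  # a reader starts here
store.write("listing:1", "published")    # a writer proceeds without waiting

old_view = store.read("listing:1", snap)              # reader still sees its snapshot
new_view = store.read("listing:1", store.snapshot())  # a new reader sees the update
```

With a single table-level lock, by contrast, the second write would have to wait until every reader of the table had finished.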
Two interesting things here are that none of them use PostgreSQL, despite a few of them being fairly new, and that none of them have tried commercial offerings for database clustering, like the stuff IBM and Oracle sell.
In fact, I've never met anyone who had tried the Oracle or DB2 clustering. Even the people who have the money seem to avoid it. Can anyone offer any personal anecdotes about it? Does it work at all?
Sure it works. The reason you don't hear a lot of anecdotes is that most of the people who can afford it aren't out telling the public how they do it.
Oracle "clustering" is probably used far more in internal, critical infrastructure than in external, disposable content servers.
One of Oracle's advantages is that a lot of the high-availability features are built in, or relatively seamless, whereas products like Postgres require a lot of work to implement the HA that they supposedly support.
Two interesting things here are that none of them use PostgreSQL, despite a few of them being fairly new
I agree here: PostgreSQL is not popular for scaling large websites. Its strengths are not well suited to that task. It is not nearly as fast as MySQL on reads, and is not as friendly as MySQL for web developers to set up. It is the hidden P in LAMP (although my version of LAMP is Linux Apache Mod_perl PostgreSQL).
PostgreSQL is best suited for applications which require a higher than 10 to 1 ratio of reads to writes. InnoDB provides a form of MVCC and locking, but imho the locking in PostgreSQL is much more robust and performant (and I'm not here to get into the details of the differences between the two). I've often seen MySQL and PostgreSQL as potentially complementary systems, and using dbi-link they can function as such: one for reads and one for writes.
I read through the O'Reilly articles and it seems like most of the situations involve read-heavy scenarios. That's the point at which you need to implement the hybrid solution that not a lot of people talk about. Craigslist gets by on MyISAM through a delayed insert approach (post to Craigslist and you will see for yourself), and I'm guessing that the other sites use delayed updates as well. That's the hard part: real-time updates. It's something still relegated to the domain of web applications. Websites can still get by with delayed updates as the content is mostly static, and in any large deployment you don't want to depend on the database for the availability of your content.
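For anyone who hasn't seen the delayed insert pattern, here is a minimal sketch in Python. It is an assumption about the general shape of the technique, not a description of Craigslist's actual code: the `DelayedInserter` class, the `postings` table, and the batch size are all made up, and `sqlite3` stands in for MySQL only so the example runs on its own. New posts are accepted into an in-memory buffer and written in one batch, so the table is locked for writing once per batch rather than once per post.

```python
import sqlite3
from collections import deque


class DelayedInserter:
    """Hypothetical write buffer: accept posts now, insert them later in a batch."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = deque()

    def post(self, title):
        # The post is accepted immediately but is not yet visible to readers.
        self.pending.append((title,))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write the whole batch in one transaction.
        rows = list(self.pending)
        self.pending.clear()
        with self.conn:
            self.conn.executemany("INSERT INTO postings (title) VALUES (?)", rows)
        return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, title TEXT)")

inserter = DelayedInserter(conn, batch_size=3)
inserter.post("bike for sale")
inserter.post("room for rent")
visible_before = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]

inserter.post("free couch")  # third post fills the batch and triggers a flush
visible_after = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
```

The gap between `visible_before` and `visible_after` is the delay you notice when you post: the write has been accepted, but readers don't see it until the batch lands. A real deployment would also flush on a timer and keep the buffer somewhere durable, which this sketch leaves out.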