OSCON Day 3: Scalable Computing with MapReduce

acme on 2005-08-04T00:11:57

After a nice break with peanut butter brownies, I attended Scalable Computing with MapReduce by Doug Cutting. Doug explained the Nutch project (which has a nice orange logo) and where it started failing to scale (~100M pages). Nutch wanted a distributed file system, waited for a bit, and then handily Google published a paper on the Google File System, which led to the Nutch Distributed File System, maintained as a package in Nutch. Of course they then needed a distributed computing platform, and then Google announced MapReduce, leading to Nutch MapReduce. He went into a little bit of detail on how they work and noted that MapReduce isn't ideal for all the computations Nutch wanted to do, but it's not far off, which is why they added some minor extensions like an async map and partitioning by value. Then he dug into a few nice code examples similar to those in the Google paper and revealed that he's playing with a 40-node Capricorn rack (a neat spin-off from the Internet Archive) and hopes to demonstrate a 1B page index soon. An interesting talk.
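
For anyone who hasn't read the Google paper, the basic model can be sketched roughly as below. This is a toy, single-process word count in Python, not code from the talk or from Nutch. The names map_words, reduce_counts and run_mapreduce and the sample pages are made up for illustration, and the partitioning shown is the conventional hash-of-key kind rather than the partitioning-by-value extension Doug mentioned.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer, num_partitions=2):
    """Single-process stand-in for a MapReduce run (no distribution, no retries)."""
    # Shuffle: route each intermediate key to a partition and group its values.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            index = hash(out_key) % num_partitions  # partition by key hash
            partitions[index][out_key].append(out_value)

    # Reduce: each partition is handled independently, as a reduce task would be.
    results = {}
    for partition in partitions:
        for out_key, values in partition.items():
            result_key, result_value = reducer(out_key, values)
            results[result_key] = result_value
    return results


# Hypothetical input: (page id, page text) pairs.
pages = [
    ("page1", "the quick brown fox"),
    ("page2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(pages, map_words, reduce_counts)
```

In a real system the map and reduce calls run on many machines and the framework handles the shuffle between them; here everything happens in one loop so the data flow is easy to follow.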

Best quote: "I'm afraid somebody is going to criticise me someday, but I'm open to that."