You don't need to scale

autarch on 2008-10-11T16:01:00

Programmers like to talk about scaling and performance. They talk about how they made things faster, how some app somewhere is hosted on some large number of machines, how they can parallelize some task, and so on. They particularly like to talk about techniques used by monster sites like Yahoo, Twitter, Flickr, etc. Things like federation, sharding, and so on come up regularly, along with talk of MogileFS, memcached, and job queues.

This is lot like gun collectors talking about the relative penetration and stopping power of their guns. It's fun for them, and there's some dick-wagging involved, but it doesn't come into practice all that much.

Most programmers are working on projects where scaling and speed just aren't all that important. It's probably a webapp with a database backend, and they're never going to hit the point where any "standard' component becomes an insoluble bottleneck. As long as the app responds "fast enough", it's fine. You'll never need to handle thousands of request per minute.

The thing that developers usually like to finger as the scaling problem is the database, but fixing this is simple.

If the database is too slow, you throw some more hardware at it. Do some profiling and pick a combination of more CPU cores, more memory, and faster disks. Until you have to have more than 8 CPUs, 16GB RAM, and a RAID5 (6? 10?) array of 15,000 RPM disks, your only database scaling decision will be "what new system should I move my DBMS to". If you have enough money, you can just buy that thing up front.

Even before you get to the hardware limit, you can do intelligent things like profiling and caching the results of just a few queries and often get a massive win.

If your app is using too much CPU on one machine, you just throw some more app servers at it and use some sort of simple load balancing system. Only the most brain-short-sighted or clueless developers build apps that can't scale beyond a single app server (I'm looking at you, you know who).

All three of these strategies are well-known and quite simple, and thus are no fun, because they earn no bragging rights. However, most apps will never need more than this. A simple combination of hardware upgrades, simple horizontal app server scaling, and profiling and caching is enough.

This comes back to people fretting about the cost of using things like DateTime or Moose.

I'll be the first to admit that DateTime is the slowest date module on CPAN. It's also the most useful and correct. Unless you're making thousands of objects with it in a single request, please stop telling me it's slow. If you are making thousands of objects, patches are welcome!

But really, outside your delusions of application grandeur, does it really matter? Are you really going to be getting millions of requests per day? Or is it more like a few thousand?

There's a whole lot of sites and webapps that only need to support a couple hundred or thousand users. You're probably working on one of them ;)

Cross-posted from House Absolute(ly Pointless) - permalink.


but speed is important...

gabor on 2008-10-11T21:37:14

...I hear often when I am telling people that is not the most important thing.
as if those two sentences are contradicting.

IMHO These people usually come from some C/C++ background or have been working on code where the someone (maybe they themselves) wrote an algorithm that was O(2^n) instead of O(n^2) or even O(n).

In the distant past I too had once a mistake writing an O(2^n) algorithm in LISP that blew up somewhere around n=3 or so. I had to go back and fix it. Since then I am thinking more about such issues but still I'd rather have something working first and then profiling it than constantly be nervous about the speed issues.

In an unrelated note, one of the biggest improvements I usually get from databases is creating the right indexes. It is surprising how many times developers forget that.

Re:but speed is important...

kaare on 2008-10-13T12:13:19

The disease is called 'Premature optimization' and is the root of all evil according to Donald Knuth.

Regarding indexes, you can also have too many. I once had a task to find out why a page in RT took 6 seconds to load.

It turned out there were two almost identical indexes, one with two columns, the other with three, an extra column added at the end.

Removing one of the indexes cut the load time of the page to approximately ½ a second. Now I don't know excatly why that was a problem for the DBS, but hey, it was Mysql!