As I mentioned in my last post, by day I wrangle a large web application that occasionally verges on being too complex for mere humans to understand.
Curiously, because it is private and the income rate is high (we probably average something like $5 in gross turnover per page view), we don't have to deal with a lot of servers to deliver it, by internet standards anyway.
But the application is still far too big to profile easily by normal methods, and on production it would certainly be too heavy, even if we applied targeted profiling using something like Aspect::Library::NYTProf.
Between the web servers, transaction server, database, search engine and cache server, we are probably only dealing with 12 servers and 30 CPUs. Of course, these servers are all horribly expensive, because they are server-virtualised, network-virtualised, doubly redundant (high-availability plus disaster-recovery) and heavily monitored, with high-end support contracts.
One of our most sensitive issues is database load.
We have a ton of database tables (about 200) and lots of medium-sized queries running across them. One of our main failure modes is that some deep change to the code increases the volume of a significant query, which stresses the database enough to cause contention and lock-storms, leading to site failure.
Complicating things, big parts of some pages are embedded in other pages. So attributing load and lag to one part of the website, or to Oracle, is tricky and hard to benchmark in advance (although we do load test the main load paths to catch the worst cases).
For a long time, we've had a mechanism for zoned profiling of the production site, so we can allocate wall-clock costs to different general areas of the site.
But it is fragile and unreliable, because it requires perfect maintenance, with every developer remembering to write this kind of thing everywhere:
    # Embed a foo widget in the current page
    $perf->push_timing('foo');
    foo($bar);
    $perf->pop_timing;

Since you don't know this profiling system exists unless you've seen it somewhere else in the code before, and it's hard to care about something that is orthogonal to the problem you are actually solving, this mechanism has degraded over time. While we still get some pretty Cacti graphs showing load breakdown, they are highly unreliable and you can never be entirely sure if the load attribution is correct.
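For anyone who hasn't seen this pattern before, the object behind push_timing/pop_timing is nothing exotic: roughly a stack of Time::HiRes timestamps, with each pop adding the elapsed wall-clock time to that zone's running total. A minimal sketch of such a helper (the package and method names here are hypothetical, not our actual internal code):

    package My::PerfTimer;    # hypothetical stand-in for our real $perf object

    use strict;
    use warnings;
    use Time::HiRes ();

    sub new {
        my $class = shift;
        return bless { stack => [], totals => {} }, $class;
    }

    # Start attributing wall-clock time to a named zone
    sub push_timing {
        my ( $self, $zone ) = @_;
        push @{ $self->{stack} }, [ $zone, Time::HiRes::time() ];
        return;
    }

    # Close the most recent zone and add its elapsed time to that zone's
    # total. Nested zones are also counted in their parent (inclusive time).
    sub pop_timing {
        my $self = shift;
        return unless @{ $self->{stack} };
        my ( $zone, $started ) = @{ pop @{ $self->{stack} } };
        $self->{totals}{$zone} += Time::HiRes::time() - $started;
        return;
    }

    # Zone name => accumulated seconds, for feeding into the graphs
    sub totals {
        return %{ $_[0]->{totals} };
    }

    1;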
Aspect::Library::ZoneTimer automates this kind of zone accounting, using Aspect to weave the timing calls in for us. This example breaks out the cost of a typical small web application into a general zone, a special zone for the search page, and then splits the costs of generating widgets, rendering the HTML template, waiting for the database, and making calls to the underlying system.
    use Aspect;

    aspect ZoneTimer => (
        zones => {
            main      => call 'MyProgram::main' & highest,
            search    => call 'MyProgram::search',
            widgets   => call qr/^MyProgram::widget_.*/,
            templates => call 'MyProgram::render',
            dbi       => call qr/^DB[DI]::.*?\b(?:prepare|execute|fetch.*)$/,
            system    => (
                call qr/^IPC::System::Simple::(?:run|runx|capture)/
                |
                call 'IPC::Run3::run3'
                |
                call qr/^Capture::Tiny::(?:capture|tee).*/
            ),
        },
        handler => sub {
            my $top       = shift; # "main"
            my $start     = shift; # [ 1279763446, 796875 ]
            my $stop      = shift; # [ 1279763473, 163153 ]
            my $exclusive = shift; # { main => 23123412, dbi => 3231231 }

            print "Profiling from zone $top\n";
            print "Started recording at " . scalar(localtime $start->[0]) . "\n";
            print "Stopped recording at " . scalar(localtime $stop->[0]) . "\n";
            foreach my $zone ( sort keys %$exclusive ) {
                print "Spent $exclusive->{$zone} microseconds in zone $zone\n";
            }
        },
    );
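Since the handler is just a callback, in production you would push these numbers at whatever collects your graphs rather than printing them. As a rough sketch of that idea (the log path and line format below are invented for illustration, not part of ZoneTimer itself), the handler could simply append one record per request for an external poller to aggregate:

    use strict;
    use warnings;

    # Drop-in alternative to the print-based handler above. It writes one
    # line per request; an external collector (Cacti, or anything similar)
    # can then sum the per-zone microsecond counts into graphs.
    my $log_handler = sub {
        my ( $top, $start, $stop, $exclusive ) = @_;

        open my $fh, '>>', '/var/log/myapp/zone-timing.log' or return;
        print {$fh} join(
            ' ',
            $start->[0],    # epoch seconds at the start of the request
            $top,           # top-level zone for this request
            map { "$_=$exclusive->{$_}" } sort keys %$exclusive,
        ), "\n";
        close $fh;
    };

    # aspect ZoneTimer => ( zones => { ... }, handler => $log_handler );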
Thanks Adam,
this is great work. I'm eager to try it on our systems. One of the most difficult things for me is to predict system behaviour when rolling out changes to the entire cluster instead of just, say, one box.
On one of our largest systems, when there's a potentially critical change, we roll it out on one box first, then 2/3, and if we don't sense any dramatic changes, we deploy to the full cluster. As I said, though, sometimes this is not enough.
The only strategies I can think of are:
1) randomly enabling the new feature/change for a sample of the users (either A/B testing or rand(x)>y); a rough sketch of this follows the list
2) set up an independent parallel staging cluster, and replicate near-production load. Not easy, and it requires lots of resources.
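For strategy 1, hashing the user id keeps the sample stable across requests, unlike a bare rand(x)>y check that flip-flops on every page view. A rough sketch (the helper name, feature name and threshold are made up):

    use strict;
    use warnings;
    use Digest::MD5 ();

    # Hypothetical helper: deterministically enable a feature for a fixed
    # percentage of users, so a given user sees consistent behaviour.
    sub feature_enabled {
        my ( $feature, $user_id, $percent ) = @_;
        my $bucket = hex( substr( Digest::MD5::md5_hex("$feature:$user_id"), 0, 8 ) ) % 100;
        return $bucket < $percent;
    }

    # e.g. roll the change out to 10% of users first, then raise the number
    my $user_id = 12345;
    if ( feature_enabled( 'new_search', $user_id, 10 ) ) {
        # new code path
    }
    else {
        # old code path
    }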
Do you have any war stories about that?