Some day I need to write up a "here's something cool we did with AxKit/Perl" about MessageLabs' spam quarantine system (called "Spam Manager"). But suffice it to say for now that we did it with AxKit and Perl, and it is very cool.
However, we're experiencing some very "interesting" load-related problems with it. We're seeing the load average climb on some servers while CPU usage sits at no more than 5%.
Of course load average isn't tied to CPU usage, though most people only see the load go above 1.0 when their CPU is fully utilised. Load is a measure of runnable processes (and on Linux it also counts processes in uninterruptible sleep, usually waiting on I/O) - although that's terribly poorly explained practically everywhere (I did find a good explanation of it but lost the link - so if you don't know what load average really means I can't help you :-).
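For anyone who wants to look at the raw numbers: on Linux they live in /proc/loadavg. Here is a minimal sketch for pulling that line apart, assuming a Linux box and the usual five-field format (three averages, running/total scheduling entities, last PID). The names `LoadAvg`, `parse_loadavg` and `read_loadavg` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadAvg:
    one: float      # 1-minute load average
    five: float     # 5-minute load average
    fifteen: float  # 15-minute load average
    running: int    # currently runnable scheduling entities
    total: int      # total scheduling entities on the system


def parse_loadavg(text: str) -> LoadAvg:
    """Parse one line in /proc/loadavg format, e.g. '4.52 3.10 1.75 2/310 12345'."""
    fields = text.split()
    running, total = fields[3].split("/")
    return LoadAvg(
        float(fields[0]), float(fields[1]), float(fields[2]),
        int(running), int(total),
    )


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the live values (assumes a Linux-style /proc)."""
    with open(path) as fh:
        return parse_loadavg(fh.read())
```

Calling `read_loadavg()` on the affected server and comparing `one` against `running` is a quick way to see whether the load is made up of processes that are actually runnable right now.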
So this seems to be something to do with the kernel not being able to switch processes in fast enough. Usually we've managed to tie this down to bad duplex settings on the network interface (half duplex instead of full duplex). Recently, however, we've seen the problem again with the network interface being just fine.
Debugging this is practically impossible - it's neither repeatable nor isolatable. I welcome any tips from anyone here who has experience with this. My next port of call is to look at the SQL Server that the box is connected to and see if that has any relevance, but I don't hold out much hope of finding anything. I'm kinda stuck.
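One thing that might help narrow it down the next time the load spikes is a count of processes by scheduler state. The sketch below assumes a Linux-style /proc, where the third field of /proc/<pid>/stat is the state letter (R is runnable, D is uninterruptible sleep, S is sleeping). The function names are hypothetical, and it only looks at each process's main thread, not every thread.

```python
from collections import Counter
from pathlib import Path


def parse_stat_state(stat_line: str) -> str:
    """Return the state letter from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so take everything after the LAST ')' before splitting.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes by state letter under a Linux-style /proc tree."""
    counts = Counter()
    for stat in Path(proc_root).glob("[0-9]*/stat"):
        try:
            counts[parse_stat_state(stat.read_text(errors="replace"))] += 1
        except (OSError, IndexError):
            # The process exited while we were reading, or the line was empty.
            continue
    return counts
```

If `count_states()` shows a pile of D entries and hardly any R while the load is high, the processes are blocked waiting on something (disk, network, a remote server) and not fighting over the CPU.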
Never used it myself, but it looks like the right kind of tool...
-Dom
Re:Load isn't a complicated concept.
Matts on 2004-05-13T07:18:44
"Is that a good enough explanation?:-)"
It's not bad:-)
There's a lot more detail at the link I was harping on about in my post, which I decided to go and find again so I could bookmark it this time. It's Here.