Load

Matts on 2004-05-12T19:45:49

Some day I need to write up a "here's something cool we did with AxKit/Perl" about MessageLabs' spam quarantine system (called "Spam Manager"). But suffice it to say for now that we did it with AxKit and Perl, and it is very cool.

However, we're experiencing some very "interesting" load problems with it. We're seeing the load average go up on some servers while the CPU usage sits at no more than 5%.

Of course, load average isn't tied directly to CPU usage, but most people only see the load go above 1.0 when their CPU is fully utilised. Load is a measure of runnable processes - although that's terribly poorly explained practically everywhere (I did find a good explanation of it but lost the link - so if you don't know what load average really means I can't help you :-).

So this looks like something to do with the kernel not being able to context switch processes in fast enough. Usually we've managed to tie this down to bad duplex settings on the network interface (half duplex instead of full duplex). However, recently we've seen the problem again on a box where the network interface is just fine.
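For ruling the duplex issue in or out quickly across boxes, here is a minimal sketch. It assumes a Linux system that exposes `duplex` and `speed` files under `/sys/class/net/<interface>/`; the function names are made up for illustration, and interfaces that are down or virtual may not report a value at all.

```python
import os

def link_settings(sys_net="/sys/class/net"):
    """Map interface name -> (duplex, speed) as reported by sysfs.

    Values are strings such as "half"/"full" and "100", or None when
    the attribute cannot be read (interface down, loopback, etc.).
    """
    result = {}
    if not os.path.isdir(sys_net):
        return result
    for iface in sorted(os.listdir(sys_net)):
        values = []
        for attr in ("duplex", "speed"):
            try:
                with open(os.path.join(sys_net, iface, attr)) as fh:
                    values.append(fh.read().strip())
            except OSError:
                values.append(None)
        result[iface] = tuple(values)
    return result

def half_duplex_interfaces(settings):
    """Return the names of interfaces that report half duplex."""
    return [name for name, (duplex, _speed) in settings.items()
            if duplex == "half"]

if __name__ == "__main__":
    settings = link_settings()
    for name, (duplex, speed) in settings.items():
        print(name, "duplex:", duplex, "speed:", speed)
    print("half duplex:", half_duplex_interfaces(settings))
```

An empty "half duplex" list only says the interface negotiated full duplex; it says nothing about what the switch port at the other end thinks.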

Debugging this is practically impossible - we can't reproduce it on demand or isolate it. I welcome any tips from anyone here who has experience with this. My next port of call is to look at the SQL Server that the box is connected to, and see if that has any relevance, but I don't hold out much hope of finding anything. I'm kinda stuck.
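Since the problem can't be reproduced on demand, one cheap thing to capture while it is happening is which processes the kernel is actually counting. The following is a minimal sketch, assuming a Linux box with `/proc` mounted; the function names are hypothetical. It reads the load figures from `/proc/loadavg` and tallies process states from `/proc/<pid>/stat`, where `R` is runnable and `D` is uninterruptible sleep (usually blocked on I/O).

```python
import glob
import os
from collections import Counter

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable_now": int(running),
        "total_tasks": int(total),
    }

def parse_stat_state(text):
    """Return (command name, state letter) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    name = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return name, state

def snapshot(proc_root="/proc"):
    """Count processes per state and list the commands stuck in 'D'."""
    counts = Counter()
    blocked = []
    for path in glob.glob(os.path.join(proc_root, "[0-9]*", "stat")):
        try:
            with open(path) as fh:
                name, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        counts[state] += 1
        if state == "D":
            blocked.append(name)
    return counts, blocked

if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as fh:
        print(parse_loadavg(fh.read()))
    counts, blocked = snapshot()
    print("runnable (R):", counts["R"], "uninterruptible (D):", counts["D"])
    print("blocked commands:", sorted(set(blocked)))
```

Run from cron every few seconds and logged with a timestamp, something like this would at least show whether the load spike lines up with a pile of `D` processes rather than runnable ones. It only looks at each process's main thread, so it is a rough picture, not an exact account of what the kernel counts.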


OProfile

Dom2 on 2004-05-12T21:23:46

Maybe it's worth installing OProfile to try and find out what's happening?

Never used it myself, but it looks like the right kind of tool...

-Dom

Load isn't a complicated concept.

btilly on 2004-05-12T23:33:33

But the only good definition is the technical definition. When the scheduler says, "I have a timeslice to hand out", how many processes are lined up, ready to take that slice (on average)?

It means nothing more, and nothing less.

What matters after understanding that is that there is no simple intuitive understanding of what that means. If you have a single CPU-bound process, it always wants a timeslice, and will contribute 1 to your load. That process may have priority 20 and lose to everything else, so the system is just as fast, but your load average still went up by 1. Or that process may just be doing a busy wait - the equivalent of a kid continually asking, "Are we there yet?" Doesn't matter how useless it is, it raised the load average by 1.
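To put a number on "contributes 1 to your load", here is a rough simulation of how a 1-minute load average reacts to a single always-runnable process. It assumes the Linux-style scheme of sampling the run queue about every 5 seconds and folding each sample into an exponentially damped average with a 60-second time constant; other kernels differ in the details, and `simulate_load` is just an illustrative name.

```python
import math

def simulate_load(run_queue_samples, interval=5.0, window=60.0, start=0.0):
    """Return the damped load average after each run-queue sample.

    interval: seconds between samples (assumed 5 s here)
    window:   time constant of the average (60 s for the 1-minute figure)
    """
    decay = math.exp(-interval / window)
    load = start
    history = []
    for queued in run_queue_samples:
        # Old load fades by 'decay'; the new sample fills in the rest.
        load = load * decay + queued * (1.0 - decay)
        history.append(load)
    return history

if __name__ == "__main__":
    # One CPU-bound process, always runnable, watched for five minutes.
    busy = simulate_load([1] * 60)
    print("after 1 minute:  %.2f" % busy[11])   # about 0.63
    print("after 5 minutes: %.2f" % busy[-1])   # about 0.99
```

The figure creeps towards 1 and stays there for as long as the process keeps asking for the CPU, regardless of its priority or whether it is doing anything useful.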

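You could have 10 I/O bound processes that tie up resources, locks, RAM, and other critical stuff. Since they are always waiting for something, they don't count as runnable no matter how much they are stressing out your system. (One caveat: Linux also counts processes in uninterruptible sleep, typically blocked on disk or NFS I/O, so on Linux such processes can push the load up while the CPU sits nearly idle.)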

Depending on what constitutes that load, a load of 0.9 could mean that your system is a hair away from melting down, or a load of 40 could mean that you're doing OK.

For all intents and purposes, load is a meaningless number. It is easy to measure, so people always look at it. But unless you understand what your machine is doing and why it is registering load, there is no way to tell what any particular figure means.

Is that a good enough explanation? :-)

Re:Load isn't a complicated concept.

Matts on 2004-05-13T07:18:44

"Is that a good enough explanation? :-)"

It's not bad :-)

There's a lot more detail at the link I was harping on about in my post, which I decided to go and find again so I could bookmark it this time. It's Here.