Unix schedular considered inadequate

scrottie on 2005-03-28T18:18:53

The Unix schedular and the virtual memory system use conflicting algorithms and fight with each other. The schedular favors interactive processes that require CPU only in bursts. The virtual memory system favors programs that demand lots of RAM. The schedular punishes long running CPU pigs. The virtual memory system punishes interactive programs that only run now and then. As a result, a Unix system fights with itself and the result is a loaded system with 256 megs of RAM has as bad of interactive performance as a 20 year DEC 3100 with 8 megs of RAM - running the same version number of the same operating system.

The virtual memory system and its page file is critcal to performance a system with too little memory, such as a heavily loaded server or a workstation running the Mozilla browser which leaks memory faster than any other single Unix application I've ever seen in my entire life except for the RealMedia Server which is infamous in its leakiness.

Let me give you an example. The other day, I was at pantsfactory.org, a computer security blog an old friend of mine helps run. They had a link with the text Dowd IM'd me this one as a heads up... hope you're not using Firefox just because IE sucks so bad!. Mindlessly, I click on the link. My HD starts to buzz and grind. I stare blankly at my system for a few moments, not having had my morning coffee yet. The grinding continues. Slowly, in my head, I start to add two and two: I had just clicked a link to an exploit for the browser I'm running and all of my files are being deleted. The system had been grinding for 30 seconds, non-responsive. I double click the x-term I'd been using 30 seconds ago, and the window manager draws its border but the xterm doesn't repaint itself (not a fancy xterm - just the little tiny xterm that comes with X11). More time pases. I think about pulling the plug and yanking out the battery but I have a load of unsaved work, too! Finally, the xterm draws in, and I killall Mozilla and start looking at my files (the system grinds for a solid minute after I kill Mozilla as Linux reclaims the memory). Nothing seems to be gone. All of my files seem to be there. I use Dillo to look at the page that had caused the mess: it's perfectly innocent. Mozilla had simply been running too long and had built up a half a gigabyte of allocated VM. Opening that first link of the day pushed out the RAM for most of X, all of the other programs I was running, the program pages for bash, xterm, and everything else - except for Mozilla, parts of which it was feveriously throwing around trying to get them all to fit into a memory a quarter the size of the crap the program had allocated.

Bloatware like Mozilla uses a large number of memory pages and uses them frequently, so, as long as smaller applications aren't actively running, they quickly get swapped out to disc. Once an application like xterm+bash is issued a command or asked to redraw, it gets CPU because it has a higher priority (actually, a low priority is a high priority, but I'm using the logical sounding terms rather than the correct ones). xterm+bash get more and more of their pages swapped in, slowly, over time, as pages get swapped back out to disc because they haven't used them frequently because they're waiting for other pages. This is like filling a leaky bucket. Then bash+xterm begin to respond - a full minute later. Comparing the performance of NetBSD/current on a reasonably well equiped machine made two years ago versus one made 20 years ago, it takes about the same amount of time for the system to become responsive to interactive input after a memory bound application was allowed to run for a minute. In both cases, the footprint of the software to run (parts of X, xterm, and bash) take the same amount of memory - less than 8 megs but it takes over a minute of having parts swapped in and then back out because a third part wasn't swapped in and xterm+bash waited too long for it to swap in. NetBSD has a similar VM strategy to Linux. .

Favoring interactive processes by the schedular doesn't make long running applications take longer to run - it simply makes interactive applications more responsive, VM permitting. However, if the virtual memory system were to favor interactive processes, the system would have to do a larger total amount of paging and long running applications would take longer. However however, disc bound applications quite frequently are run in error and giving undue amounts of VM resources to them only wastes time as the programmer or admin tries to kill them. While the Unix machine just wants to run its task as quickly as possibily, admins and programmers often want the system to just stop, damn it!. Other applications and users are starved for CPU and service while we struggle to kill the run-away process. In general, the system would run more slowly but more efficiently if interactive applications were favored on the CPU.

There are two topics here:
1. Conflict in design between the schedular and VM system causes pathological behavior that greatly and unnecessarily increases swapping.
2. Superficial solutions exist, but to truly solve the problem, the system must be optimized for either batch or interactive processes, not an indecisive mix, and I argue for favoring interactive performance.

Yes, I'm really suggesting that the pig Mozilla sit there and struggle with 8 fewer megs while lean and mean xterm+bash move quickly. As for implementation, using a most-frequently-used algorithm rather than most-recently-used would require a little more CPU but would radically speed both interactive and batch tasks up as swapping is done more intelligently. A small, lean application like xterm uses only a few hundred pages during normal use, but it uses those pages extremely frequently. A large application like Mozilla uses millions of pages, and it uses hundreds of thousands of those pages fairly constantly, but it uses those far less frequently than xterm+ bash use their hot path code pages. Using the MMU, Unix like systems can tell whether a page has been read. They can't tell how many times the pages have been accesses - merely whether they've been accessed since the last time the counters were reset. On some platforms, this is implemented in hardware as bits in the processes page tables. On other platforms, it is emulated by the OS by marking the pages protected and then catching the page fault when a read is attempted. Right now, most platforms will do a few passes like this. Two passes is enough to divide all of the pages into two categories: those that have been accessed in the last time period (in microseconds) and those that haven't. This is far too crude to make intelligent decisions about something so expensive such as which pages to move to disc. (Note to self: this discussion should be moved up, perhaps before the story.) Three passes would better indicate which pages tend, with very approximate statistic accuracy, to be used more often than others. Four passes would improve the accurancy. Five passes would be even better. Allowing more than one pass implies that a byte or more of overhead per page (4k or 8k of physical RAM, usually) would have to be allocated non-swappable to store information about the frequency of use of the page. At this point, a decaying average of usage frequency might as well be used, and then usage information as far back as one hundred ticks would influence the frequency usage score for each page. Then, when the VM system swaps a page out, it is swapping out a page that is statistically the least likely be used again in the near future.

Intelligent comments, please. I'd like to make this an Advogato article.

-scott

Partial Mozilla Solution

chromatic on 2005-03-28T19:49:40

I run a single-user Squid instance on my workstations and disable the Mozilla and Firefox memory caches. That seems to trim their memory growth. Perhaps it will help you.

Writing is interesting

scrottie on 2005-03-29T18:07:12

This writing this is interesting. I've yet to feel like I've managed to make my point while doing it. Perhaps my points are so frail as to break down under the scruntiny of explanation. Perhaps I need to learn to be concise rather than try so hard to "prove" my point.

Novices often suspect Unix priorities don't really do anything - setting an interactive session to the lowest (maximum) priority doesn't actually make it more responsive. While such novices eventually learn why this happens and come to accept it, I suggest this behavior is wrong and should not be accepted.

Virtually all systems have excellent interactive performance when running only interactive applications out of RAM, but fewer maintain good interactive performance while running both batch applications and interactive applications. Unix performs extremely poorly in this scenario. Running Mozilla for a week and running an xterm at the same time tickles just as well as traditional batch applications along side interactive sessions.

The schedular favors processes that don't attempt to use the maximum amount of CPU available to them - however, these processes are almost never runnable because the virtual memory system recgonizes them as not recently having used their pages if they idle for even a fraction of a second, and the VM system swaps them out to make room for processes actually running at a lower priority. Worse, when an interactive application does attempt to use some CPU, its pages are swapped out almost as fast as they are swapped in - pages are litteraly go cold and are swapped back out while it waits for other pages to be swapped in. This manifests itself as applications taking an excessively long time for their size to page back in to memory.

In summary, the algorithms employed in Unix are not only excessively simplistic but actually clash with each other and perform pathologically badly.

-scott