Machine "Recovery"

pjf on 2005-06-07T09:12:01

Machine "Recovery"
Machines should be designed with consideration for when things go wrong. Now, let's pretend that you have an "appliance". It has an LCD screen, rather than a keyboard and monitor. If you're lucky, console on serial has been enabled. It has "no user servicable parts inside", and while it runs Linux underneath, it's primarily marketed at non-Linux gurus.

You're picking a filesystem to place on one of these "appliances". Do you choose a filesystem like ext2/ext3, which is used by practically every Linux system worldwide, has good stability, and excellent recovery tools? Or do you choose something else?

For these appliances, a different filesystem was used. This filesystem is a fast journaling filesystem, but it has some problems when things go wrong. It doesn't have a real fsck program, it just assumes that it's always good, all the time. So when there is a problem and corruption does occur for whatever reason, finding and repairing that can be difficult.

This brings me to my situation. There was corruption on the root filesystem. There was also a repair tool, on the root filesystem. The repair tool will not execute on a mounted filesystem, not even one mounted read-only. Normally this is fine, boot from external media and repair, but these appliances have no option to boot from external media. There's nowhere for that external media to go.

So, my options are to get a new copy of the filesystem tools (which can allegedly repair a read-only mounted filesystem), or to pull the disks (RAID-1) from the machine, get them mounted in another machine, get the tools on that, get them working properly as a RAID, repair the filesystem, and put them all back into the appliance that contains no user-servicable parts.

I decide to try the newer version of the tools.

Being a rather cautious person, I take an image of the root filesystem first, and ask the new tools what they would do with it. Well, they'd make some pretty big changes to the superblock that look very suspicious; then they'd tweak a few bitmaps here and there; and then they'd segfault.

Having your filesystem repair tool segfault on step two of seven in the repair process is not a way to fill your system administrator with confidence. It's very, very bad. Some people have nightmares where they're drowning, or speaking naked, or their house burns down. Sysadmins have nightmares that their filesystem tools segfault part-way through changing a filesystem.

Now the tools do say they could potentially segfault if run in no-write mode, since they can't repair the things which later parts of the tools expect to use, but my faith is still quite shaken. Okay, let's try the tools again on the image in write mode...

The tool walks through the image, makes a number of changes, and says all-done. I'm not convinced that relocating the entire superblock to a totally different location was a good idea, so I try to mount the file using a loopback device. No joy here, the kernel on these appliances can't handle a loopback image of more than 2Gb. Welcome to 32-bit limit land.

At this point I'm pretty much decided that running the repair on the real filesystem is a Bad Idea, but I decide to continue a few more tests on my image. I mean, it should contain a perfectly good filesystem now, right? So if I run the repair tools again, it should come up all clean? Wrong!

Running the repair tools again claims there are still errors (different ones, and much less) on the system. It also claims that they're fixed, but running it again reveals the same errors in the same places, right where I feel the superblock should be sitting.

Needless to say, I'm recommending that data and services me migrated off this appliance and the entire thing rebuilt/reinstalled from scratch.

I'm only glad that these machines looks like they're going to be retired before the end of the year, and replaced with new servers running ext2/3, with ports for a keyboard and monitor as well as a serial console, and the most blessed ability to boot into single-user mode.