Buggy Hardware

Matts on 2005-03-11T20:13:46

D. Richard Hipp just posted this great story to the sqlite list:

I've been struggling for days to get version 3.1.4 out. Every time I would run the regression test I would get failures. The failures would not always be at the same place, but I would always get one or two.

I frequently got failures in the memory-db tests where we create a large in-memory database, make lots of changes, roll those changes back, then verify that the database holds exactly the same information as it did before the transaction. In a database of about a megabyte in size, I would sometimes see a single bit difference after the rollback. The bit that changed would always be the 0x08 bit. But the location of the change within the database was seemingly random.
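The shape of such a rollback check can be sketched in a few lines. This is not SQLite's actual regression test, only a minimal illustration using Python's standard `sqlite3` module; the table `t`, its columns, the row count, and the function names are all assumptions made up for the example.

```python
import sqlite3

def snapshot(conn):
    """Return every row of the hypothetical table t, in a stable order."""
    return conn.execute("SELECT id, payload FROM t ORDER BY id").fetchall()

def rollback_preserves_contents(n_rows=1000):
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/ROLLBACK statements below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload BLOB)")
    conn.executemany(
        "INSERT INTO t (id, payload) VALUES (?, ?)",
        [(i, bytes([i % 256]) * 64) for i in range(n_rows)],
    )
    before = snapshot(conn)

    # Make a batch of changes inside one transaction, then undo them all.
    conn.execute("BEGIN")
    conn.execute("UPDATE t SET payload = zeroblob(64) WHERE id % 3 = 0")
    conn.execute("DELETE FROM t WHERE id % 5 = 0")
    conn.execute("INSERT INTO t (id, payload) VALUES (?, ?)", (n_rows, b"new"))
    conn.execute("ROLLBACK")

    after = snapshot(conn)
    conn.close()
    # On healthy hardware the two snapshots should be identical.
    return before == after
```

On a sound machine `rollback_preserves_contents()` should return `True`; a single flipped bit anywhere in the stored payloads would make the comparison fail.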

I was talking with Dan about this yesterday - he was unable to reproduce the problem. So I said "Maybe it's hardware?" "Not likely", Dan replied. And rightly so. No programmer ever wants to admit that a nasty problem might be lurking in their own code. It is always easier to blame something else - some library you are linking against, the operating system, the hardware you are running on. But at the end of the day, the problem usually does end up being in your own code and not elsewhere. So after you have been programming for a while (decades in my case) you begin to be very suspicious when people go blaming malfunctions on the parts they didn't write.

But last night, I was at wit's end trying to track down the problem in SQLite. I figured it can't hurt to test the memory, so I rebooted using the SuSE install disk which happens to have a nifty memory checker built in. About 10 minutes into the test, some errors popped up. On a 512MB SIMM, fewer than 10 memory cells were showing a problem, and then only if a specific bit pattern was written into adjacent cells. The error was always in the 0x08 bit. I removed the offending SIMM, rebooted, and all tests passed.
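The comparison step of a pattern test like this can be illustrated with a short sketch. A real memory checker runs outside the operating system against physical addresses, which ordinary Python cannot do, so the fault below is simulated; `find_bit_errors` and the test pattern are hypothetical names and values chosen for the example.

```python
def find_bit_errors(written: bytes, read_back: bytes):
    """Return (offset, xor_mask) for every byte that differs.

    The XOR of the written and read-back byte has a 1 in exactly the
    bit positions that flipped.
    """
    if len(written) != len(read_back):
        raise ValueError("buffers must have the same length")
    return [
        (i, w ^ r)
        for i, (w, r) in enumerate(zip(written, read_back))
        if w != r
    ]

# Alternating bit patterns in adjacent bytes (an assumed test pattern).
pattern = bytes([0x55, 0xAA]) * 8

# Simulate a bad cell: flip the 0x08 bit of the byte at offset 5.
corrupted = bytearray(pattern)
corrupted[5] ^= 0x08

errors = find_bit_errors(pattern, bytes(corrupted))
print(errors)  # [(5, 8)]
```

A mask that is always `0x08`, at offsets that vary from run to run, matches the symptom described in the story.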

I find it utterly amazing that a machine with bad memory could run a full-blown Linux desktop and a copy of Win2K running in VMWare for days on end without showing a problem, then suddenly begin having trouble with the SQLite regression suite. Yet that is what appears to have happened.

Now it is still always the best policy to blame your own code first. When something isn't working right, the person sitting behind the keyboard is the most likely cause. Sometimes you will run into problems with the library you are using, or with your compiler, or your OS, but those cases are rare. Hardware is seldom an issue. But as this case shows, sometimes, very rarely, it really can be the hardware's fault.