RTFM - I should have learned it by now

htoug on 2004-12-03T09:56:15

I've just wasted nearly a whole week chasing a weird bug in DB_File (or so it seemed at first).

We have a number of Alphas running Tru64 (about 20) and our home directory is NFS-mounted on all the machines (so you have your files with you everywhere).

The web-development Alphas have just been upgraded from Tru64 v4.0G to v5.1B - the server people did a clean install, so as to remove any lint collected over the years.
I had sync'ed /usr/local (with our Perl environment and other tools) from the Perforce depot and tested.
Everything was OK.
The machines were released to the users and everyone seemed happy, until the guy at the desk next to me said that his Apache::ASP setup was acting strangely: it crashed whenever he fiddled with the $Session variable in his scripts.
"Hmmm...." I said. "I'll have a look at that". Famous last words.

The Apache server crashed with a segmentation violation deep down inside Apache::ASP's internals - I finally tracked it down to a call to DB_File.
"No problem - I'll just upgrade DB_File" I thought - perhaps something had broken in the upgrade, so a recompilation might be called for.
Download the latest DB_File, run perl Makefile.PL and make test, and boom: DB_File fails in its tests.

Weird - DB_File shouldn't SEGV.

Check rt.cpan.org for known bugs: I found one, where someone reported that the version of the underlying BerkeleyDB delivered with Tru64 v5.1B had a bug.

Download the latest BerkeleyDB, compile and install. Retest DB_File: still boom.

Now I'm getting desperate. One of our test machines runs the same Apache::ASP application that originally gave the error, so I looked there and everything seemed OK - it was at a lower patch level than the dev box but otherwise the same.
I tested DB_File on that: boom, hmmmm....

I looked in the DB_File code, and it seems that it sends a wrong argument to the db_del function - deep down in a macro in an XS-file. I wanted to make a small test-case, so that I could see if that was the error. So I headed back to BerkeleyDB, and looked for example code.
As the README scrolled past on my screen, something about 'NFS' caught my eye:

* To run the test harness for this module, you must make sure that the directory where you have untarred this module is NOT a network drive, e.g. NFS or AFS.
I was doing all this in the 'work' subdirectory of my home directory - which is NFS-mounted!

Move over to /tmp and try again.

Everything works!

It seems that Tru64 v4.0G had an undocumented feature that allowed BerkeleyDB files to be on an NFS-mounted filesystem. This feature has been removed in v5.1B.

Once again the lesson is: when debugging, take your time. Check every step, and do read the READMEs - don't assume that you remember them from the last time you installed something; things may have changed.

The fix was to move Apache::ASP's session-state files away from the home directory tree to somewhere in /tmp, and everyone is happy.
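
One way to catch this class of problem earlier is to check whether a directory sits on a network filesystem before putting database files there. The following is only a rough sketch in Python, not anything from DB_File or Apache::ASP: it assumes a Linux-style mount table (device, mount point, filesystem type per line, as in /proc/mounts), and the function names, the sample table, and the "fileserver" host are all made up for illustration. On Tru64 the mount information would have to come from somewhere else, such as the output of the mount command.

```python
import os

# Filesystem types treated as "network drives" in this sketch (an assumption;
# extend the set to match whatever your site actually uses).
NETWORK_FS_TYPES = {"nfs", "nfs3", "nfs4", "afs", "cifs", "smbfs"}

# Hypothetical mount table in /proc/mounts format, used only for the demo.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw 0 0\n"
    "fileserver:/export/home /home nfs rw 0 0\n"
    "tmpfs /tmp tmpfs rw 0 0\n"
)


def parse_mounts(mount_table_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in mount_table_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.normpath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        point = os.path.normpath(point)
        # Match the mount point itself or anything below it, so that
        # "/homework" is not mistaken for something under "/home".
        inside = path == point or path.startswith(point.rstrip("/") + "/")
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def is_network_path(path, mounts):
    """True if path lives on one of the filesystem types listed above."""
    return fs_type_for(path, mounts) in NETWORK_FS_TYPES


if __name__ == "__main__":
    mounts = parse_mounts(SAMPLE_MOUNTS)
    for candidate in ("/home/htoug/work", "/tmp/asp-state"):
        print(candidate, "network" if is_network_path(candidate, mounts) else "local")
```

On a real Linux machine you would feed it the contents of /proc/mounts instead of the sample table.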


amen to that:

Qiang on 2005-09-01T13:12:52

when debugging, take your time

a rule we all know but don't apply sometimes or, worse, most of the time. the same can be applied to day-to-day life. i don't remember how many times i have relearned that rule. Doh!