The Strangest Bug

Ovid on 2002-10-22T16:34:41

Well, we're stumped. A site that had previously been working fine with MS SQL Server 2000, IIS 5.0, and Windows 2000 has started to crash and burn ... intermittently. I've been working on some updates, and a section of code that I have never changed is causing the problem.

if ( $page eq $page_control{default}{page} )
{
    my $loginName = $db->get_user_info($user);
    my $message   = $db->get_admin_message;

It's that last line that's killing us. I've even traced it down to the actual statement handle execution internally, and I still don't know why it fails. That's not even the weird part. Normal debugging methods have failed because when this crashes, everything locks up. We get no error messages from the database, from IIS, from anything.
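If nothing else, DBI's own trace can be pointed at a file so that at least something hits disk before everything freezes. Roughly this (a sketch; our real code goes through the $db wrapper, so the exact hook point and the log path here are made up):

use DBI;

# crank up DBI's internal tracing to a file right before the suspect call,
# so the trace log survives even if the process never comes back
DBI->trace( 2, 'c:/temp/dbi_trace.log' );

my $message = $db->get_admin_message;   # hangs somewhere inside execute()

DBI->trace( 0 );                        # quiet things back down if we ever get this far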

I've tried $|++ and putting in carps before this code to narrow it down. No dice. There's nothing in the error logs.

I tried tossing in a BEGIN block to dump state information. No dice. Still nothing in the error logs.
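For the record, the BEGIN block was nothing fancier than this (a stripped-down sketch, not the production code, and the log path is made up):

BEGIN {
    $| = 1;    # unbuffer STDOUT so anything we print escapes immediately

    # scribble a timestamp to a scratch file at compile time
    open my $log, '>>', 'c:/temp/startup.log'
        or die "can't open startup log: $!";
    print {$log} scalar(localtime), ": compile phase reached\n";
    close $log;
}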

It's almost as if the program senses that this method call, buried in a conditional a couple of hundred lines after the beginning of execution, is going to lock up, so it decides it's not even worth trying to run. If the program is executed in such a way that the condition is false, everything, including the debug warnings, works fine. The only way I managed to track down the bug was to dump some test output at the top of the program, put an __END__ token after it, and then binary-search my way through the program by moving that token until I had narrowed down the exact cause of the failure.
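If you've never had to resort to the __END__ trick, it looks roughly like this (a sketch; the idea is just to keep moving the token until the hang is bracketed):

warn "made it past compilation\n";   # the test output at the top

# ... the first chunk of the program, which runs fine ...

__END__

# perl ignores everything below the token, so if the hang goes away,
# the guilty line is somewhere down here; slide the token and repeat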

Truly, this is the strangest bug I have ever seen. Frankly, I suspect that this is some weird ISAPI error, but I can't prove it yet.

Update: We've narrowed the problem down to one database table that has one record and one field. We can't select, update, or delete that record, though we were apparently successful in inserting a new record. I say "apparently" because we still can't select anything from the table to verify it. We tried dropping and recreating the table, but SQL Server just times out when we try to drop it. Of course, this doesn't answer the question of why a problem with SQL Server would cause ISAPI to stop writing error messages to the logs ...
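If anyone else hits symptoms like these, the textbook culprit is an open transaction somewhere holding an exclusive lock on that row, which blocks every later select, update, and delete, and even the DROP. Something like this at least shows who is blocking whom (a sketch; sp_who2 is a standard SQL Server procedure, but the DSN and credentials here are invented, since our real connection is buried in the $db wrapper):

use DBI;

# invented connection details; ours live inside the $db object
my $dbh = DBI->connect( 'dbi:ODBC:live_site', 'webuser', 'secret',
                        { RaiseError => 1 } );

# sp_who2's BlkBy column points at the spid holding the lock; running
# DBCC OPENTRAN in Query Analyzer will then name the oldest open transaction
my $rows = $dbh->selectall_arrayref( 'EXEC sp_who2' );

for my $row (@$rows) {
    print join( "\t", map { defined $_ ? $_ : '' } @$row ), "\n";
}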


Win file locking issue?

wickline on 2002-10-22T18:54:08

What sort of database? Is each table a file?

If so...

Is it possible that an earlier instance of the script crashed or was interrupted while it had that file open? Can you read the file in a text editor, or does NT give you an error?

Do you see different behavior depending on whether you read/copy the file on the local machine vs via filesharing?

If you can't read the file from a share (and possibly locally), it may be that Win thinks the file is in use, possibly by a crashed/interrupted previous instance of the script (or maybe the script was fine, but Win just spaced out and forgot to remove the busy/lock note).
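One quick way to check from Perl, if you don't trust the text editor (just a sketch, and the path is made up):

# made-up path; point it at whatever file backs the troublesome table
my $file = 'C:/data/admin_message.csv';

if ( open my $fh, '<', $file ) {
    print "can read $file just fine\n";
    close $fh;
}
else {
    # "Permission denied" here, when nothing should legitimately have the
    # file open, smells like a stale lock left behind by a dead process
    print "can't read $file: $!\n";
}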

In that case, rebooting will solve the problem. Reboot, and you'll be able to read the file again.

Warning: if the contents of the file are screwed up, the next time your script runs it might fail when trying to read the file and crash, leaving the file in the same messed-up state (you can't access it until you reboot). So check the file contents after the reboot, before you run that script again.

I ran into this with DBD::AnyData (CSV files) shortly after moving a project to a new win2K server. I don't know what caused the initial corruption, but from that point on, any time the script tried to touch that file, everything went kaboom and the file was foobar until the next reboot.

The only symptom was that one day the script stopped working for most varieties of functionality. It took a bit of digging to find out that it failed on accessing a specific file, and then, since we couldn't get into the file, it took a bit longer to realize what was going on.

If this is happening to you, there may be other files which were horked at the same time... Any file which was open when the script crashed is suspect. Hopefully only the file-lockedness will be screwed up, but it might take a few reboots to be able to keep debugging and find which file has the corrupted content. You may have to wade through several files that have normal content but just got locked when the script crashed.

Or maybe this is nothing like what you're seeing...

Good luck :)

Post to let us curious minds know how it turns out!

-matt

Re:Win file locking issue?

Ovid on 2002-10-22T19:37:39

The database we're using is Microsoft SQL Server 2000. We stopped and restarted that server and, before anything else could touch the database, we successfully dropped the table. I've added it back and everything seems to work perfectly. Of course, we haven't done any load testing on the new table, so it might be a bug waiting for its opportunity to bite. It's extremely frustrating that we cannot find any logs anywhere that refer to this table or hint at what the problem might be.

I've double-checked the code which updates the table and it appears to correctly commit the transaction, so I don't think anything is hanging up there. I'm still scratching my head. If we get more info, I'll post it.
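For reference, the update path boils down to the usual DBI transaction pattern, roughly this (a paraphrased sketch with made-up names, values, and connection details; the real code lives inside our $db wrapper):

use DBI;

# invented connection details and placeholder values, for illustration only
my $dbh = DBI->connect( 'dbi:ODBC:live_site', 'webuser', 'secret',
                        { RaiseError => 1, AutoCommit => 0 } );
my ( $new_message, $message_id ) = ( 'Welcome back!', 1 );

eval {
    my $sth = $dbh->prepare(
        'UPDATE admin_message SET message = ? WHERE message_id = ?'
    );
    $sth->execute( $new_message, $message_id );
    $dbh->commit;                              # make the commit explicit
};
if ($@) {
    warn "update failed, rolling back: $@";
    $dbh->rollback;
}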

SQL Server Admin tools

dws on 2002-10-22T19:16:09

The better DBAs I've worked with have been able to extract a healthy pile of clues from the admin tools provided with SQL Server (e.g., who has what locked).

Since you're dealing with a single row in a single table, my number one cause of mystery grief on SQL Server--row locks getting escalated to page locks--probably isn't an issue.
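Still, if you'd rather poke at it from the Perl side than fire up Enterprise Manager, sp_lock will at least tell you what kind of locks are sitting on the table: RID (row), PAG (page), or TAB (the whole thing). A sketch, with invented connection details:

use DBI;

my $dbh = DBI->connect( 'dbi:ODBC:live_site', 'webuser', 'secret',
                        { RaiseError => 1 } );

# sp_lock reports, per spid: dbid, object, index, lock type (RID/PAG/TAB...),
# resource, mode, and GRANT/WAIT status
my $sth = $dbh->prepare( 'EXEC sp_lock' );
$sth->execute;
DBI::dump_results( $sth );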

Re:SQL Server Admin tools

Ovid on 2002-10-22T19:50:57

dws wrote: The better DBAs I've worked with have been...

Ovid replied: ha, ha, ha, grunt, snort, ha!

We let our "DBA" go on the grounds that we couldn't afford him. That's true because, while he might have been qualified to be an intern, he certainly wasn't qualified to be a senior DBA. He lied his @$$ off on his resume, and when he was hired, no one was capable of evaluating his performance -- not surprising, given that this company once hired a CTO who didn't know what FTP was. Then I returned to the company after a leave of absence, and the man who became my boss was hired. We were astonished at how little our "DBA" knew. He didn't appear to know SQL very well, did not understand the rules of database normalization, and we were forced to redesign everything he came up with. I can't tell you how frustrating it was trying to explain to our "DBA" that a user name is a rotten primary key (one of his rebuttals: "If you need to change it, you can just write a script to change all of the instances of it!" :-). This idiot was also, to the best of my knowledge, the highest-paid employee in the company. Somehow, some way, I need to develop the chutzpah to pull off a stunt like that.

I wish we had a DBA here, but we've been forced to struggle along without one. Of course, I also wish we had a QA department. I wish we had more permanent funding. I wish, I wish, I wish.

On the other hand, there's a kind of pleasant surrealness to working for a high-tech company that heats its office with a wood stove.

noarmuyzed

wickline on 2002-10-22T20:43:44

"If you need to change it, you can just write a script to change all of the instances of it!"

Hmmm...

Actually, you could remove a layer of indirection by eliminating the username entirely. That way, if it changes, you don't have to update anything at all. You'll still be able to refer to users by other unique fields. The probability that any two users have the same hire date and birth date is probably sufficiently low... and if not, you can just throw in more fields until you eliminate the duplication.

Then you just put all those fields into each relevant table instead of the no-longer-needed username (or that stupid extra arbitrary 'id' field you keep mentioning). Huh? What do you mean "normalization"? Listen, this will work... I'm telling you.

I guess you should probably put the username back in the table, though, because we may still need to refer to it. It may change (I guess you're right about that), so it probably won't make a good key, but we should keep it around. We'll just use all these other fields as our key.

If you need more fields later to eliminate an unexpected future duplication, you just write a script to duplicate all those fields for all the user-related records in all the tables too. In the very unlikely event that two records can't be distinguished by using more and more fields, only then would you have to add your arbitrary 'id' field which would auto-increment as needed.

The advantage of this approach is that if/when you ever get to the point that you have to add that extra 'id' field, you'll be able to demonstrate a dramatic decrease in database size and an increase in efficiency. You see, at that point you'll be able to use that 'id' field as the key and rework your tables to eliminate scores of fields. Because, you see, if you do it right, you can use just that one 'id' field instead of that big combination of fields.

And as every certified MS professional can tell you, space savings and application speed-ups are two important ingredients in the Bonu$$ recipe.
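(OK, dropping the act for a second: the boring version of that endgame, with made-up table and column names and invented connection details, is just the usual surrogate key, which is what Ovid was arguing for in the first place:)

use DBI;

# invented connection details, for illustration only
my $dbh = DBI->connect( 'dbi:ODBC:example', 'user', 'pass', { RaiseError => 1 } );

$dbh->do(q{
    CREATE TABLE users (
        user_id  INT IDENTITY PRIMARY KEY,     -- arbitrary, never changes
        username VARCHAR(50) NOT NULL UNIQUE   -- free to change whenever
    )
});

$dbh->do(q{
    CREATE TABLE orders (
        order_id INT IDENTITY PRIMARY KEY,
        user_id  INT NOT NULL REFERENCES users (user_id)   -- one small, stable key;
                                                           -- no script-driven renames
    )
});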

-matt, tongue firmly in cheek