Diminishing Marginal Utility of Tests

Ovid on 2008-09-12T13:46:19

(What follows isn't a particularly earth-shattering discovery, but it does detail how tortured my thought process can be when I finally face the obvious).

What are the odds that your test suite will catch 100% of the bugs in your software? As most of us know, those odds rapidly approach zero when you have more than, say, one line of code. (Thought experiment: how many real or potential bugs are in sub recip { 1 / shift }? It's more than the obvious "division by zero" error). Of course, we also know that we're not really writing tests to catch bugs. We're writing tests to assert if p, then q or if p, then not q. If we do find a bug, then we write a test, but that test is still some variant of if p, then ....
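
To make that concrete, here's a quick sketch of my own (not from the original post) of a few Test::More cases that one-liner already invites; the particular failure modes I'm poking at are my own picks:

    use strict;
    use warnings;
    use Test::More;

    sub recip { 1 / shift }

    # The obvious bug: division by zero is fatal.
    ok !eval { recip(0); 1 }, 'recip(0) dies';

    # Less obvious: no argument at all means dividing by undef, which
    # warns about an uninitialized value and then dies the same way.
    ok !eval { recip(); 1 }, 'recip() with no argument dies too';

    # A non-numeric string numifies to 0, so this also dies rather
    # than reporting anything more useful.
    ok !eval { recip('three'); 1 }, q{recip('three') dies};

    # A denormalized float doesn't die at all -- the reciprocal
    # silently overflows to infinity.
    cmp_ok recip(1e-320), '>', 1e308, 'tiny input overflows to Inf';

    done_testing;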

Now we already know that we can't cover all bugs with tests, but we also know that we can't cover all cases of if p. When was the last time you tried open my $fh, '<', $filename... when $filename contained a 3 megabyte string? Ever tried that? I haven't. Not many of us have. I could come up with tons of if p situations you've never thought of.
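
If you did want to write that test, it might look something like the following sketch; exactly what the OS says about an absurd filename is platform-dependent, so I only report the error rather than assert on it:

    use strict;
    use warnings;
    use Test::More;

    # The "if p" nobody tries: a 3-megabyte "filename".
    my $filename = 'x' x (3 * 1024 * 1024);

    ok !open(my $fh, '<', $filename),
        'open() fails on a 3 MB filename';
    # ENAMETOOLONG on Linux; other platforms may differ.
    diag "error was: $!";

    done_testing;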

This is because of the "path problem" of code. For any reasonably sized body of code it's impossible to predict all possible paths through the code with all possible data. You might do code coverage and have good statement, branch and condition coverage, but you can't get reasonable path coverage because that's NP-complete.
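
Here's a toy illustration of mine (not from the post) of why path coverage blows up even when branch coverage looks fine:

    use strict;
    use warnings;

    # Each independent branch doubles the number of distinct paths.
    sub classify {
        my ($n) = @_;
        my @tags;
        push @tags, 'negative' if $n < 0;
        push @tags, 'even'     if $n % 2 == 0;
        push @tags, 'big'      if abs($n) > 1_000;
        return @tags ? join( ',', @tags ) : 'plain';
    }

    # Three branches: two well-chosen tests hit every branch both ways,
    # but there are eight distinct paths. Thirty branches means roughly
    # a billion paths -- before you even ask which ones are feasible.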

So what does this mean? It means you're accepting that your test suite isn't perfect, but we're so used to that acceptance that we rarely think through its implications. Thinking through those implications is what I've been doing lately. As a result, my test suite, which took about 45 minutes to run yesterday, takes less than 7 minutes today.

I didn't get this performance from the triggers I was using. I got it from setting a 'FAST_TESTS' environment variable and skipping tests which take too long and provide marginal value. The latter point is really the key.
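
The post doesn't show the mechanism, but in a Test::More world the skip usually amounts to something like this (FAST_TESTS is the variable mentioned above; the rest is an assumed sketch, not our actual code):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Test::More;

    # Bail out of an expensive test file when a fast run is requested.
    plan skip_all => 'long-running test skipped under FAST_TESTS'
        if $ENV{FAST_TESTS};

    # ... the slow acceptance or spidering tests would go here ...
    pass 'placeholder for a long-running test';

    done_testing;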

The first thing I did was make sure that our 20+ minutes of acceptance tests were skipped. That's because developers shouldn't rely on acceptance tests. Then I took our "spider and validate" tests out -- that was another few minutes. I also removed our "database migration" tests. Those took a long time and used to silently fail anyway, demonstrating that they weren't that useful.

The main issue here is that we're required to do the full run before we commit to trunk, but it's OK to skip plenty of tests while developing. If doing this means we can develop faster and are more likely to run some tests, that's a win. We can't keep going on the way we have. We're also going to keep looking for tests to delete and you know what, it's possible (though not desirable) we may lose some coverage here. I think I'm OK with that. If it's too much pain to have those extra tests, are they really worth it?

More and more we see developers checking things in because they can't be bothered to run the entire test suite. Those who do (me) often don't get things done as quickly because of how often we run that damned suite. As a result, our beautiful, excellently covered, moderately well-organized test suite is hardly the useful tool it looked like. There are still plenty of other "speed up the tests" strategies we could employ, but in terms of bang for your buck, this may be the one for us. Test suites are almost always a compromise, but now we're staring the compromise in the face and deciding which trade-offs to make. All things considered, this is a huge relief.

When I eventually get around to finishing the SQLite backend for Test::Harness, you'll be able to make these decisions more confidently. You may be better prepared to note when test suites are being run. You may notice more failures as suites take longer and developers ignore them. You might notice long-running test programs which never fail -- raising the question of whether those could be skipped. I'm looking forward to having more tools which can let me analyze things like this and make appropriate decisions.


This is what smoke testing is for

autarch on 2008-09-12T14:39:06

At Socialtext, we had a similar problem with long test suite runs. The solution was to set up a system exclusively for running the test suite repeatedly. It'd check out various branches (including trunk), run the tests, and update Smolder. Smolder would email people watching the branch if tests failed.

Since we did all our unstable dev on branches (back then), this meant that we saw test failures well before they got merged to trunk, and usually we saw them within an hour or two of the actual checkin, making it relatively easy to track down the problem.

We'd also run tests manually as part of our dev, but I for one rarely ran the whole test suite that way.

Re: This is what smoke testing is for

mpeters on 2008-09-12T16:08:16

This is the point I was trying to make in http://use.perl.org/comments.pl?sid=40931&cid=64824. Maybe it's a management thing that Ovid can't control, but there's no reason to force people to repeatedly run automated tests. Computers do a great job at boring repetitive tasks.

Re: This is what smoke testing is for

Ovid on 2008-09-12T19:53:57

It's not a management thing at all. I understand that smoke testing is great for this, but I still want to be sure that I can run a good set of tests repeatedly while developing and not wait "an hour or two" to find out if there's a problem. I especially want to do this prior to a check in. Perhaps it's a difference in style, but I want comprehensive feedback immediately.

Re: This is what smoke testing is for

mpeters on 2008-09-12T20:18:40

I still want to be sure that I can run a good set of tests repeatedly while developing and not wait "an hour or two" to find out if there's a problem. I especially want to do this prior to a check in.

I completely agree with this. But why does that "good set of tests" have to be a predetermined list? Why can't it just be the tests that exercise the feature you're working on? This means it will be different for every developer and will change pretty much every day.
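
For what it's worth, with a prove-based suite that per-developer, per-day set usually just means pointing prove at the relevant files or directories; the paths here are made up:

    # Run only the tests covering the feature you're touching.
    prove -l t/billing/ t/api/billing.t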

Re: This is what smoke testing is for

chromatic on 2008-09-12T20:59:36

Why can't it just be the tests that exercise the feature you're working on?

If that gives you enough confidence that you haven't caused regressions elsewhere, great! That's not always the case.

A comprehensive test suite is incredibly valuable, and (sometimes) end-to-end tests are the best way to achieve that. I have my doubts about the utility of continuous integration servers, however, and I firmly agree with James Shore on ten-minute builds.

Re: most tests are a waste of time

Eric Wilhelm on 2008-09-13T05:28:04

You write a unit test, then the unit, then a unit test for the unit which uses that unit. Do you test the kernel first, or just the functionality you're running on top of the kernel (and libc (and perl))?

So, end-to-end testing for real-world inputs and outputs gives you all of the coverage you actually need, but of course unit tests have the nice property of isolating the functionality under test so that you can see what you're doing when you're working on that given chunk.

The question is whether the test is a development aid or a QA check.

But, when I look at the problem description: "do the full run before we commit to trunk", I think you need s/we/robot/, which is a tools problem and not a testing issue.

Re: most tests are a waste of time

chromatic on 2008-09-13T08:06:46

If you have good unit test coverage, and if you have well-coupled and well-factored units, you can get away with only a few comprehensive end-to-end tests. The trouble comes when your test suite is so slow that it's impractical to run it before every commit. Then you face the temptation to shove your tests off into a continuous integration server, and you risk checking in broken code and losing your momentum when you interrupt your current task to switch back to the previous task you didn't actually finish.

It's a bad situation.

Re: watching tests is a waste of human time

Eric Wilhelm on 2008-09-13T19:04:04

For me, waiting more than 30 seconds for the tests to run is too long and I've already lost momentum. If a well-covered change passes all of its unit tests and perhaps one level of units up from that, plus the bug test, your probability of failing any other test due to that change is low enough that you win overall by simply handing the rest of the smoke+checkin off to a bot (that could even run on your machine).

In the 5% of commits where the bot comes back and yells at you, you're still at break-even in the long run even if switching contexts takes 19 times as long as running the tests (but if it takes 4.5 hrs to recover your context after 15 minutes, you should probably just go home). If you're breaking the smoke 25% of the time, you can still write off the 15 minutes spent working on something else, and then lollygag yourself back into the problem for 30 minutes, because the other 75 times out of a hundred you would have wasted almost 19 hours watching the tests pass.
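
A quick back-of-envelope check of that break-even claim, using the comment's own framing (time wasted watching green runs versus time lost recovering context after the bot complains) and its hypothetical numbers:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $suite_minutes    = 15;                  # one full test run
    my $recovery_minutes = 19 * $suite_minutes; # assumed context-switch cost
    my $failure_rate     = 0.05;                # commits the bot rejects
    my $commits          = 100;

    # Watching: you waste a run's worth of time on every green commit.
    my $wasted_watching = ( 1 - $failure_rate ) * $commits * $suite_minutes;

    # Bot: you only pay the recovery cost on the failing commits.
    my $wasted_recovery = $failure_rate * $commits * $recovery_minutes;

    printf "watching: %d min per %d commits, bot: %d min\n",
        $wasted_watching, $commits, $wasted_recovery;
    # Both come to 1425 minutes, because 0.95 / 0.05 = 19 -- which is
    # where "19 times as long as running the tests" comes from.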

But me, well... I get impatient just waiting for the commit to complete on my local network.

Re: watching tests is a waste of human time

chromatic on 2008-09-13T19:18:31

I suppose everything depends on how often you run the full tests. For me, it's on average every 20 - 60 minutes. I can invest five minutes in that confidence before checking in a change.

If I ran the full test suite between every change I made to the source code even if I'm not ready to check in, that's a different story.

Re: This is what smoke testing is for

jplindstrom on 2008-09-13T12:27:45

Running the whole test suite before committing the merge to trunk is a given. That's not the problem; the trade-off between time and stability there is an easy one.

The problem we have is when too many people are working on the same thing in the same branch (generally one branch per feature).

So if you check in and break something, it's not so bad if it only affects you. Even if you don't know about it until an hour later, you still know what you were doing, and you know it's up to you to fix it.

The problem is when something breaks and you don't know who did it. That's the real time waster: having multiple developers sitting there investigating a failure because they all think they broke the build.

This is only a problem some of the time. Mostly the features are small enough, and there aren't too many people involved.

Re: This is what smoke testing is for

Ovid on 2008-09-13T14:07:59

Thanks for clarifying that. I should have pointed that out.

Re: This is what smoke testing is for

autarch on 2008-09-13T14:16:10

To clarify. At Socialtext, when we did this sort of thing (this was 2 years back and I hear things have changed), we had many, many branches. Most branches belonged to very few devs (often 1), and were for one feature or one bug fix. Some branches lasted only a commit or two.

That limited the scope of who was affected by breakage caught by the smoke tester.

The problem is, even getting a full test run down to 7 minutes (from 30, say) is still way, way too long to run all that often. When something takes 7 minutes, it might as well take 30, because you're going to stop watching the terminal and go look at a web page while you wait.