Anecdote Driven Development, or Why I Don't Do TDD

Ovid on 2009-03-08T11:50:38

Number 6
Where am I?
Number 2
In the Village.
Number 6
What do you want?
Number 2
We want information.
Number 6
Whose side are you on?
Number 2
That would be telling. We want information... information... information.

(From the opening of every episode of The Greatest TV Series Ever Made)

Just about everyone in the Perl community who does testing knows that I'm a huge testing fan. In fact, to steal a turn of phrase from Adrian Howard, you might even call me a "testing bigot". I wrote the new Test::Harness that ships with Perl's core. I've written Test::Most and Test::Aggregate. I maintain or have co-maintainership on a number of modules in the Test:: namespace. I was invited to Google's first Automated Testing Conference in London and gave a lightning talk (my talk on TAP is about 42 minutes into that). I was at last year's Perl-QA Hackathon in Oslo, Norway and I'll be at this year's Perl-QA Hackathon in Birmingham, UK. I was also one of the reviewers of Perl Testing: A Developer's Notebook.

In short, I'm steeped in software testing. I've been doing this for years. When I interview for a job, I always ask them how they test their software, and I've turned down one job, in part, because they didn't. If I'm just hacking around and playing with an idea, I don't mind buying some technical debt and skimping a bit on testing while I'm exploring a new idea. I've even posted some code to the CPAN which is a bit short on testing. That being said, I wouldn't dream of writing major software without testing, and I don't want to release any CPAN code as version 1.00 without comprehensive test coverage providing a minimum baseline of guaranteed functionality.

A number of years ago at OSCON I was attending a talk on testing when Mark Jason Dominus started asking some questions about testing. He was arguing that he couldn't have written Higher Order Perl (a magnificent book which I highly recommend you buy) with test-driven development (TDD). If I recall his comments correctly (and my apologies if I misrepresent him), he was exploring many new ideas, playing around with interfaces, and basically didn't know what the final code was going to look like until it was the final code. In short, he felt TDD would have been an obstacle to his exploratory programming, as he would have had to continually rewrite the tests to address his evolving code. I tried to explain alternate strategies for dealing with this, such as deleting the code and writing tests as you add it back in. In fact, when I created CPAN distributions of three of the HOP modules, I did find a couple of bugs through my testing. Regardless, I was hard-pressed to rebut his arguments.

Now that's not really a terribly, terribly heretical idea. Many people realize that exploratory programming and TDD don't always play well together. James Shore has a great blog post about Voluntary Technical Debt and how this helped them launch CardMeeting. It's the same story: tight deadline, new idea, playing with concepts.

But that's not quite what I have in mind when I say "I don't do TDD". In reality, sometimes I do TDD, but not very often. I've tried TDD. I've written a test, added a stub method, written another test, returned a dummy object, written another test ... and so on.

Two words: tee dious (sic).

At work, we get a fairly substantial set of requirements for each task. We're a core application which many other BBC projects rely on for their programme metadata. When we get requirements, they're generally fleshed out enough that when we think through the problem, we have a decent handle on what needs to be done, even if the exact implementation isn't nailed down perfectly.

When I get a new task, I usually start by reading the tests for the code I'm working on and I often write quite a few tests first and then write the code for it. This isn't "pure" TDD in the minds of many people as I'm not writing a single test, then code, then another test, and then code, ad nauseam. However, it's close enough to TDD in my book.

That's not the only way I write code, though. Sometimes we have a complex case where I want to see what's going on in our interface and I'll load some fixture data, fire up a browser and start exploring our REST interface. Then I'll write some code and verify that our title=Doctor Who query parameters are returning correct results. Then I'll write the tests.
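
To make that concrete, here's a rough sketch of the kind of test that gets written afterwards. The host, port, path, and expected content below are invented for illustration; the real tests obviously look different, but the shape is the same.

# Integration-style check of a REST query, written after exploring by hand.
# Everything here (host, port, path, fields) is made up for the example.
use strict;
use warnings;
use Test::More tests => 2;
use Test::WWW::Mechanize;

my $mech = Test::WWW::Mechanize->new;
$mech->get_ok(
    'http://localhost:3000/programmes?title=Doctor%20Who',
    'title query returns a successful response',
);
$mech->content_contains( 'Doctor Who', 'results mention the requested title' );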

To some, this would be heresy. You must always write the tests first, right? My reply: can you provide me with data to back that up?

I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?

I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.

Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
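
For the curious, here's a minimal sketch -- emphatically not Class::Sniff's actual implementation -- of what mechanical "long method" detection can look like: deparse each sub in a package and count the lines. The 20-line threshold is just my rule of thumb from above.

use strict;
use warnings;
use B::Deparse;

my $MAX_LINES = 20;    # arbitrary threshold for illustration

# Return a list of "Package::sub (N lines)" strings for subs whose deparsed
# body runs longer than $MAX_LINES lines.
sub long_methods_in {
    my ($package) = @_;
    my $deparse = B::Deparse->new;
    my @long;
    no strict 'refs';
    for my $name ( sort keys %{"${package}::"} ) {
        next if $name =~ /::$/;                                      # skip nested packages
        my $code  = *{"${package}::${name}"}{CODE} or next;          # subs only
        my $body  = eval { $deparse->coderef2text($code) } or next;  # skip XSUBs etc.
        my $lines = () = $body =~ /\n/g;
        push @long, "${package}::${name} ($lines lines)" if $lines > $MAX_LINES;
    }
    return @long;
}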

But what does this have to do with TDD? One problem I have with the testing world is that many "best practices" are backed up with anecdotes ("when I write my tests first ..."). Plenty of questionable assertions (ha!) are made about tests and some of these assertions are just plain wrong. For example, tests are not documentation. Tests show what the code does, not what it's supposed to do. More importantly, they don't show why your code is doing stuff and if the business has a different idea of what your code should do, your passing tests might just plain be wrong.
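
Here's a contrived example of what I mean. This test passes and faithfully records what the code does, but if the business actually wants a 10% discount, both the code and its "documentation" are wrong (the names and numbers are invented):

use strict;
use warnings;
use Test::More tests => 1;

# The business rule is supposed to be "10% off", but the code takes a flat 5.
sub discounted_price { my ($price) = @_; return $price - 5 }

is discounted_price(100), 95, 'discount is applied';   # passes, proves nothing about intent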

I also don't write many unit tests unless I'm trying to isolate a particular bug or I have code paths which are difficult to demonstrate with integration tests. I prefer integration tests as they demonstrate that various bits of my code play well together. I've long found, anecdotally, that integration tests make it easier to stumble across bugs -- though I admit that they're then harder to track down. Again, eschewing unit tests is heresy to many test advocates, but in my experience, it works fairly well.

I've also long suspected that TDD can prove to be a stumbling block, but I've advocated it as a way of better understanding what your interface should look like. I've really not spoken too much about my objections to it because, quite frankly, the opinion of the testing world seems to be dead set against me and who am I to argue with the collective wisdom of so many? (I usually get bitten pretty hard when I do so. This is why I want a blog where I can delete trollish or rude comments while keeping reasonable comments from those who disagree with me.)

The problem with all of these opinions is that I rarely see hard information about them. I don't see hard numbers. I don't see graphs and statistics and circles and arrows and a paragraph on the back of each one (with a tip 'o the keyboard to Arlo Guthrie). So you can imagine my delight when I read a blog post about research supporting the effectiveness of TDD. This is the meat I want and here's a snippet from the actual research paper:

We found that test-first students on average wrote more tests and, in turn, students who wrote more tests tended to be more productive. We also observed that the minimum quality increased linearly with the number of programmer tests, independent of the development strategy employed.

More productive? Minimum quality increased? That sounds great and now we have at least one study to back this up. Of course, more studies should be done, but this is a great start.

Um, or maybe it's not a great start. Jacob Proffitt has a great blog post where he analyzes the results of the study. He agrees with some of the conclusions of the study, namely ...

  • The test-first students on average wrote more tests.
  • Students who wrote more tests tended to be more productive.
  • The minimum quality increased linearly with the number of tests.

But in digging further down, he found that ...

  • The control group (non-TDD or "Test Last") had higher quality in every dimension—they had higher floor, ceiling, mean, and median quality.
  • The control group produced higher quality with consistently fewer tests.
  • Quality was better correlated to number of tests for the TDD group (an interesting point of differentiation that I’m not sure the authors caught).
  • The control group’s productivity was highly predictable as a function of number of tests and had a stronger correlation than the TDD group.

In short, "test last" programmers had higher quality code. However, you should also read the comments to that blog post, including an interesting reply by one of the authors of the cited study.

So does this prove that you should be "testing last" instead of "testing first"? Of course not. This was only one study. Correlation is not causation. These were undergrads, not professional programmers. There should have been a "no testing" control group (I should note that I've become so dependent on tests that I actually write worse code when I'm trying to modify something without tests).

So for the time being, I'm quite comfortable with my testing approach. I've given up on "pure" TDD of writing one test at a time. I'll happily write a block of tests and then the code. I'm also happy to write a block of code and then the tests. I can now even cite a study to show that I'm not a complete moron, even if one study is little more than anecdotal information.

Frankly, I've been secretly embarrassed about the fact that I'm not really a TDD zealot. Also, I've found with my testing strategy that I'm not writing as many tests as others. This has also been a bit of an embarrassment for me, but I've not brought it up before. My code works well and I'm (usually) comfortable with the quality after a few iterations, but since I've joined the testing cult, my apostasy is not something I've been terribly keen to bring up.

Now if you'll excuse me, I have some more episodes of "The Prisoner" to watch.


Lies, damn lies and statistics

dagolden on 2009-03-08T12:42:58

The sample sizes were far too small to draw any reasonable conclusion.

I don't think you're going to find much "evidence" unless there are large-scale studies of complex projects (which have their own issues, as you don't have good controls). This is the reason why much of "social science" isn't really science.

Speaking personally and anecdotally, however, my hypothesis is that the reported "effectiveness" of TDD is driven largely by two factors: (a) it promotes a well-articulated description of expectations (in code) prior to writing implementation code and (b) it promotes high test coverage as a side effect of development style.

To your point about when TDD is not helpful, then: given that hypothesis, to me it's pretty clear that it won't necessarily help in any circumstance where expectations are either obvious (method call succeeds! woohoo!), expensive to test programmatically at the unit level of granularity (Test::MockTheWorld), or unknown (experimenting with something new). Your approach of writing at a higher level sounds like an attempt to address the cost of very granular testing, but I see it as still in the spirit of TDD -- just with a different cost/rigor trade-off.

The second point -- promoting coverage -- probably could only be examined outside the lab setting. When the control group students are told "write the tests afterwards", they probably do because they know it's part of the study. In the real world, though, various reasons tend to interfere with test writing, as we (anecdotally) know.

Anecdotally yours,

-- dagolden

Re:Lies, damn lies and statistics

Ovid on 2009-03-08T13:33:48

When I am writing several tests first, I do think it's sort of in the spirit of TDD, but purists might take exception. One extreme TDD exercise I read about sounded very frustrating for the participant, but the person coordinating the exercise responded in the comments that he wouldn't use the "one test, one bit of code, repeat" style every time, so I'm glad that he's not overzealous about it.

Still, I often find myself writing quite a bit of code and then coming back and writing the tests. The times I usually get bitten by this have been when the code authors go crazy on inheritance and I've accidentally overridden an inherited method. That's very frustrating to track down (in fact, I wasted a lot of time on this last week when it happened twice). I should at least write the can_ok $object, $method tests first.
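
Something like this is what I have in mind -- a sketch only, with the class and method names invented -- so the expected API is pinned down before any behavioural tests run:

use strict;
use warnings;
use Test::More tests => 1;
use My::Episode;    # hypothetical class under test

my $episode = My::Episode->new;
can_ok $episode, qw( title synopsis broadcast_date );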

On that latter note, perhaps it would be nice if we could mark methods as "final" and throw an exception if they're overridden. Or perhaps mark a method as overriding something. Hmm ... how would I implement that? I see problems with that approach because sometimes you really do need to override something. I wonder about this?

use Attribute::Override;

use parent 'Some::Class';

sub foo : override { ... } # fails if it doesn't override
sub bar { ... }            # fails if it does override

The problem with this would be that traversing the inheritance hierarchy can be slow, buggy (AUTOLOAD, anyone?), and thus not suitable for production, but perhaps it could only trigger if $ENV{HARNESS_ACTIVE} is true?

It also would serve as very handy documentation because you could see at a glance if you've overridden something or not.
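
Attribute::Override doesn't exist, but the ":override" half could probably be prototyped with Attribute::Handlers. Below is a rough sketch under a few assumptions: the class names are invented, all-lowercase attribute names are reserved by Perl so the sketch spells it :Override, and (as suggested above) the check only fires when HARNESS_ACTIVE is set.

package My::Base;                      # invented stand-in for "Some::Class"
use strict;
use warnings;
use Attribute::Handlers;
use mro;                               # mro::get_linear_isa (Perl 5.10+, or MRO::Compat)
use Carp ();

# Handlers declared with :ATTR are inherited, so any subclass of My::Base
# can mark subs with :Override. The check runs in the CHECK phase, after
# @ISA has been set up at compile time.
sub Override : ATTR(CODE,CHECK) {
    my ( $package, $symbol ) = @_;
    return unless $ENV{HARNESS_ACTIVE};     # only pay the cost under the test harness
    return unless ref $symbol eq 'GLOB';    # skip anonymous subs
    my $name      = *{$symbol}{NAME};
    my @ancestors = grep { $_ ne $package } @{ mro::get_linear_isa($package) };
    Carp::croak("${package}::$name is marked :Override but overrides nothing")
        unless grep { $_->can($name) } @ancestors;
}

sub save { }                           # something for a subclass to override

package My::Child;
use parent -norequire, 'My::Base';

sub save   : Override { }              # fine: My::Base::save exists
sub rescue : Override { }              # dies under the harness: nothing to override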

Re:Lies, damn lies and statistics

dagolden on 2009-03-08T14:25:14

On the "purists" comment -- it could be the normal difference between the teaching environment and the real world. In the teaching environment, the importance of "proper" process tends to get exaggerated. In the real world, with experience, we take shortcuts.

On API testing -- I've often done "can_ok" either for a class or else for "main" to confirm automatic exports. What I've never really done (well) is confirm that no other methods exist. I might need to look into Class::Sniff or related techniques for that.
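
A rough sketch of that check, with a stub class inlined so the example runs standalone (all names are invented; imported helper subs would also show up unless they're namespace-cleaned or added to the allowed list):

use strict;
use warnings;
use Test::More tests => 1;

{   # stand-in for the real class under test
    package My::Class;
    sub new   { bless {}, shift }
    sub title { $_[0]{title} }
    sub save  { 1 }
    sub _build_storage { }    # private by convention; ignored below
}

my $class   = 'My::Class';
my %allowed = map { $_ => 1 } qw( new title save );

no strict 'refs';
my @unexpected = grep { !$allowed{$_} && !/^_/ && defined &{"${class}::$_"} }
                 keys %{"${class}::"};

ok !@unexpected, "$class exposes only its documented methods"
    or diag "unexpected subs: @unexpected";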

-- dagolden

Re:Lies, damn lies and statistics

jdv on 2009-03-08T16:00:22

Your idea for detecting accidental overrides seems overly complex.

How about just comparing what is meant to be overridden to what Class::Sniff->overridden says has been?

Re:Lies, damn lies and statistics

Ovid on 2009-03-08T16:41:25

Class::Sniff is too heavyweight for this. It also captures code at a snapshot in time. It doesn't tell me if the method cache is invalidated (MRO::Compat will let me do this with mro::get_pkg_gen).

TDD group

sigzero on 2009-03-08T17:31:55

I wonder if the TDD group forgot the final phase (refactoring)? You create your tests, create your code to pass those tests, and then you refactor the passing code into best practices. If you forget that last part, I can see how you would lag on code quality.

Clarification

btilly on 2009-03-08T22:40:50

My previous comments on method length were not advocating longer methods. I generally prefer shorter methods myself. But there is a tension between what I believe works and the research I have seen.

The problem with the studies cited in Code Complete is that they were generally done in procedural languages before people adopted object oriented approaches. Does research on what works in procedural programming still apply to code written in object oriented or functional styles? Good question.

I don't have an answer to that. But other research has consistently found that function complexity matters more than function length. A long function that happens to include a big hunk of text is generally not a maintenance problem, while a shorter function which is a twisty maze of conditions and loops often is. One measure of complexity is cyclomatic complexity, which is basically the number of decision points plus 1. That is 1 for entering the function, and 1 for every if, while, ||, etc. The usual rule of thumb is that a complexity above about 10 is a warning sign, though it's not clear that guideline applies to object-oriented code.

However I know of no research on this topic. This is just my personal impression.
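
As a rough, hand-counted illustration of the "decision points plus 1" rule (the sub and its names are invented):

# 1 for entering the sub, plus 1 each for the unless, the for loop, the if,
# and the || below: a cyclomatic complexity of 5, even though it's short.
sub normalize_titles {
    my ($titles) = @_;
    return [] unless $titles;                    # +1
    my @clean;
    for my $title (@$titles) {                   # +1
        next if $title =~ /^\s*$/;               # +1
        push @clean, lc($title) || 'untitled';   # +1
    }
    return \@clean;
}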

On your main article, I'm religious about very little in the programming world. However to my eyes the value of testing is not that you write better code; it is that you have an easier time maintaining that code once written. In fact I would maintain that good tests and poor code are more maintainable than no tests and decent code, because with the former you know when it breaks, and you are more able to address the deficiencies in the code.

Furthermore on the research that you cited, I am hardly surprised that when you take people and ask them to develop code in an unfamiliar way, the effort of thinking about the new skill and new discipline will make them perform worse. While programming we have to juggle a lot of things in our head, and having to hold something unfamiliar in your head while you do that will reduce your ability to manage complexity. Therefore more of your mental balls are likely to drop, and that shows up as worse quality.

However I would then go further and argue that what holds true for beginners holds true, albeit to a lesser degree, for experienced programmers as well. Switching from coding to writing tests and back interrupts the flow of your thoughts. Given how valuable mental flow is, this disruption has a cost. The disruption is lessened because of the relationship between the test and the code, but it is not eliminated.

For "boring" code this overhead is a non-issue. For "interesting" code, any overhead is going to limit your abilities - it is much better in my experience to write the code in a single go, then go back and write tests to validate what you did. In general reading code you wrote half an hour ago takes far less concentration than writing that code. Plus in the process of writing tests you get the opportunity to revisit your logic and catch bugs. While I am a fan of not commenting code unnecessarily, if it was "interesting" to write, it may well be worthwhile to insert some sort of explanatory comments as you write your tests.

The one case where I absolutely am in favor of test-driven development is in bug fixing. You have code. Someone found a clear bug in it. You write a test that shows the bug in operation. You fix the bug. You see that the bug went away. There are three reasons for this. The first is that writing the test for the bug often narrows down where the bug is likely to be. The second is that it can be useful to look at what is happening in a bug in detail, in which case the test you just wrote that creates the bug situation is very useful. And the third reason is that if you do things in the reverse order there is a real possibility that you make a change that doesn't fix the bug, then write a passing test that would have passed anyways, and don't fix the bug that you thought you fixed. (This is not hypothetical. One ticket assigned to me for a company I am contracting for is about a bug that a previous programmer "fixed" in exactly this way. Since the bug is only triggered by a race condition they couldn't test the fix either way, and the only way they know it didn't work is that they are still getting errors in their log.)
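
A sketch of that "failing test first" loop in practice (the module, function, and bug below are invented): run it, watch it fail for the right reason, fix parse_duration(), then watch it pass.

use strict;
use warnings;
use Test::More tests => 1;
use My::Schedule;    # hypothetical module being fixed

# Bug report: durations over an hour were silently truncated to 59 minutes.
is(
    My::Schedule::parse_duration('01:30:00'),
    90 * 60,
    'durations longer than an hour are parsed in full',
);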

Re:Clarification

lachoy on 2009-06-24T16:33:08

FWIW, Tim Bray echoed the maintenance sentiments in this blog post, more grist for the mill: http://www.tbray.org/ongoing/When/200x/2009/06/23/TDD-Heresy

TDD not useful for HOP?

Dominus on 2009-03-11T17:53:43

For a lot of the HOP code, TDD would have been useful. My OSCON question was specifically about the "linogram" system of chapter 9.

"Linogram" was a program that was completely unlike any other program I had ever seen. When i started, I had no idea what the input language would be like, no idea how it would work, even no idea of what the program's capabilities would be. If you had asked me "will linogram be able to draw a box with an arrow coming out of the side" I would have said "I think so, but I'm not sure about the arrow". And I also didn't know how a user would ask for such a diagram, or how the program would figure out where the box and arrow went, or how big they were, or how the user would specify those things, or how the actual rendering would be done.

People at OSCON came up to me afterwards and insisted that it was possible, but it became clear that they did not understand what I was talking about. "There must be a specification," a couple of people said. "Just write the tests to check the requirements in the specification." There was an amazing disconnect here between what I was asking and what they were answering.

Often there is a specification, or I at least have some clue what the behavior of the program should be before it is written. But very often the scope of the project is almost completely undefined, and my job, as engineer, is to find some reasonable balance between scope and cost and then implement it.

If someone is still not getting my point, or thinks I should be using TDD anyway in such cases, I invite them to consider this example. I have a great idea for a new kind of word processor for writing philosophic research papers. It will not be a WYSIWYG word processor. Instead, you will draw a picture of the structure of your document and the relationships between the various arguments you intend to make. Then the word processor will let you fill in the details of each argument, checking to make sure that the relationships are as required and that the arguments are sound.

I would love to see this program, and even more I would love to see the tests for this program. Advocates of TDD-everywhere are invited to work up the test suite and send it in.

Or, if you are scratching your head and asking "what do you mean by 'relationships', exactly", then you begin to see the problem.