Why CPAN Testers sucks if you don't need it

dagolden on 2008-09-03T21:22:41

There have been some mega-email threads about CPAN Testers on the perl-qa mailing list that started with a question about the use of 'exit 0' in Makefile.PL.<\p>

I want to sum up a few things that I took away from the conversations and propose a series of major changes to CPAN Testers. Special thanks to an off-list (and very civil) conversation with chromatic for triggering some of these thoughts.<\p>

Type I and Type II errors<\p>

In statistics, a Type I error means a "false positive" or "false alarm". For CPAN Testers, that's a bogus FAIL report. A Type II error means a "false negative", e.g. a bogus PASS report. Often, there is a trade-off between these. If you think about spam filtering as an example, reducing the chance of spam getting through the filter (false negatives) tends to increase the odds that legitimate mail gets flagged as spam (false positives). <\p>

Generally, those involved in CPAN Testers have taken the view that it's better to have a false positives (false alarms) than false negatives (a bogus PASS report). Moreover, we've tended to believe -- without any real analysis -- that the false positive *ratio* (false FAILs divided by all FAILs) is low. <\p>

But I've never heard a single complaint about a bogus PASS report and I hear a lot of complaints about bogus FAILS, so it's reasonable to think that we've got the tradeoff wrong. Moreover, I think the downside to false positives is actually higher than for false negatives if we believe that CPAN Testers is primarily a tool to help authors improve quality rather than a tool to give users a guarantee about how distributions work on any given platform.<\p>

False positive ratios by author

Even if the aggregate false positive ratio is low, individual CPAN authors can experience extraordinarily high false positive ratios. What I suddenly realized is that the higher the quality of an author's distributions, the higher the false positive ratio.<\p>

Consider a "low quality" author -- one who is prone to portability errors, missing dependencies and so on. Most of the FAIL reports are legitimate problems with the distribution.<\p>

Now consider a "high quality" author -- one who is careful to write portable code, well-specified dependencies and so on. For this author, most of the FAIL reports only come when a tester has a broken or misconfigured toolchain The false positive ratio will approach 100%.<\p>

In other words, the *reward* that CPAN Testers has for high quality is increased annoyance from false FAIL reports with little benefit. <\p>

Repetition is desensitizing

From a statistical perspective, having lots of CPAN Testers reports for a distribution even on a common platform helps improve confidence in the aggregate result. Put differently, it helps weed out "outlier" reports from a tester who happens to have a broken toolchain.<\p>

However, from author's perspective, if a report is legitimate (and assuming they care), they really only need to hear it once. Having more and more testers sending the same FAIL report on platform X is overkill and gives yet more encouragement for authors to tune out.<\p>

So the more successful CPAN Testers is in attracting new testers, the more duplicate FAIL reports authors are likely to receive, which makes them less likely to pay attention to them.

When is a FAIL not a FAIL?

There are legitimate reasons that distributions could be broken such that they fail during PL or make in ways that are not the fault of the tester's toolchain, so it still seems like valuable information to know when distributions can't build as well as when they don't pass tests. So we should report on this and not just skip reporting. On the other hand, most of the false positives that provoke complaint are toolchain issues during PL or make/Build. <\p>

Right now there is no easy way to distinguish the phase of a FAIL report from the subject of an email. Removing PL and make/Build failures from the FAIL category would immediately eliminate a major source of false positives in the FAIL category and decrease the aggregate false positive ratio in the FAIL category. Though, as I've shown, while this may decrease the incidence of false positives for high quality authors, the false positive ratio is likely to remain high.<\p>

It almost doesn't matter whether we reclassify these as UNKNOWN or invent new grades. Either way partitions the FAIL space in a way that makes it easier for authors to focus on which ever part of the PL/make/test cycle they care about.<\p>

What we can fix now and what we can't

Some of these issues can be addressed fairly quickly.<\p>

First, we can lower our collective tolerance of false positives -- for example, stop telling authors to just ignore bogus reports if they don't like it and find ways to filter them. We have several places to do this -- just in the last day we've confirmed that the latest CPANPLUS dev version doesn't generate Makefile.PL's and some testers have upgraded. BinGOs has just put out CPANPLUS::YACSmoke 0.04 that filters out these cases anyway if testers aren't on the bleeding edge of CPANPLUS. We now need to push testers to upgrade. As we find new false positives, we need to find new ways to detect and suppress them.<\p>

Second, we can reclassify PL/make/Build fails to UNKNOWN. This won't break any of the existing reporting infrastructure the way that adding new grades would. I can make this change in CPAN::Reporter in a matter of minutes and it probably wouldn't be hard to do the same in CPANPLUS. Then we need another round of pushing testers to upgrade their tools. We could also take a decision as to whether UNKNOWN reports should be copied to authors by default or just sent to the mailing list.<\p>

However, as long as the CPAN Testers system has individual testers emailing authors, there is little we can do to address the problem of repetition. One option is to remove that feature from Test::Reporter and reports will only go to the central list. With the introduction of an RSS feed (even if not yet optimal), authors will have a way to monitor reports. And from that central source, work can be done to identify duplicative reports and start screening them out of notifications. <\p>

Once that is more or less reliable, we could restart email notifications from that central source if people felt that nagging is critical to improve quality. Personally, I'm coming around to the idea that it's not the right way to go culturally for the community. We should encourage people to use these tools, sign up for RSS or email alerts, whatever, because they think that quality is important. If the current nagging approach is alienating significant numbers of perl-qa members, how can we possibly expect that it's having a positive influence on everyone else? <\p>

Some of these proposal would be easier in CPAN Testers 2.0, which will provide reports as structured data instead of email text, but if "exit 0" is a straw that is breaking the Perl camel's back now, then we can't ignore 1.0 to work on 2.0 as I'm not sure anyone will care anymore by the time it's done.<\p>

What we can't do easily is get the testers community to upgrade to newer versions of the tools. That is still going to be a matter of announcements and proselytizing and so on. But I think we can make a good case for it, and if we can get the top 10 or so testers to upgrade across all their testing machines then I think we'll make a huge dent in the false positives that are undermining support for CPAN Testers as a tool for Perl software quality.<\p>

I'm interested in feedback on these ideas. In particular, I'm now convinced that the "success" of CPAN Testers now prompts the need to move PL/make fails to UNKNOWN and to discontinue copying authors by individual testers. I'm open to counter-arguments, but they'll need to convince me of a better long-run solution to the problems I identified.<\p>

-- dagolden<\p>


FAIL feed?

autarch on 2008-09-03T22:12:08

One thing that'd be immensely helpful to me is an RSS feed containing just FAILs. As it stands I regularly get 20-80 reports in my RSS reader, most of which are not interesting (PASS, NA, etc). I just want the failure reports!

Re:FAIL feed?

barbie on 2008-09-04T00:29:11

It's already in the development site, thanks to David E Wheeler :) However, right now you'll have to wait a little longer for it, as I feel now is no longer the right time.

Re:FAIL feed?

acme on 2008-09-04T07:51:06

Ignore the whiners and release anyway ;-)

Also ...

autarch on 2008-09-03T22:16:38

Yes, only getting failure reports for actual test failures would be great. I can live with failures because a broken toolchain didn't install the deps properly. But getting fail reports because the path was too long because they're using Red Hat, or whatever, is just damn annoying.

And one more thing!

autarch on 2008-09-03T22:18:19

Sheesh, I should think and just post once.

Anyway, as an author who does have mostly high quality distros (I think), there are still cases where CPAN testers is really useful.

Recently I shepherded the new Class::MOP and Moose releases. These included a lot of big changes, so we did dev releases specifically to get CPAN tester reports back. These reports were very helpful and we cleaned up quite a few bugs that way. This specific use case was great, and not something I'd really done before.

Thanks!

jplindstrom on 2008-09-03T22:45:13

I just wanted to take the opportunity to thank all the people who help me find bugs in environments I don't have access to.

You're doing a great job!

Thanks &amp; feedback

polettix on 2008-09-04T10:30:37

I'd like to join jplindstrom in thanking all CPANTesters (both frontend and backend) for the good service.

One thing that surprised me a little was this "message avalanche" effect, which I thought was an already-dealt-with problem. In particular, I notice that I only get one single FAIL message for each module that I upload to PAUSE that has problems, even when there are more of them. Am I totally misunderstanding this point in your post, or have I been gifted by some obscure bug^Wfeature?

Re:Thanks &amp; feedback

dagolden on 2008-09-04T13:21:51

Some CPAN Testers have configured their machines not to CC authors at all or not to CC authors for certain grades or PL/make/test phases. I suspect your experience is serendipity, not fundamental design.

-- dagolden

Useful, but what is the bigger picture?

mr_bean on 2008-09-13T02:10:59

What is the system of which CPAN testers is a part?

I don't think you are standing far enough back.

(Fred Brooks in The Mythical Man-month on documentation: Start off at 50,000 feet and come in really slowly.)

What is the function of CPAN testers?

One size is not going to fit all. For example, Ben Bullock, the person said to have ruined the CPAN ratings feed, may be the kind of developer who needs to be hit with a heavy stick.

http://use.perl.org/user/ology/journal/37363

Admittedly, that can be worked out in the future.
The fact there is controversy suggests options are needed.

Anyway, Kudos to the CPAN Testers!