My Love/Hate Relationship with CPAN Testers

The Great Part

I really like the idea of a CPAN testing service where individuals volunteer their computers to test CPAN packages and those results are accumulated and shared.

The accumulated results are then tallied with others. People can use this information to help decide whether to use a package, or, when a package fails, to see whether others have had a similar problem.

Comparing CPAN Testers to Travis (which I also use on the GitHub repository), CPAN Testers covers far more OSes, OS distributions, and releases of Perl than I could ever hope to try with Travis.

And that’s the good part.

The Not-so-Good Part

While there is lots of emphasis on rating a Perl module, there is little to no effort put into detecting false blame, assessing the quality of individual CPAN testers, the management of the smokers, or the quality of the Perl releases themselves.

Yet the information to do this is already there. It is just a matter of cross-tabulating the data, analyzing it, and presenting it.

Suppose a particular smoker reports a failure for a Perl module on a particular OS, distribution, and Perl version, but several other smokers do not report that error. It could be that there is an obscure or intermittent bug in the Perl module tested, but it also might be a bug in the smoker’s environment or setup, or a bug in the Perl release.

Even if it is an obscure bug in the program that many other smokers don’t encounter, as a person assessing whether a Perl module will work on my systems, I’d like to know if a failure seems to be anomalous to that smoker or a few smokers. If it is anomalous, the Perl module will probably work for me.
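
For illustration, here is a minimal sketch of the kind of cross-check I have in mind. The report records below are invented; in practice they would come from the CPAN Testers database, and the threshold for calling a failure anomalous would need tuning.

    use strict;
    use warnings;

    # Invented sample reports: tester, platform, Perl version, and grade.
    my @reports = (
        { tester => 'a@example.org', os => 'linux', perl => '5.18.0', grade => 'PASS' },
        { tester => 'b@example.org', os => 'linux', perl => '5.18.0', grade => 'PASS' },
        { tester => 'c@example.org', os => 'linux', perl => '5.18.0', grade => 'FAIL' },
    );

    # Group reports by (OS, Perl version) and flag FAILs no one else reproduces.
    my %by_config;
    push @{ $by_config{"$_->{os} / perl $_->{perl}"} }, $_ for @reports;

    for my $config (sort keys %by_config) {
        my @fails  = grep { $_->{grade} eq 'FAIL' } @{ $by_config{$config} };
        my @passes = grep { $_->{grade} eq 'PASS' } @{ $by_config{$config} };
        if (@fails == 1 && @passes) {
            printf "possibly anomalous FAIL on %s from %s (%d other testers passed)\n",
                $config, $fails[0]{tester}, scalar @passes;
        }
    }

Run across many distributions, the same tally would also show which smokers keep turning up as the lone source of failures.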

Rating a Perl Release

Going further, there is a lot of data there to rate the overall Perl release itself.

Consider this exchange:

Me:

This Perl double free or corruption crash doesn’t look good for Perl 5.19.0. Comments?

Tester:

5.19.0 was just a devel snapshot, I wouldn’t overrate it. Current git repository has 973 commits on top of 5.19.0.

Well and good, but isn’t that failure permanently marked in the testing service as a problem with my package when it isn’t? If 5.19.0 was flaky, isn’t the decent thing to do to retract the report? Or maybe in the summary for this package the line listing 5.19.0 should note that this release was more unstable than the others?

Again, what sucks is that to me it feels like blame will likely be put on the package forever. In those cases where the report is proved faulty: well, tough, live with it. It is the hypocrisy that bothers me the most: the service attempts to be so rigorous about how a Perl module should work everywhere, but is so lax when it comes to ensuring that what it reports is really accurate.

And this gets to the question of how well or poorly the smokers are managed.

I mentioned before that if a particular smoker is the only one that reports a failure for a particular OS distro and Perl release, the smoker might be suspect. And if that happens with several packages, it further suggests that the smoker (or the set of smokers managed by one person) is at fault. It may still be true that there are legitimate bugs in all of the packages; perhaps the smoker has not-commonly-used LANG and LOCALE settings. But again, as things stand there is no way to determine that this smoker, or set of smokers managed by a single person, exhibits such characteristics.

Knowing that is helpful both to the person writing the failing package(s) and to those who might want to assess the overall failures of a particular package.

Rating the Testers and Responsiveness of Testers

There is an asymmetry in the way testing is done. Testers can choose Perl modules, but Perl module authors, short of opting out of the testing service entirely, can’t choose testers. Fixing this seems a matter of basic fairness. The premise that Perl modules will get better if they are tested and rated also applies to the testers.

I think Perl module authors should have the ability to rate the responsiveness of the testers whose reports they get (reports that arrive unsolicited, except at the too-coarse scale of opting out of everything).

Let’s say I get a report that says my Perl module fails its tests on some smoker. Unless the error message clearly shows what the problem is (and again, cross-checks to ensure validity are lacking), or unless I can reproduce the problem in an environment I control, I’m at the mercy of the person running the smoker.

As you might expect, there are some who are very, very helpful, and some whom I’ve sent email to and just don’t get responses from. Having a simple mechanism where I could +1 or -1 the tester, with the tester’s accumulated score sent along with the report, would be great. That way, if I get several reports with failures, I can pick which tester to work with first.

Given that there is no effort to keep smokers from duplicating each other’s work, in theory, if the problem really is in the Perl module rather than in the tester’s setup or the Perl version, I should get multiple reports.

Alternatives to the CPAN Testing Service?

I believe there are alternatives to the CPAN testing system. Any comments on them and how well they compare to the CPAN testing system? Is there a way to have those show up on metacpan.org or search.cpan.org?

15 Comments

I've never seen CPANTS as rating modules, or really paid any attention to reports for others' modules. I look at the logs for my own modules' test failures, try to figure out what's wrong, then decide if it's a problem worth fixing. It's great for flagging possible issues for further investigation; to "rate" a module, I look at the bug tracker, the author, and maybe cpanratings.

If you don't want public scrutiny, get out of open source :)

The reports on other modules are useful to me when I want to fix something in a distribution I didn't upload.

> Having a simple mechanism where I could +1 or -1 the tester and the testers accumulated score

Small problem: authors have many distributions, and each distribution has many versions. Every time an author uploads a new version, he starts with a clean "reputation" (except in the case where his previous uploads hung and broke the box and the author was banned by the tester).

And testers have just one "reputation".

> I think one should have the ability for Perl Module authors to rate the responsiveness of testers of those reports they get

Agreed, some testers don't really care much about the accuracy of reports.

If I get a bug on some weird environment setup, I want to know about it so I can fix it. What's a false positive to you may not be a false positive to me. And believe me, I hate my test failures as much as the next guy (/shakes fist at BingOS's OpenBSD boxes testing my XS modules).

I wouldn't mind, though, if test failures resulting from a module not being installed or finishing its tests fast enough (and hence timing out) didn't count.

> Why is this a problem?

> There is an asymmetry in the way testing is done. Testers can choose Perl modules, but Perl Module authors
> Having a simple mechanism where I could +1 or -1 the tester and the testers accumulated score
> I think one should have the ability for Perl Module authors to rate the responsiveness of testers of those reports they get

There is no asymmetry. But if we start scoring testers with +1 or -1, there will be asymmetry indeed.

Authors can score testers. Testers cannot score authors - this is asymmetry.

Symmetry IMHO would be:

1) Authors can downvote testers for false positives. Testers can downvote authors for false reports of false positives (i.e. when the author suspects the tester is broken, but in fact the author's test suite or code is broken)

2) Testers can choose Perl modules. Authors can choose testers (I think we would just need to introduce an env var like CPANTESTERS_TESTER_EMAIL, and a build script would be able to exit with N/A status if the tester is "banned"; see the sketch below)

btw, I think that currently testers can really only ban authors by sharing and using shared blacklists (i.e. there is no truly automatic way to do this).
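
Roughly what I have in mind for point 2 is something like the following Makefile.PL sketch. The CPANTESTERS_TESTER_EMAIL variable is the hypothetical one proposed above (nothing sets it today), and how the report gets graded would be part of the proposal.

    use strict;
    use warnings;
    use ExtUtils::MakeMaker;

    # Hypothetical: the smoker would identify itself via this env var.
    my %banned = map { $_ => 1 } ('unresponsive.tester@example.org');

    if ( my $tester = $ENV{CPANTESTERS_TESTER_EMAIL} ) {
        if ( $banned{ lc $tester } ) {
            # Bail out before configuring; under the proposal the reporter
            # would record this as N/A rather than FAIL.
            print "NA: author has opted out of reports from this tester\n";
            exit 0;
        }
    }

    WriteMakefile(
        NAME         => 'My::Module',            # placeholder distribution
        VERSION_FROM => 'lib/My/Module.pm',
    );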

I love CPAN Testers and try to contribute. I've been thinking about firing up some oddball box for more smoking (it's easy and fast to do yet another x86/gcc/Linux tester, but that's well covered).

Having the ability for a tester to retract results would be nice. I once tried smoking on a Windows box, which went fine until it hit the too-many-characters-on-a-command-line issue and then started spewing fail reports that had nothing to do with the modules being tested. Not only might one want to retract individual reports, but it would be nice to have a "Remove all reports for configuration ID XYZ: that platform/configuration was borked."

Having the ability to attach comments to smoke results (both by the tester and module owner) seems useful, but I'm not sure how many people would really dig that deep.


Re: rating of testers, I'd like to see it separated out into (1) tester, and (2) configuration. #1 is purely how responsive the person is. #2 is a rating of the results that are coming from that smoker process.

Reason one for splitting: some people do smoking on multiple devices, so indicating how good the results are per configuration is useful. They may be running a 5.16-on-x86-gcc-Linux setup as well as a 5.19.2-on-ARM-clangdev-BSD, where the former is doing well but the latter may have some failures that turn out to be related to the dev Perl or dev compiler or whatever. This does require some way to identify a smoking configuration in addition to the tester id, and get a new configuration id once the tester has decided something has changed enough to justify it.

Reason two for splitting: Let's say I did something dumb and one of my smokers sent out bad reports. Immediately 10 people give me -1 votes. Well I may as well quit smoking forever, because I'll never get +1 votes back without pestering people to upvote me (at which point those old bad results now start showing up and I get -1'd again).

I love CPAN testers unreservedly, and am puzzled by the OP's apparent belief that he, or at least his failing modules, are stigmatized by test failures. I certainly have never felt this way.

It almost makes me wonder if the OP suffered some negative consequences from the misuse of CPAN Testers data. And it is misuse to stigmatize a distro or a person simply on the number of test failures. Failures can come from any of a number of causes, as the OP points out.

The thing is, there are many cases (or at least enough for me) where the relevance of a failure can be determined only by the potential user of the module, and not a priori. Do I as a Mac OS user give a flying obscenity about failures under Windows or MidnightBSD, even if they originate in the module? I think not.

The situation is similar with failures tied to (and even originating in) a given version of Perl -- especially development versions. If I want to use a given module and reports from certain Perl versions were blacklisted, I would have no idea whether I need to avoid those Perl versions.

As an author, I feel that the more feedback I get the better off I am. Yes, I have seen failures that I did not understand (failure to load a module which is a listed prerequisite, for example) and ones which I am pretty sure are failures of the smoker. The CPAN testers are human. That means they make mistakes, and they have bad days. But they are volunteers, and given that CPAN Testers is the killer app for module authors, I would rather chase a few false positives than risk throwing the baby out with the bath.

On the other hand, I would not be opposed to any plan that allows individual authors or modules to opt in or out of individual testers or smokers. I would like the information to be public, though, since as a module user I might favor a module with a few test failures over one with a clean slate that had opted out of a bunch of smokers.

> I love CPAN testers unreservedly, and am puzzled by the OP's apparent belief that he, or at least his failing modules, are stigmatized by test failures.

I did not see the OP's example, but here are real false positives:
http://www.nntp.perl.org/group/perl.cpan.testers.discuss/2013/06/msg3162.html
http://www.nntp.perl.org/group/perl.cpan.testers.discuss/2013/07/msg3163.html
http://www.nntp.perl.org/group/perl.cpan.testers.discuss/2013/06/msg3160.html


> If I get a bug on some weird environment setup I want to know about it so I can fix it.

An example of a weird environment could be a vendor perl (RHEL?). It differs from stock perl and can have a different set of core modules. But I have not seen examples of this on CPAN Testers.

Let me address one of your points in detail, before I get to the over-arching point:

Or maybe in the summary for this package the line listing 5.19.0 should note that this release was more unstable than the others?

This is already done, by saying that it's perl 5.$odd_number. These are dev releases of perl. Most (all?) tools for looking at test results can distinguish between release versions and dev versions of perl, including the CPAN-testers website and CPANdeps. Testing on dev releases of perl is useful, because it's a great way of both finding bugs in perl, and also of notifying authors in advance that their code might break on the next release of perl.
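
The check those tools make amounts to looking at the minor version number; something like this sketch (not the actual code any of them use):

    use strict;
    use warnings;

    # Odd minor versions (5.19.x, 5.21.x, ...) are development releases;
    # even minors (5.18.x, 5.20.x, ...) are stable.
    my ($minor) = ( "$^V" =~ /^v\d+\.(\d+)/ );
    if ( $minor % 2 ) {
        print "perl $^V is a development release\n";
    } else {
        print "perl $^V is a stable release\n";
    }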

Overall, your ideas for authors (and testers) being able to mark test reports as incorrect (and for testers to mark the author's opinion thereof as incorrect), and for authors to rate testers: both of these are good ideas. The first of them is on the to-do list, but hasn't been done yet because the volunteers who write and maintain the CPAN-testers infrastructure only have limited time and there have so far always been higher-priority tasks. The second isn't, as far as I know, on anyone's to-do list.

The solution to both of these is the same. If you want them, and you want them to happen sooner than they otherwise would, you need either to persuade someone that it's a good use of their time, or to do it yourself. The best places to start are the CPAN Testers development site and the CPAN-testers-discuss mailing list. Anyone with good ideas for improving our service is welcome there, especially if they're willing to do more than just have ideas and will actually implement them.

You later commented that "maybe the reports shouldn't be available publicly". Hell no. Reports exist both for the benefit of authors (to notify them of bugs) but also for the benefit of users (to warn them before they use buggy code).

Rocky, you complain about false negatives and that you cannot remove them from the database. However, I find myself hopelessly confused that you ignore the ability to cut a new release, or to make your module detect a tester and just give back any kind of result you want. As a tester, I've personally had to deal with modules that went "Oh, this is Windows, my module works, but I don't know how to test on Windows, so I'll issue myself a PASS here no matter what."

If you say that you're the one with not enough power here, then you're not being nearly creative enough.
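
For example, nothing stops a test file from bailing out before it can ever report a failure. A sketch, using the AUTOMATED_TESTING variable that smokers conventionally set (whether skipping like this is good manners is, of course, the point under discussion):

    # t/00-guard.t
    use strict;
    use warnings;
    use Test::More;

    # Skip the whole file when running under a smoker on Windows, so the
    # smoker never sees a failing test.  The Windows check is just an example.
    if ( $ENV{AUTOMATED_TESTING} && $^O eq 'MSWin32' ) {
        plan skip_all => 'not testing under automated smokers on Windows';
    }

    plan tests => 1;
    ok( 1, 'placeholder test' );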
