Identifying CPAN distributions you could help out with
The other day Andy Lester posed a question: where can someone find Perl modules to contribute to? My first answer was to look at the dists with the most bugs. I kept thinking about it, wondering how you could identify a module that is ripe for help.
This post outlines my next idea, and the top 20 dists based on my first implementation.
If you're going to contribute, it's most motivating to do something that's going to be used. So the idea is to look for dists that are still getting bugs raised against them, but that haven't seen a release for a good while.
Dist | Released | Bug days | Gap | Score
---|---|---|---|---
Perl6-Parameters | 2002-08-17 | 2 | 3769 | 1884.50
Crypt-Primes | 2003-01-16 | 2 | 3617 | 1808.50
TAP-Formatter-HTML | 2010-03-21 | 1 | 997 | 997.00
SOAP-WSDL | 2010-03-28 | 1 | 990 | 990.00
Acme-Brainfuck | 2004-04-06 | 4 | 3169 | 792.25
POE-Component-CPAN-SQLite-Info | 2008-10-14 | 2 | 1519 | 759.50
IO-Digest | 2004-09-11 | 4 | 3011 | 752.75
Net-CIDR-Set | 2009-01-30 | 2 | 1411 | 705.50
IO-Async-SSL | 2011-02-28 | 1 | 653 | 653.00
Proc-ParallelLoop | 2003-03-13 | 7 | 3556 | 508.00
Log-SelfHistory | 2010-08-07 | 2 | 857 | 428.50
CGI-Application-Plugin-LinkIntegrity | 2006-05-18 | 6 | 2395 | 399.17
Catalyst-Authentication-Store-LDAP | 2010-10-05 | 2 | 798 | 399.00
XiaoI | 2008-08-18 | 4 | 1574 | 393.50
IO-Plumbing | 2008-08-21 | 4 | 1571 | 392.75
Data-Transform-SAXBuilder | 2008-08-27 | 4 | 1565 | 391.25
PITA-POE-SupportServer | 2008-09-02 | 4 | 1559 | 389.75
Config-Tiny | 2011-03-24 | 2 | 628 | 314.00
Pod-Spell | 2001-10-27 | 13 | 4052 | 311.69
Text-Identify-BoilerPlate | 2005-08-22 | 9 | 2661 | 295.67
- Released is the date of the last release of the dist.
- Bug days is the number of days since the most recent still-open bug was raised.
- Gap is the number of days between the last release and that most recent bug.
- Score is Gap / Bug days.
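To make the arithmetic concrete, here's a minimal Python sketch of the score as defined in the list above. The function name and the clamp for a bug raised today are my own assumptions for illustration, not part of the tool that produced the table.

```python
from datetime import date, timedelta


def score(last_release: date, newest_open_bug: date, today: date) -> float:
    """Score = Gap / Bug days, following the definitions in the list above."""
    # Gap: days between the last release and the most recent open bug.
    gap = (newest_open_bug - last_release).days
    # Bug days: days since the most recent open bug was raised.
    bug_days = (today - newest_open_bug).days
    # Assumption: clamp to 1 so a bug raised today doesn't divide by zero.
    return gap / max(bug_days, 1)


# Reproduce the Perl6-Parameters row (Gap 3769, Bug days 2). The bug date and
# "today" are back-calculated from those two numbers, not taken from RT.
released = date(2002, 8, 17)
newest_bug = released + timedelta(days=3769)
today = newest_bug + timedelta(days=2)
print(score(released, newest_bug, today))  # 1884.5
```

The smaller the bug-days figure, the more the score is inflated, which is why a dist with a bug raised yesterday can jump to the top.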
Here's the top 20 for a slightly different measure. In the table below, gap is the number of days between the most recently reported still-open bug and the oldest still-open bug. If there's only one bug, then gap will be 1, so the dist won't appear here.
Dist | Released | Bug days | Gap | Score
---|---|---|---|---
SOAP-WSDL | 2010-03-28 | 1 | 1596 | 1596.00
TAP-Formatter-HTML | 2010-03-21 | 1 | 1500 | 1500.00
Math-BigInt | 2011-09-04 | 2 | 2568 | 1284.00
Params-Util | 2012-03-11 | 1 | 1174 | 1174.00
RT-Client-REST | 2012-01-09 | 2 | 2251 | 1125.50
DBI | 2012-11-20 | 2 | 2148 | 1074.00
Crypt-Primes | 2003-01-16 | 2 | 2147 | 1073.50
IPC-Run | 2012-08-30 | 4 | 3937 | 984.25
Path-Class | 2012-12-09 | 3 | 2752 | 917.33
Net-DNS | 2012-12-12 | 3 | 2593 | 864.33
Filesys-SmbClient | 2012-12-04 | 4 | 3315 | 828.75
Config-Tiny | 2011-03-24 | 2 | 1635 | 817.50
SQL-Translator | 2012-10-09 | 4 | 3261 | 815.25
Authen-Captcha | 2012-08-14 | 4 | 3051 | 762.75
libwww-perl | 2012-02-18 | 5 | 3634 | 726.80
Storable | 2012-09-11 | 6 | 3892 | 648.67
SQL-Interp | 2012-02-08 | 3 | 1824 | 608.00
Net-CIDR-Set | 2009-01-30 | 2 | 1187 | 593.50
Perl-Tidy | 2012-12-09 | 3 | 1771 | 590.33
HTML-TagCloud | 2011-06-18 | 5 | 2230 | 446.00
Note: these lists really identify modules that are potentially worthwhile candidates for taking over (getting co-maint), rather than modules where you could contribute without having to take over maintenance. That's a separate list!
Some thoughts for improving this:
- It's skewed too heavily towards bugs raised within the last few days. Maybe instead of dividing by bug days, I should divide by log10 of bug days, to smooth things out.
- A dist may have been bug-free until yesterday, so it hasn't needed any releases. I could look at the number of bugs reported since the last release that are still outstanding.
- Even further, I could look at the elapsed time between the oldest open bug and the most recently reported open bug.
- Factor in the number of dists that are dependent on each dist, and weight the score based on this.
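Here's a rough Python sketch of how the log10 smoothing and the dependency weighting might combine. None of this is implemented yet. The `1 +` offsets and the shape of the weight are assumptions I've picked purely for illustration. A plain log10 of bug days would be zero for a bug raised one day ago, so some offset is needed.

```python
import math


def smoothed_score(gap_days: int, bug_days: int, reverse_deps: int = 0) -> float:
    """Gap divided by a log-smoothed bug-days term, weighted by dependents."""
    # Assumption: log10(1 + bug_days) rather than log10(bug_days), because
    # log10(1) == 0 would make a one-day-old bug divide by zero.
    denominator = math.log10(1 + max(bug_days, 1))
    # Assumption: a hypothetical weight that is 1.0 for a dist with no
    # dependents and grows slowly as the number of dependents grows.
    weight = 1 + math.log10(1 + reverse_deps)
    return weight * gap_days / denominator


# With the raw score, a bug raised 1 day ago scores twice as high as one
# raised 2 days ago; with the smoothed version the ratio is smaller.
ratio = smoothed_score(1000, 1) / smoothed_score(1000, 2)
print(ratio < 2)  # True
```

Whether a logarithmic weight is the right way to fold in reverse dependencies is an open question. A dist with thousands of dependents arguably deserves more than a handful of extra points.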
What else should be factored in? I'll play a bit more, then put a longer sortable list online. I love the fact that I could get hold of all the metadata needed to create this. Now I feel like I should find a bug to fix!
What are the odds that my help would actually be merged though? I've submitted patches to popular Perl modules before, but I'm not sure they ever got looked at, much less merged.
Perhaps look at which modules are still getting commits as well?
Hmm, I just submitted two bugs for Crypt::Primes, which probably made it go to the top. One would be pretty easy to fix, the other is a critical breakage and would require more thought, but isn't huge.
But I already have a module that duplicates most of the functionality, so spending time fixing that module seems not the best use of time. I found the issues in Crypt::Primes while testing mine, and reading the paper they're based on and wondering why the code didn't match.
OK, so you shamed me into submitting patches. Well done! :) Based on the comments on Crypt::RSA I'm not sure if the author is still around however.
Overall, I like the idea of something like this list. I like most of your additional suggestions. It would be nice to weight based on "importance", which is being somewhat taken into account by the reverse dependencies. It'd also be nice to give negative weight to modules that have unanswered patches, though how one would programmatically get that I'm not sure. After all, if someone has already solved an issue but it's being ignored, it's probably a waste of time submitting patches for other issues.
preaction:
That differs from maintainer to maintainer, no? If you want some level of confidence before you invest yourself I’d say check the issue tracker of the module for an idea about how responsive the author is. (NB.: recent activity should be weighted a lot higher than past phases of inactivity.)
A more interesting way of scoring bugs by importance would be to take into account the number of other modules that depend on a given one: direct dependents as a first approximation, or even all dependents overall. From this point of view, CGI is probably one of the most critical ones, and, well, Acme::Brainfuck not so much.
It's not clear how to submit patches to CGI, however, other than just email them. Where's it developed?
If the author doesn't respond to your patches, then you could aim to take over the module. Fork it on github, and fold your patches in. Put a pointer to your repo in the RT comments. Two weeks from now, email the author again, and say that you'd be happy to take over the module and release fixes. Then in a month's time when you apply for co-maint, you can point the PAUSE admins to your fork and patches, and your attempts to engage the author. Cc the author on that email as well.
With regard to CGI, the latest release isn't from 2001, but from a month ago.
MetaCPAN shows the github repo for it, so contributing to CGI is easy.
I suspect that's down to my flaky working out of the dist name. ANDK pointed out CPAN::DistnameInfo, so I'll have a look at that tonight.
@Neil: I'm interested in the tool you probably wrote to build this list. Share it!
@Neil: There's an unfortunate situation—not yet resolved—on rt.cpan.org with the existence of both a CGI.pm queue and a CGI queue. This originally stemmed from differences in distname parsing. The maintainers currently direct you to CGI.pm, but other sources direct to CGI. The CGI.pm queue should be merged into the CGI queue, and it's on my list of things to tackle in rt.cpan.org. That may be one source of confusion in the data I pointed you at originally.
Ah thanks Thomas. I have multiple issues caused by the fact that the distname for the CGI module is 'CGI.pm'. I've submitted a bug on that, but will work around it for now. And as you say, there's extra confusion caused by the fact that there are RT queues for both CGI and CGI.pm :-)
I had some bugs related to parsing of partial release paths (e.g. P/PH/PHISH/CGI-XMLApplication_0.9.3.tar.gz). It turns out CPAN::DistnameInfo doesn't handle a number of cases, so I hacked some more on mine, and will work on some fixes for CPAN::DistnameInfo.
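To illustrate why paths like that one are awkward, here's a minimal Python sketch of splitting a release path into a dist name and a version. It is not how CPAN::DistnameInfo works, nor the code used for the lists above. The function name, the list of archive extensions and the regular expressions are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Assumption: a small set of archive extensions; real CPAN uploads use more.
_ARCHIVE_EXT = re.compile(r"\.(?:tar\.gz|tgz|tar\.bz2|zip)$")
# Dist name, then '-' or '_', then a version that starts with a digit
# (optionally prefixed with 'v') and runs to the end of the name.
_NAME_VERSION = re.compile(r"^(?P<dist>.+?)[-_]v?(?P<version>\d[\w.]*)$")


def split_release_path(path: str) -> Tuple[str, Optional[str]]:
    """Return (dist, version); version is None if none can be found."""
    filename = path.rsplit("/", 1)[-1]        # drop the author directories
    stem = _ARCHIVE_EXT.sub("", filename)     # drop the archive extension
    match = _NAME_VERSION.match(stem)
    if match is None:
        return stem, None
    return match.group("dist"), match.group("version")


# The underscore before the version is what trips up a naive split on '-'.
print(split_release_path("P/PH/PHISH/CGI-XMLApplication_0.9.3.tar.gz"))
# ('CGI-XMLApplication', '0.9.3')
```

With this sketch, a filename such as `CGI.pm-3.63.tar.gz` (illustrative) comes out with the dist name `CGI.pm` rather than `CGI`, which is the kind of mismatch discussed in the comments above.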
Perhaps this should be (or already has been?) formalized into a standard procedure.
This is my pragmatic process based on the official process, which I've fixed on while working on reviews.
Second request for a list based on impact, e.g. reverse deps.
Yeah, I think that'll probably be the next thing I hack in.
What code did you use to determine the number and age of outstanding bugs? I was thinking of creating a Kwalitee data point on these stats.
You can get the RT bug data as an SQLite database from https://rt.cpan.org/NoAuth/cpan/rtcpan.sqlite.gz
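For anyone who wants to experiment with that kind of data, here's a small self-contained Python sketch of an "oldest and newest open bug per dist" query. It runs against an in-memory SQLite database with a hypothetical schema. The `ticket` table, its column names, the status values treated as open, and the sample rows are all assumptions for illustration; check the real database's schema before adapting it.

```python
import sqlite3

# Assumption: one row per ticket, with ISO-format creation dates stored as text.
QUERY = """
SELECT distribution, MIN(created) AS oldest, MAX(created) AS newest
FROM ticket
WHERE status IN ('new', 'open')
GROUP BY distribution
ORDER BY distribution
"""


def oldest_and_newest_open_bugs(conn: sqlite3.Connection) -> list:
    """Return (distribution, oldest, newest) for dists with open tickets."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with made-up rows, purely for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ticket ("
        "id INTEGER PRIMARY KEY, distribution TEXT, status TEXT, created TEXT)"
    )
    conn.executemany(
        "INSERT INTO ticket (distribution, status, created) VALUES (?, ?, ?)",
        [
            ("Example-A", "resolved", "2009-03-01"),  # ignored: not open
            ("Example-A", "open", "2010-01-05"),
            ("Example-A", "new", "2012-12-01"),
            ("Example-B", "open", "2011-06-30"),
            ("Example-C", "resolved", "2012-01-01"),  # dist has no open bugs
        ],
    )
    return conn


for row in oldest_and_newest_open_bugs(demo_connection()):
    print(row)
```

Because the dates are stored as ISO-format text, `MIN` and `MAX` sort them correctly as strings. A different date format in the real data would need converting first.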
My query for getting the oldest and newest unresolved bug per dist is