The Four Major Problems with CPAN

First of all, I love CPAN. CPAN was the first of its kind to provide an extensive and official library of modules that support the language itself. CPAN is as much a part of Perl as regular expressions are. Without CPAN, Perl would never be as versatile or useful as it is now.

But CPAN is also very old. It's been around since late 1995, almost 20 years now. As such, it has grown to house many thousands of modules, 119,124 to be exact (as of today). Many would consider that a virtue, as you can find almost anything you want on CPAN. However, it betrays an underlying problem with CPAN that only increases with age:

A hundred thousand modules is too much stuff to sift through.

Look at a Cathedral-based setup like .NET. The .NET Framework provides a LOT of utilities and libraries for a great many things that you want to do, and it can do this in a 50MB file. Good luck trying to compose a similar library like that for CPAN. You could try, but somewhere down the road, you're going to end up with many decision points for which module to use for protocol or task X.

For example, let's look at something basic and simple like opening an IP socket connection. A search on MetaCPAN reveals:

Others that aren't immediately obvious are:

Errr, I just want to open a TCP connection somewhere. Oh, and since ARIN is telling everybody that we're almost out of IPv4 addresses, I'd like to have something that is IPv6 compatible. Which one should I use?

  • IO::Socket is the old standby, but it's not the right module for Internet sockets.
  • IO::Socket::INET is the right one for IP sockets, but it's not IPv6 compatible.
  • Socket looks like it could be useful, but it's not OO and way too low level for most people's needs.
  • Socket6... shouldn't this be retired in favor of Socket?
  • IO::Socket::INET6 is OO, functions like IO::Socket, and is compatible with IPv6. But, it's the "wrong" module. Why? Well, it's been refactored into another module with a different author.
  • IO::Socket::IP is the "right" module. Somehow, you're supposed to just know this.
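For reference, the "I just want to open a TCP connection" case looks roughly like this with IO::Socket::IP. This is a minimal sketch that follows the constructor pattern from that module's synopsis; the helper name open_tcp is hypothetical and exists only for illustration. The module chooses IPv4 or IPv6 based on what the resolver returns for the host.

```perl
use strict;
use warnings;
use IO::Socket::IP;
use Socket qw(SOCK_STREAM);

# Hypothetical helper (not part of any CPAN module): open a TCP
# connection to $host:$port. IO::Socket::IP resolves the host and
# tries the returned addresses, whether they are IPv4 or IPv6.
sub open_tcp {
    my ($host, $port) = @_;
    my $sock = IO::Socket::IP->new(
        PeerHost => $host,
        PeerPort => $port,
        Type     => SOCK_STREAM,
    ) or die "Cannot connect to $host:$port: $@";
    return $sock;
}
```

A call such as open_tcp("www.example.com", 80) would return a connected socket handle that you can print to and read from like any other IO::Socket object, or die with the reason the connection failed.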

Most of these modules are not even packaged in the same distro. What was wrong with adding IPv6 support to IO::Socket::INET? The answer is likely pretty complicated, but some of that comes with the Bazaar-based repository model. Different people making different modules in different distros using different coding styles and different environments. If this was a Cathedral like Microsoft, the answer would be "WTF?! Put all of this s**t into one library!"

Overall, though, I like the Bazaar model, because you can get many results much faster than a Cathedral. For example, the .NET framework doesn't have anything for many protocol specific items, like SNMP or Telnet. (You typically have to pay for these from a third-party.) But, just because it's a Bazaar model doesn't mean that the weaknesses of that model should always exist, unable to be fixed or mitigated.

Thus, I'd like to identify what I consider to be the four major (specific) problems with CPAN and some potential solutions to these problems:

1. Too many modules are unmaintained; abandoned but not marked as such.

Perl veterans know this problem all too well, and I touched on the issue in the example above. Distros have owners, typically a single owner, and those owners sometimes move on to other things. Or they don't have the tuits (the round ones) to maintain the distro. Or maybe they completely switched languages and aren't interested in Perl any more.

However, the users don't know this, at least not officially. They continue to submit bugs for distros that haven't had a release in years, into an RT queue whose open tickets are 7+ years old. Yes, these are warning signs, but there's nothing official saying that the distro is "unmaintained" or "abandoned".

There is a way to take over a distro, but it's a process that takes months to complete. Furthermore, doing that requires a certain level of commitment to say "Yes, I want to own this distro and take full responsibility for its bugs and issues". Many people don't want to go that far. They just want their patch implemented, so that it JFW.

My Solution: Create an "official" orphanage (tied to GitHub) with an automated abandoned status checker. I have a more detailed plan, but it's too large to include here. I will discuss this in my next blog post.
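To make the idea of an automated status checker a little more concrete, here is a rough sketch built only on the warning signs mentioned above (stale releases, ancient open tickets). Everything in it is an assumption for illustration: the function name, the input fields, and the thresholds are hypothetical, and it only collects signals for a human to review rather than declaring anything abandoned.

```perl
use strict;
use warnings;

# Hypothetical heuristic: returns the list of warning signs found for a
# distro. The field names and thresholds are illustrative assumptions,
# not an existing PAUSE or MetaCPAN feature.
sub abandonment_signals {
    my (%d) = @_;
    my @signals;
    push @signals, 'no release in 3+ years'
        if $d{years_since_release} >= 3;
    push @signals, 'open ticket older than 7 years'
        if $d{oldest_open_ticket_years} >= 7;
    push @signals, 'no maintainer reply on recent tickets'
        if !$d{maintainer_replied_recently};
    return @signals;
}
```

A distro that trips several signals would be a candidate for a closer look. A distro that trips none may still be unmaintained, and one that trips the release-age signal may simply be finished, so the output is a starting point and not a verdict.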

2. There is not enough data on what modules are mature; which ones are the "right ones" to use.

Again, IO::Socket::INET6 vs. IO::Socket::IP is one good example. Others are:

Unless you ask the right people, or go the hard route and try several of them, you're going to have problems figuring out exactly which module you should use.

My Solution: Work on better scoring of module relevancy, maturity, etc. Search engines like MetaCPAN could be leveraged to give you more accurate and relevant results, based on a number of pieces of information from the distro. I have an initial set of scoring methods in the planning phase right now.
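As a rough illustration of what such scoring could look like, the sketch below combines a few per-distro signals into a single number between 0 and 1. The signal names, the weights, and the scaling constants are all hypothetical assumptions chosen for the example; this is not MetaCPAN's ranking and not the scoring plan mentioned above.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical maturity score in the range 0..1. All inputs and weights
# are illustrative assumptions, not an existing ranking formula.
sub maturity_score {
    my (%d) = @_;
    # Freshness decays as the last release gets older (1.0 when released today).
    my $freshness = 1 / (1 + $d{days_since_release} / 365);
    # Log-scaled so that the first few dependents or votes matter most;
    # capped at 1.0 for 1000+ reverse dependencies or 100+ votes.
    my $adoption  = min(1, log(1 + $d{reverse_deps}) / log(1001));
    my $votes     = min(1, log(1 + $d{votes}) / log(101));
    # Many open bugs pull the score down.
    my $bug_load  = 1 / (1 + $d{open_bugs} / 10);
    return 0.3 * $freshness + 0.3 * $adoption + 0.2 * $votes + 0.2 * $bug_load;
}
```

The weights are the real design decision here. A heavy freshness term, for example, penalizes modules that are simply finished and no longer need releases, so any real implementation would have to tune these against modules whose status is already known.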

3. Many modules are only used for semi-private needs.

These categories would include:

  1. Testing (Acme::Prereqs)
  2. Personalized use (Task::*, DZIL Author bundles)
  3. Training (search for "The great new")
  4. Deprecated modules (search for "deprecated")

It's a lot of cruft that could be buried (or potentially deleted), and it clutters up search results.

My Solution: Add a "distro_type" variable to CPAN::Meta::Spec, which would then be used by search engines like MetaCPAN. The 'keywords' item could be used as a stopgap, but a more official status variable should be implemented in the long run.
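Here is a small sketch of how a search engine might read such a field. Version 2 of the CPAN::Meta::Spec already allows custom keys as long as they start with x_ or X_, so the sketch assumes a hypothetical x_distro_type key and falls back to the existing keywords list as the stopgap. The key name, the set of type values, and the function name are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical classifier working on a decoded META.json structure.
# 'x_distro_type' and the type names below are illustrative assumptions;
# no such key is defined by CPAN::Meta::Spec today.
my %KNOWN_TYPE = map { $_ => 1 } qw(testing personal training deprecated);

sub distro_type {
    my ($meta) = @_;
    # Prefer the explicit (hypothetical) custom key when present.
    return $meta->{x_distro_type} if defined $meta->{x_distro_type};
    # Stopgap: look for a recognized type among the keywords.
    for my $kw (@{ $meta->{keywords} || [] }) {
        return lc $kw if $KNOWN_TYPE{ lc $kw };
    }
    # Anything unmarked is treated as an ordinary distro.
    return 'general';
}
```

A search engine could then demote or hide everything whose type is not 'general' by default, while still letting users opt in to see it.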

4. Modules cannot be renamed or deleted, even with a long-term deprecation process.

Names in CPAN are sort of like domain names, with two critical exceptions: it doesn't cost anything to take a name, and names last forever. The lucky guy who got the Net::IRC name will continue to have this name forever and ever, despite the fact that the module clearly states "DEAD SINCE 2004" in its title. (Edit: mst: and you wouldn't believe how much chasing around I did to get control of Net::IRC to add that message :) Most modules like that don't give you that kind of warning, though. So, people who are used to the Net::* namespace will think "Hey, if Net::Telnet is the Telnet module and Net::DNS is the DNS module, then I should try Net::IRC for my IRC needs".

In the world of the capitalistic domain name system, it's still somewhat of a problem, but not nearly as bad. Google.com still takes you to Google, Perl.org still takes you to the Perl Foundation, and CPAN.org takes you to CPAN. Even if you can't quite find it from a straight domain entry, Google's search engine is powerful enough to find you exactly what you are looking for (almost) every time.

(And I won't even get into the modules called ::ButActuallyWorksThisTime...)

My Solution: Implement a deprecation process that would eventually remove the module from PAUSE. This could be tied into the "distro_type" variable above. Yes, this would be voluntary, but other processes (such as the orphanage) could enhance the automated processes.

Caveat: Modules and distros are not the same thing. Would this only apply to full distros, or is there a way to remove indexing from modules?

17 Comments

I'm not arguing, but adding.


1) The "official" orphanage is the ADOPTME user, mostly because as a PAUSE admin I started putting stuff there and no one has told me to stop. No matter what you decide, authors who have disappeared aren't going to do things on their own.

I wouldn't say the distro take-over process takes "months". Weeks is more likely, and during that time the person taking over the namespace can still do all the work and even upload unauthorized dists. So far, it's the person who wants to take over who needs to do the legwork to track down the original author and remind PAUSE admins that a reasonable response time has passed.

However, there's no good way to automate detecting abandonment. Some modules are just done, as MJD noted in slides 9-11 of "Twelve Views of MJD". There are certainly lots of heuristics one could apply to find candidates, but I wouldn't want to find out one day that some automated process has taken away one of my modules.

2) If I had all the time in the world, I'd become an actual CPAN librarian (although Jarkko has that title). There are several MLSes using Perl already, and I've thought about going through the steps to become one myself.

3) With all the semi-private stuff, we get the good stuff. Sometimes we just need the search results to have better page ranking. I think this is really the same thing as 2).

4a) We have the technology to delete modules, but socially we don't. Any author can mark their module as "can be deleted from database". Even though some modules are dead, we have a great process that revives them. It's a little onerous because as PAUSE admins we balance the interests of the original author and the person who wants to take over, but we do it quite often.

4b) I've also thought, in my theoretical completely new and vaporous CPAN client, you'd see that sort of information in the installation plan. However, seeing it means almost nothing because that doesn't change the dependency.

4c) So, if you care about Net::IRC, why isn't it the module for IRC? We can make that happen. Bump to a major version and change the interface. It's happened to other modules. :)

I have another comment that deserves its own entry.

I am constantly and consistently surprised about how many people do not first try to work with existing projects. One of the general weaknesses of equal access is that people don't have to do the scary work of talking to other people. I think there should be a lot more collaboration, and for many reasons I won't list, things such as CPAN do not encourage it although they don't do much to get in the way of it, socially.

> File::Spec vs. Path::Class

I'd actually argue the answer is now "Path::Tiny", but only people who are glued to irc and/or blogs every day would know this, as it's very, very new.

> @ISA vs. base vs. parent (hint: It's base, but the docs won't tell you that.)

It's actually parent (unless you really really like the fields pragma), although there isn't 100% consensus on this.

I totally agree there's a problem. I've only started being able to navigate through the CPAN maze clearly by being on irc every day. Those people who don't have time for that, or who don't even know that this is the new way of staying up to speed on developments, are basically screwed. (see also: http://modernperlbooks.com/mt/2012/06/perl-without-irc.html )

Oh yes, it's also come up a few times that we need a way for a CPAN author to say "I don't want to abandon this dist entirely, but I do need help with it." Yanick has proposed putting this in metadata - e.g. see https://metacpan.org/module/Dist::Zilla::Plugin::HelpWanted.

> It's actually parent (unless you really really like the fields pragma), although there isn't 100% consensus on this.

Ether, stop parroting unsubstantiated stuff. The "fields support" slowing down base.pm is a myth. Please read what Brendan linked and consider adjusting your reality distortion field: http://lists.scsys.co.uk/pipermail/dbix-class-devel/2012-December/000211.html

@Peter: I suppose this is another example of the original thesis, as my first assumption would be that if parent is in core, that it is intended to be used (and indeed its synopsis recommends that if fields are not in use, parent > base).

This discussion has been held several times over the years. A solution INSIDE the CPAN package is pointless here. For one: It doesn't address anything about EXISTING packages, which are 99% of the problem; new modules rarely have these problems. Next, if you mandate a definition inside the META file, then someone will NOT do it, or will do it wrong. What do you have then? RIGHT, exactly the same state as before.

As said (1-2 years) ago, when we had this very same heated discussion around MetaCPAN (why does no one care about previous discussions on such topics? Why do people just make new posts?!?! whatever...), it was clear that it must be outside info (like CPAN ratings, voting on MetaCPAN and so on).

If you want to provide a clean module world, then serve it to them on a silver platter so to speak, make a list for beginners or make a filtered CPAN display and let others point people there. And so on. YOU must provide a solution, and you can't expect the CPAN authors to do it right, that will just not work. Your entire post was nothing more than a refresh of the 10-year-old common issues of CPAN. Make an issue on MetaCPAN and suggest something visually cool. We already point all people to MetaCPAN, jump on this train. Or if you don't like that, make a new site that gives options for this.

The solution you propose is proposed roughly every year, and it continues to make no sense each year it is proposed.

Also about the personalized modules.... I have no idea why you think no one else should use Dist::Zilla::PluginBundle::Author::GETTY. There is no way to clearly define this. BTW in this particular case *ANYONE* who is going to contribute to a module of mine needs to install this package. Additionally it is documented and can be used for your own dist without any problem.

In the end, there is still so much difference between the modules. Some modules are bullcrap in combination with others; there are different ways of working and different concepts, which sometimes just can't work together. Do we add a flag for this as well, and who would follow all the flags then? This grows into an unlimited list, which could be solved differently in the code (not in CPAN). Whatever, I drift off to the bigger problems ;)

I totally agree with your post, I also totally agree with the last comment, less the dismissive tone. Having used C# and the .NET framework I can attest to its robustness and awesomeness. That being said, for the sake of brevity, what you're essentially looking for is curated library lists/bundles, which CPAN already has (e.g. Task::Kensho, etc) and you could even start your own.

Everything else is a matter of preference and you have many options. In keeping with the comparison to .NET, you could easily create a curated bundle on GitHub with instructions on installing the libraries without running tests and it'll be as fast of an install as .NET.

I think it would be "nice" to have something at least somewhat visible via the MetaCPAN UI, but it definitely has to be "external" data, provided by user submission / crowd sourcing / data analysis, NOT via dist-provided metadata.

I have enough trouble as it is getting people to just use features in the metadata that are standard and relics of antiquity, like abstract and license.

I personally use the MetaCPAN ++ system as a first cut. Of course it's still slightly underutilized and there are searches where nothing is ++ed. Still, if I see that one thing in my search has 100 ++ and nothing else has more than 2, I will consider that one first. This is a good plan.

I encourage everyone (I'm talking to me too) to ++ the modules you use regularly, it really can help the signal to noise.

(But don't use it to the exclusion of others, who knows where that next diamond in the rough will come from)

As the author of two of the six modules you point out in the first place, I feel I should write a little reply here.

As to the Socket vs. Socket6 question: Originally, Socket6 was written by someone to provide the various IPv6-related functions missing from core's Socket. Eventually I got around to taking over core's Socket and adding them properly there where they should be. This now makes the Socket6 module totally redundant. Should that now be marked? If so, where? I have no write access on Socket6, so I can't write it there. I could note it in Socket, if anyone would think to look there. Finally, I could tell Socket6's author "hi, you wrote this thing, but it's now totally redundant and not necessary; please delete it". I don't feel that will go down well at all.

However, most users should never need to do much poking around with Socket directly, as that's what the higher-level object wrappers are for. Which brings me on to my next point...

I wrote IO::Socket::IP to fix various shortcomings in ::INET6, which I felt didn't implement IPv6 wrapping properly. ::INET is the specifically IPv4-only module. Because much existing code expects that ::INET only ever deals with IPv4 addresses, that module cannot be changed to use IPv6 as well. Finally, IO::Socket is a base class that exists only for specific protocol modules to subclass to provide things like this.

So my question to you is simple: In each of Socket, and IO::Socket::IP's cases, what should I as the author have done? How can I have helped the situation, providing more information, and avoided the doubt you initially raised?

See this MetaCPAN feature request from one year ago. The more scoring indicators to be used for ranking in MetaCPAN, the better. How about explicit statements like "I prefer module Foo in favor of module Bar" to be used for ranking and as guides too?

Other than voting, the MetaCPAN or CPAN site should include a counter that tells how many times a particular module has been installed or downloaded. This is one of the measurements used by many users when there is a huge catalog. For example, mobile apps installed from an app store have this counter, and the Greasemonkey scripts installed from userscripts.org have something similar.

What I believe is that the CPAN model should be similar to a mobile app store, where the store gives information like reviews, ratings, number of downloads, other modules by the author, suggested modules and so on.

@brian d foy said:

> I am constantly and consistently surprised about how many
> people do not first try to work with existing projects.

For me, GitHub has been the single biggest factor in lowering the barriers to entry to making contributions to existing CPAN modules. I find I'm far more likely to contribute changes to an existing module when all I need to do is fork it and send a subsequent pull request. Maybe if more source code was handled via GitHub (or similar mechanism) there might be less of a tendency to reinvent the wheel as opposed to enhancing existing code.

Yet Another Problem with the ideas (not that I wish to discourage you--definitely keep pursuing these ideas, as they're much needed--but just to make sure you're thinking of all the issues) is that, while sometimes there is a "right" and a "wrong" module, sometimes there isn't. That is, sometimes there are two modules that both do the job and there just isn't a clear reason to prefer one over the other. Perhaps one is OO and the other isn't. Perhaps one has more features but the other has a simpler interface. Perhaps one works well for A but not at all for B, while the other works well for B but not at all for A. Sometimes we can make an argument that the two should be combined; sometimes that's less obvious.

I think the debate over base vs parent is a great example of where the lines are blurred. Peter says:

> The "fields support" slowing down base.pm is a myth. Please read what Brendan linked and consider adjusting your reality distortion field:

I did read what Brendan linked. I didn't take away from that that I should use base instead of parent at all. What I took away was that a) I should never try to use both in the same project, and b) while base is no longer slower than parent, it almost certainly takes up more memory than parent. So, as long as I make sure no `use base`s are slipping in, I still don't see any compelling reason to stop using `use parent`.

But the point is, you have the choice on CPAN, and choice is good. Except when it's not. Which I suppose is the point of this blog post ...

But, really: carry on. Everyone wishes it was better. Even the people who bitch you out. :-D


About Brendan Byrd

Programmer since I was 5. Perl since 1996. Parent since 2009.