MetaCPAN operational view

This is the third in a series of articles about MetaCPAN. The first article described the two main parts that make up the MetaCPAN project, the API and the search interface. The second article gave a high level summary of how the API uses Elasticsearch to hold and search information about CPAN distributions and authors.

In this post we'll look at how MetaCPAN links to other parts of the CPAN ecosystem, how the physical setup has changed with MetaCPAN v1, and another service that v1 has made available.

This post is brought to you by Booking.com, our second platinum sponsor. Booking.com is one of the largest Perl shops in the world, and have done a lot to support our community over the years. Thank you to Booking.com for supporting meta::hack.

MetaCPAN's aggregation of CPAN data

As outlined in previous articles, one of the big advantages of MetaCPAN is that it aggregates information from a number of sources, and provides a common interface for accessing a unified view of metadata about CPAN modules, distributions, and authors. Hence the Meta in MetaCPAN.

The following shows the main data sources:

MetaCPAN data sources

The different sources update at different rates, so MetaCPAN polls their information at different rates. This is why you might notice that some of the information for your release has been updated, but other bits haven't.

Before we go any further, this seems a good time to remind you that MetaCPAN is open source. All systems under the MetaCPAN umbrella live in GitHub, under the metacpan project (https://github.com/metacpan). The two main repositories you'll want to look at are:

The MetaCPAN project has a dedicated core team, but if you look at the contributors page, you'll see that 123 different people have submitted at least one pull request. Maybe you could be the 124th?!

Watching CPAN

A lot of MetaCPAN's data comes from CPAN itself. To enable that, MetaCPAN acts as a CPAN mirror. We use rrr-client to keep an up-to-the-minute (actually a bit faster than that!) copy of the CPAN master mirror. This happens even before the content has been fully indexed by PAUSE. Once processed, the distribution is instantly available under MetaCPAN's recent uploads page — as an author you may have been surprised at how quickly your releases appear on MetaCPAN's recent releases page.

In addition to the releases themselves, there are three important files that MetaCPAN gets from CPAN as well:

  • The CPAN Index lists CPAN modules and the latest release that contains them. This is the same index used by CPAN clients when you ask to install a specific module.
  • The file 06perms.txt lists all modules and who has permissions to upload new versions of them. In fact, anyone can theoretically upload a new version of a module, but if you don't have permissions for the module, your version won't be listed in the CPAN Index.
  • The 00whois.xml file lists all PAUSE users. That way MetaCPAN can display author pages for people who haven't currently got any releases on CPAN.

MetaCPAN updates this information twice an hour, so it may take up to 30 minutes for a new release to appear fully on MetaCPAN.

When looking at a distribution or module on MetaCPAN, you'll notice the following in the left-hand sidebar:

The "Issues" link is to the bug tracker used for the distribution, typically either RT or GitHub. Notice that MetaCPAN lets you know how many open issues there are. RT (rt.cpan.org) provides a text file which has this information for all distributions in one go.

The "Testers" line shows a summary of the CPAN Testers results for the release you're looking at: the three numbers are number of passes, number of fails, and number of unknowns. Doug Bell is the current lead of the CPAN Testers project, and he was also at meta::hack. You can read his blog post about the work he did to make it easier for MetaCPAN to take in CPAN Testers data.

The reviews line is a summary of any reviews on the CPAN ratings web site. The Kwalitee link will take you through to the CPANTS summary page for the distribution, with details of its Kwalitee score (an indication of how well the distribution is following CPAN conventions).

This side panel gives a useful summary when you're considering depending on. Does it appear to be actively maintained? What do other people think of it? How reliable is it across a range of operating systems and Perl versions?

There are various scripts which pull information from these sources. Behind the scenes, we’re now using Minion to manage the various jobs that are involved in the indexing process.

Physical organisation

Up to now (ie MetaCPAN v0), MetaCPAN has been running on a single server, which hosted Elasticsearch, the MetaCPAN API, and the search interface. This obviously had implications for the reliability of MetaCPAN, and one of our key goals for v1 was to make the service more resilient.

With MetaCPAN v1 (which became live during meta::hack), Elasticsearch is now running in a cluster of three nodes. Here's a simplified view of the setup that we'll be running under very soon:

MetaCPAN machine cluster

Fastly have signed up as a MetaCPAN sponsor, and will be providing load balancing and caching for both the search interface and the API (we're not quite there with spreading the search interface and API across the three servers, but hopefully that will happen soon).

CPAN Index as a service

MetaCPAN v1 has also introduced a new download_url endpoint, which cpanm (and other CPAN clients) can use in order to get the package URL for a module. For example, to get the download URL for HTTP::Tiny

fastapi.metacpan.org/v1/download_url/HTTP::Tiny

If you go to this URL, you'll (at the time of writing) get back the following JSON document:

{
   "version" : "0.070",
   "status" : "latest",
   "date" : "2016-10-10T03:25:33",
   "download_url" : "https://cpan.metacpan.org/authors/id/D/DA/DAGOLDEN/HTTP-Tiny-0.070.tar.gz"
}

You can see that this essentially provides the CPAN Index (and a bit more) as a service, mapping a module name to a release tarball and associated metadata. It supports searching by version, version ranges, dev releases and even declaring versions which you don’t want to have. For example:

https://fastapi.metacpan.org/v1/download_url/Moose?version===0.01
https://fastapi.metacpan.org/v1/download_url/Moose?version=!=0.01
https://fastapi.metacpan.org/v1/download_url/Moose?version=<=0.02
https://fastapi.metacpan.org/v1/download_url/Try::Tiny?version=>0.21,<0.27,!=0.24
https://fastapi.metacpan.org/v1/download_url/Try::Tiny?version=>0.21,<0.27&dev=1
https://fastapi.metacpan.org/v1/download_url/Try::Tiny?version=>0.21,<0.27,!=0.26&dev=1

Thanks very much to Clinton Gormley for writing this at the QA Hackathon in Berlin. We’ve now finally been able to release it.

About Booking.com

Booking.com B.V., part of the Priceline Group (Nasdaq: PCLN), owns and operates Booking.com, the world leader in booking accommodations online. Each day, over 1,200,000 room nights are reserved on Booking.com. Booking.com has supported Perl in countless ways over the years, and employs many well-known CPAN authors, including Sawyer X, Steffen Mueller, Philippe Bruhat, Mickey Nasriachi, Rafaël Garcia-Suarez, Yves Orton, Stevan Little, and an awful lot more (you can see a hopefully complete list in the source of ACME::CPANAuthors::Booking).

Leave a comment

About Neil Bowers

user-pic Perl hacker since 1992.