Working with the MetaCPAN API

This is the fourth article in a series about MetaCPAN. The first article described the two main parts that make up the MetaCPAN project: the API and the search interface. The second article gave a high-level overview of how the API uses Elasticsearch to hold and search information about CPAN distributions and authors. The third article showed how MetaCPAN fits into the rest of the CPAN ecosystem.

In this article we'll show how you can use the MetaCPAN API to get information about releases to CPAN. We'll start off with a very simple query, then gradually refine it to narrow down which releases are returned, and what information you request for each release.

This article is brought to you by Elastic, who were a Gold sponsor for meta::hack. We were very happy to have their support, especially given the central role that Elasticsearch plays in MetaCPAN.

Working with the API

The MetaCPAN API is a thin layer on top of Elasticsearch. As a result you can query it directly using the Search::Elasticsearch module from CPAN, which was written by Clinton Gormley, who works for Elastic. But for most uses we recommend that you use MetaCPAN::Client, as it gives a more CPAN-oriented interface.

In the second article, we introduced Elasticsearch types, which are a way of partitioning an index according to what information is in each document. The main types in MetaCPAN are release, module, author, and file. For this article we're going to look at the release type.

Each document in the release type represents one release to CPAN. Here's a URL which will let you look at the full document for the most recent release of David Golden's HTTP-Tiny distribution:

fastapi.metacpan.org/release/HTTP-Tiny

In the Elasticsearch world, when you run a query you get back a scroller, which lets you step through the results in batches. MetaCPAN::Client handles the scrolling for you and gives you an iterator, which in our examples lets you process the results one release at a time.
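
To make the difference between a scroller and an iterator concrete, here is a minimal sketch in plain Perl. It is not how MetaCPAN::Client is implemented: fetch_page and make_iterator are hypothetical names, and the hard-coded "pages" stand in for the batches that a scroll request would return.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one scroll request: returns the next batch
# of results as an arrayref, or an empty arrayref when nothing is left.
my @pages = (
    [ 'Foo-Bar 1.00', 'Foo-Baz 0.02' ],
    [ 'Quux 2.10' ],
);
sub fetch_page { return shift(@pages) // [] }

# Build an iterator that hides the batching: each call returns one
# result, fetching another page only when the buffer runs dry.
sub make_iterator {
    my @buffer;
    return sub {
        @buffer = @{ fetch_page() } unless @buffer;
        return shift @buffer;    # undef once every page is exhausted
    };
}

my $next = make_iterator();
my @seen;
while ( defined( my $item = $next->() ) ) {
    push @seen, $item;
    print "$item\n";
}
```

The calling loop never sees page boundaries, which is the same convenience the next method gives you in the examples that follow.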

Give me all releases

Let's start with the simplest query: give me all releases.

use strict;
use warnings;

use MetaCPAN::Client ();

my $mc              = MetaCPAN::Client->new;
my $release_results = $mc->all('releases');

while ( my $release = $release_results->next ) {
    printf "%s v%s\n", $release->distribution, $release->version;
}

This will take a long time to run, as it will eventually work through everything ever released to CPAN, and for each release it returns the full document for that release.

You almost never want this kind of overkill, so we'll narrow down to the specific releases we're interested in, and then only ask for the information we're going to use. This reduces the load on the API, so please ask only for what you need.

Only on CPAN

Not everything that's ever been released to CPAN is still on CPAN. When a release has been superseded, it is often (but not always) deleted by the author. A BackPAN mirror is like a CPAN mirror, but it has a copy of everything ever released to CPAN. MetaCPAN uses the status 'backpan' to indicate a release that has been deleted from CPAN.

So now let's ask to only get the releases which are still on CPAN. We do this by excluding backpan releases:

use MetaCPAN::Client ();

my $mc              = MetaCPAN::Client->new;
my $release_results = $mc->release({ not => { status => 'backpan' } });

while ( my $release = $release_results->next ) {
    printf "%s v%s\n", $release->distribution, $release->version;
}

If you run this, you'll notice that for some distributions you'll see a lot of releases going by. When we said old releases are not always deleted by the author, that was a little white lie. In fact, old releases often sit in author directories, until the call goes out for a clean-up.

Latest release only

So instead, we'll ask for just the latest release for each distribution. The query line is now:

my $release_results = $mc->release({ status => 'latest' });

The 'latest' status means "the latest release of a distribution that's still on CPAN".

You can also pass 'cpan' for status, which effectively means "on CPAN, but isn't the latest release".

Requesting only the information you're going to use

At the moment we're still pulling back the full document for each release, even though we're not using most of it. We can pass a second argument to the release method, and specify the fields from the document that we're interested in:

my $release_results = $mc->release({ status => 'latest' },
                                   { fields => [qw/ distribution version /] });

This puts less load on MetaCPAN, and your code will be quicker, since far fewer bytes will be coming over the wire.

Constraining the search

For our final example, we're only interested in Olaf's modules, and only those whose metadata lists a git repository (on GitHub, for example). Here's the full example:

use strict;
use warnings;
use MetaCPAN::Client;

my $mc    = MetaCPAN::Client->new;
my $query = {
    all => [
        { author                      => 'OALDERS' },
        { status                      => 'latest' },
        { 'resources.repository.type' => 'git' },
    ]
};
my $limit = { '_source' => [qw/ distribution version resources /] };

my $release_results = $mc->release($query, $limit);

while ( my $release = $release_results->next ) {
    printf "%s v%s\n\t%s\n",
        $release->distribution,
        $release->version,
        $release->resources->{repository}->{web};
}

Notice that we're not using the 'fields' construct now, but are using '_source'. You might be tempted to try the following, as I was:

my $fields = { fields => [qw/ distribution version resources /] };

A restriction in Elasticsearch means that you can't list fields that have structured values (arrays or hashes, in Perl speak). If you look at the document for HTTP-Tiny, you'll see that the resources part of the document looks like this:

"resources" : {
    "homepage" : "https://github.com/chansen/p5-http-tiny",
    "bugtracker" : {
        "web" : "https://github.com/chansen/p5-http-tiny/issues"
    },
    "repository" : {
        "type" : "git",
        "url" : "https://github.com/chansen/p5-http-tiny.git",
        "web" : "https://github.com/chansen/p5-http-tiny"
    }
}

So instead we're using the Elasticsearch feature called "source filtering". For the simple data cases, this isn't as efficient (read: fast) as specifying fields, but it will work on all parts of the document. To keep things simple you could just always use '_source' to constrain what data is pulled back. It's useful to know about the "fields" option though, because you might see it used in MetaCPAN and other code.
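
As a rough mental model (not Elasticsearch's actual implementation), source filtering hands back the named parts of the stored JSON document with their nested structure intact. The sketch below mimics that in plain Perl for top-level keys only. filter_source is a hypothetical helper, and the document is a hand-written, cut-down version of the HTTP-Tiny example above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the requested top-level keys of a
# document, leaving any nested hashes or arrays untouched.
sub filter_source {
    my ( $doc, @wanted ) = @_;
    my %slim;
    for my $key (@wanted) {
        $slim{$key} = $doc->{$key} if exists $doc->{$key};
    }
    return \%slim;
}

# Illustrative data only, not a real API response.
my $doc = {
    distribution => 'HTTP-Tiny',
    status       => 'latest',
    resources    => {
        repository => {
            type => 'git',
            web  => 'https://github.com/chansen/p5-http-tiny',
        },
    },
};

my $slim = filter_source( $doc, qw/ distribution resources / );
print join( ', ', sort keys %$slim ), "\n";
print $slim->{resources}{repository}{web}, "\n";
```

The point to take away is that the resources entry survives as a whole nested structure, which is what the 'fields' option cannot give you.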

A couple of things to notice about this example:

  • You can have as many clauses as you like with the all operator.
  • You can match a particular field in the document by specifying the path to the field (as with 'resources.repository.type').
  • Where the field is a hash or an array in the JSON document, you'll get back a hashref or arrayref.
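
Because not every release declares a repository, code that digs into these nested structures may run into an undefined value. One defensive way to read them is sketched here on hand-written data rather than a live API response; repo_web is a hypothetical helper, not part of MetaCPAN::Client.

```perl
use strict;
use warnings;

# Hypothetical helper: return the repository's web URL from a
# "resources" hashref, or undef if any level is missing.
sub repo_web {
    my ($resources) = @_;
    return unless ref $resources eq 'HASH';
    my $repo = $resources->{repository};
    return unless ref $repo eq 'HASH';
    return $repo->{web};
}

# Illustrative data: one release with a repository, one without.
my $with_repo = {
    repository => {
        type => 'git',
        web  => 'https://github.com/chansen/p5-http-tiny',
    },
};
my $without_repo = { homepage => 'https://example.com/' };

for my $resources ( $with_repo, $without_repo ) {
    my $web = repo_web($resources);
    print defined $web ? "$web\n" : "(no repository listed)\n";
}
```

In the full example you could pass $release->resources to a helper like this instead of dereferencing the hash directly in the printf.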

There are more examples available online, and if you get stuck, there's usually someone who can help you on the #metacpan IRC channel on irc.perl.org.

Thanks to Olaf and Mickey and others from the MetaCPAN team for their help writing this series, and catching my errors.

About Elastic

Elastic is the company behind the Elasticsearch open source search and analytics engine, along with Logstash, Kibana (which lets you visualize your Elasticsearch data), and Beats (lightweight agents for sending data to Logstash or Elasticsearch). The company was founded in 2012, though work started on Elasticsearch in 2010. Elastic have supported Perl and MetaCPAN in a number of ways. For example, one of their employees, Clinton Gormley (DRTECH on CPAN), helped Olaf start the work to move to Elasticsearch v2 at the 2015 QA Hackathon.

About Neil Bowers

Perl hacker since 1992.