An introduction to MetaCPAN's use of Elasticsearch
This is the second in a series of articles we're writing to celebrate meta::hack, our first MetaCPAN hackathon, which is currently (Nov 17th through 21st) taking place in Chicago.
This hackathon was by invitation only, since it had a very specific goal: completing migration of the live service to MetaCPAN v1 (which includes a major Elasticsearch upgrade, from 0.20 to 2.4, or nearly 70 stable releases forward). Once that's done, any remaining time will be spent fixing bugs, and discussing what comes next. The attendees are Olaf Alders (founder of MetaCPAN), Mickey Nasriachi, Leo Lapworth, Tom Sibley, Joel Berger, Doug Bell, Brad Lhotsky and Zach Dykstra. Matt Trout is contributing remotely.
This post is brought to you by cPanel, a platinum sponsor for meta::hack. cPanel are a well-known user and supporter of Perl, and we're very grateful for their support. More about cPanel at the end of this article.
Many of you reading this will be familiar with RDBMS concepts such as databases, tables, columns, and indices. Some of those terms are used with Elasticsearch, but with different meanings.
A cluster is one or more servers (nodes) that hold all the data for your application. Currently MetaCPAN has a single server; with the switch to version 2 of Elasticsearch we will also move to a 3-machine cluster. This will improve performance and reliability (we've had a few outages over the last few years).
An index is a collection of related data. You might have one index for product data, and another for user data, for example. It most closely maps to a database in an RDBMS. A cluster can host one or more indexes.
Each index contains one or more types. A type is a logical partition of your data, similar to a table in the RDBMS world.
A type is a collection of JSON documents.
You can read more about Elasticsearch concepts on Elastic's web site.
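To make the hierarchy concrete: in the Elasticsearch 2.x REST API a single document is addressed by its index, then its type, then its ID. The sketch below only builds that path. The index, type, and ID used are hypothetical names chosen for illustration, and `document_path` is a helper invented here, not part of any Elasticsearch client.

```python
def document_path(index, doc_type, doc_id):
    """Return the REST path for one document: /<index>/<type>/<id>."""
    return f"/{index}/{doc_type}/{doc_id}"


# "products", "book" and "42" are made-up example values.
print(document_path("products", "book", "42"))  # prints /products/book/42
```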
MetaCPAN currently has two indexes.
The main index, currently known as cpan_v1, holds all the information about CPAN distributions, modules, authors, and related information. This index holds most of the data which is exposed via the API.
The second index holds information about users (of MetaCPAN, not to be confused with CPAN authors, even though many people are both CPAN authors and MetaCPAN users). This is where we store private information such as session data, and references to other user accounts (GitHub, Twitter, etc). The public API exposes the data in this index only to the logged-in user it belongs to, and the search interface uses it to manage user sessions.
For example, to see your user data (you won’t be able to access someone else’s), log in to v1.metacpan.org (note this URL will redirect once we are live) and then open fastapi.metacpan.org/v1/user/ in your browser.
MetaCPAN defines a number of Elasticsearch types, most of which represent CPAN entities you're probably familiar with (read the CPAN Glossary if you're not). The API can be used to get the data associated with any particular entity (module, release, author) and the search interface renders that information into a web page, as we'll illustrate below.
The author type contains information about CPAN authors, which is aggregated from a number of places, with some of it provided by the users themselves. This is the basis of the author pages provided by the search interface. For example, Olaf Alders has PAUSE ID OALDERS. His public author page on MetaCPAN is https://metacpan.org/author/OALDERS. You can see the JSON document for Olaf in the author type via https://fastapi.metacpan.org/author/OALDERS.
The release type contains information about all CPAN releases. When you search for a distribution on MetaCPAN, you end up looking at the latest release. For example, the HTTP::Tiny module is released in the HTTP-Tiny distribution. You can look at the release page at https://metacpan.org/release/HTTP-Tiny, and you can see the JSON document for the latest release of HTTP-Tiny at https://fastapi.metacpan.org/release/HTTP-Tiny. If you look at the JSON document, you'll recognise that a lot of the data comes from the metadata files included in the release (META.json and META.yml).
The module type has information about one particular CPAN module (remember that a release can contain more than one module). You've probably read about Buddy Burden's Date::Easy module. You can see the JSON document for this module at https://fastapi.metacpan.org/module/Date::Easy, and the MetaCPAN page for the module is https://metacpan.org/module/Date::Easy. You can see that some of the information shown on that page actually relates to the distribution, Date-Easy.
The file type has information about files in CPAN distributions. Modules are also files, so hopefully you won't be surprised to find out that, from an Elasticsearch perspective, a module is just a file.
The other types include distribution, rating, and favorite.
The simplest way your code can get information from the MetaCPAN API is to construct URLs like those shown above, and then use JSON::MaybeXS to convert the JSON document into a Perl data structure. In a later post in this series we'll show how you can use the MetaCPAN API to search for things more flexibly and iterate over the results.
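The approach just described, building a URL and then decoding the JSON, can be sketched as follows. This is a minimal illustration in Python using only the standard library; in Perl the decoding step would use JSON::MaybeXS as mentioned above. The base URL is copied from the examples in this article, and the function names (`api_url`, `decode_document`, `fetch_document`) are made up for this sketch. If the live service expects a different prefix, such as the `/v1` seen in the user URL earlier, pass a different `base`.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as written in the article's examples (an assumption about the
# live service; override it via the `base` argument if needed).
API_BASE = "https://fastapi.metacpan.org"


def api_url(doc_type, name, base=API_BASE):
    """Build a URL such as <base>/author/OALDERS or <base>/module/Date::Easy."""
    # Leave ':' unescaped so module names appear as they do in the article.
    return f"{base}/{doc_type}/{quote(name, safe=':')}"


def decode_document(body):
    """Convert the JSON text returned by the API into Python dicts and lists."""
    return json.loads(body)


def fetch_document(doc_type, name, base=API_BASE):
    """Fetch one document over HTTP and decode it (requires network access)."""
    with urlopen(api_url(doc_type, name, base), timeout=10) as response:
        return decode_document(response.read().decode("utf-8"))
```

For example, `fetch_document("release", "HTTP-Tiny")` would request the release URL shown earlier and return the decoded document. Which fields it contains depends on the type, so inspect the result before relying on any particular key.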
cPanel is a web-based control panel for managing a web hosting account. It provides a simple yet powerful interface for managing email accounts, security, domains, databases, and more. cPanel was originally written (in Perl!) by Nick Koston in 1997. cPanel (the company) now employs over 200 people, and Nick is their CEO. They've been using Perl for nearly 20 years now, and have long been supporters of Perl and its community. You may recognise some of their developers who are CPAN authors: Todd Rinaldo (TODDR), Reini Urban (RURBAN), Mark Gardner (MJGARDNER), Xan Hilmisdóttir (XAN), and others. Nick still develops in Perl today.