Perl Toolchain Summit 2019 - CPAN Dependencies Graph

By Grinnz on April 30, 2019 6:59 PM

I was grateful to attend the Perl Toolchain Summit for the first time, held this year at Bisham Abbey in Marlow, UK. I got to meet many of the talented and persistent contributors to the Perl CPAN infrastructure, and also to see a country outside North America for the first time. The Perl Toolchain Summit is a great event, made possible by the organizers and sponsors, that enables contributors to the Perl CPAN infrastructure to get together and do important work. You can see the results of my project this year at https://cpandeps.grinnz.com.

The Project

I decided for my first project, which took the majority of the summit, to work on a replacement for the Stratopan dependency graphs, which have unfortunately gone AWOL. I used these graphs often from MetaCPAN to visualize the dependency impact of a CPAN distribution.

First, some background information on CPAN dependencies. A CPAN distribution, a set of modules uploaded together in one release, specifies its dependencies on modules (not on distributions). This allows you to specify what modules you use, and not care about what distribution they may be provided by, which is especially useful if they change distributions later. This is also important because CPAN permissions and indexing are maintained by module name; you don't want to depend on something and later have a rogue upload take precedence when people try to install your dependencies. Dependencies are retrieved the same way as when you initially try to install a module with a CPAN client: the package index is used to determine what distribution release provides the latest version of that module, and that tarball is downloaded and installed (with its dependencies, and so on). Thus, a dependency graph would show the set of distributions that are ultimately required to install and use a distribution, via these intermediary module dependencies.
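As a rough illustration of the resolution described above, here is a minimal sketch in Python (the site itself is written in Perl). The package index and the dependency metadata are small hypothetical dictionaries, not real CPAN data, and the function name is made up for this example.

```python
from collections import deque

# Hypothetical package index: module name -> distribution that provides it.
PACKAGE_INDEX = {
    "Foo::Bar": "Foo-Bar",
    "Foo::Bar::Util": "Foo-Bar",
    "Baz": "Baz",
    "Qux::Helper": "Qux",
}

# Hypothetical metadata: distribution -> modules it depends on.
DIST_REQUIRES = {
    "Foo-Bar": ["Baz", "Qux::Helper"],
    "Baz": ["Qux::Helper"],
    "Qux": ["Foo::Bar::Util"],  # deliberate cycle back to Foo-Bar
}


def dependency_graph(root):
    """Return {distribution: set of distributions it directly depends on}."""
    graph = {}
    queue = deque([root])
    while queue:
        dist = queue.popleft()
        if dist in graph:
            continue  # already resolved; this also stops cycles from looping
        deps = set()
        for module in DIST_REQUIRES.get(dist, []):
            # Dependencies name modules, so map each one to its distribution.
            provider = PACKAGE_INDEX.get(module)
            if provider is not None and provider != dist:
                deps.add(provider)
        graph[dist] = deps
        queue.extend(sorted(deps))
    return graph
```

Calling `dependency_graph("Foo-Bar")` walks outward from the root one distribution at a time. The `if dist in graph` check is what keeps the traversal finite when two distributions depend on each other.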

To start, I decided on my approach. Dependency graphs are constructed from source data that must be recursively requested, and are accessed far more often than the source data changes, so a caching layer is a no-brainer. I picked Redis, as it is a breeze to use and extremely performant for this type of task. For the graph display, I considered what type of representation would best suit the data; the simple circular display used by the Stratopan graphs was quite hard to glean information from aside from the total size of the dependency chain, so I aspired to make something more useful. Because a dependency chain is not a simple tree, and may have distributions with the same dependency, or even distributions that depend on something higher up in the tree, I looked for solutions for displaying a directed network graph (it is not quite a Directed Acyclic Graph, or DAG, because there may be cycles, but in most cases the DAG representation works fine). I also wanted it to be interactive, so a JavaScript library would be needed, and most visualization libraries for the web are in JavaScript anyway. After reviewing a few different JavaScript libraries, I settled on Cytoscape.js as it (currently) seems to be maintained and to provide the style I was looking for, with lots of configurability, and the ability for the user to move nodes around, zoom and pan.

Day One

With that design in mind, on the first day of the summit, I set up the caching layer. As in my other projects of a similar web nature, I took the approach of building a Mojolicious::Lite script and tacking on commands, plugins, templates, and static files as needed. I would want a cron job to easily be able to pull and cache the dependency data periodically to keep it up to date, so I added a command to wrap this functionality.

The actual dependency data would be pulled from MetaCPAN via MetaCPAN::Client, and stored in Redis with Mojo::Redis. To get the data from the Mojolicious application to the JavaScript library to display, I created an API endpoint that returned JSON representing the nodes to display, and wrote some JavaScript that read the distribution name from the URL parameter, and used the Fetch API to query the API, then initialized the cytoscape.js object with the response. Immediately I found the main limitation of creating a graph in this way: it requires a div of a fixed size in which to generate the graph, thus it does not resize itself after the initial page load, and deciding what size to make the div might be tricky. It is not a huge problem, so I continued on.
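To make the JSON hand-off concrete, here is a hedged sketch in Python of turning a dependency mapping into the flat element list that Cytoscape.js accepts, where nodes carry an `id` and edges carry an `id`, `source` and `target` inside `data`. The exact shape of the site's own JSON response is not described here, so the function name and the edge id format are assumptions.

```python
import json


def to_cytoscape_elements(graph):
    """Convert {dist: set of dependency dists} into Cytoscape.js elements."""
    # Include dependency targets even if they have no entry of their own.
    ids = set(graph)
    for deps in graph.values():
        ids.update(deps)
    nodes = [{"data": {"id": dist}} for dist in sorted(ids)]
    edges = [
        # The "src->dst" edge id is an arbitrary convention for this sketch.
        {"data": {"id": f"{src}->{dst}", "source": src, "target": dst}}
        for src in sorted(graph)
        for dst in sorted(graph[src])
    ]
    return nodes + edges


def to_json(graph):
    """Serialize the elements as the body of a JSON API response."""
    return json.dumps(to_cytoscape_elements(graph))
```

On the browser side, a list like this can be passed as the `elements` option when the Cytoscape.js instance is created.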

Day Two

On the second day of the summit, I worked on the caching functionality. I initially had been using the MetaCPAN API to retrieve both the dependencies of each distribution (which are module names), as well as the distributions that each of these modules belonged to, in order to continue populating the graph. But this led to some inaccurate data; some modules may be provided by multiple "latest" distribution releases according to MetaCPAN, but only one of them is canonical, and this ruling is made by the CPAN packages index (also known by its filename, 02packages). As it happens, I had already set up an API into the packages index, but I did not want to query it for individual module names one at a time -- this would be far too much overhead. So I added functionality to the CPAN Meta Browser API to be able to retrieve the packages index data for multiple modules in one query, and set up the dependency caching to use that API to retrieve distribution names.

One more problem became apparent when many core modules started showing up in my dependency trees. The Stratopan graphs hid core modules, and it was a good idea, because you wouldn't need to install these, as long as the version of Perl you were using provided a new enough version to satisfy the dependency. But I wanted to allow the user to specify what version of Perl they were using and make the judgment based on that, so I needed to amend the caching to also store each module and version dependency individually along with the distribution they mapped to. Then on the request side, I used Module::CoreList to determine which dependencies could be excluded from display.
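The core-module check can be sketched as follows, in Python rather than with Module::CoreList itself. The table below is a hypothetical stand-in for the data Module::CoreList provides, with invented module names and versions, and the sketch ignores details such as modules later removed from core.

```python
# Hypothetical table: module -> [(perl version, bundled module version)],
# sorted by ascending perl version. Versions are tuples so they compare
# numerically instead of as strings.
CORE_VERSIONS = {
    "Example::Core": [((5, 8, 0), (1, 0)), ((5, 20, 0), (1, 5))],
}


def satisfied_by_core(module, required, perl):
    """True if the given perl ships `module` at version >= `required`."""
    bundled = None
    for perl_version, module_version in CORE_VERSIONS.get(module, []):
        if perl_version <= perl:
            bundled = module_version  # newest entry not newer than this perl
    return bundled is not None and bundled >= required


def visible_dependencies(requirements, perl):
    """Drop the requirements that the chosen perl already satisfies."""
    return {
        module: version
        for module, version in requirements.items()
        if not satisfied_by_core(module, version, perl)
    }
```

Storing each module and its required version separately is what makes this check possible at request time: the same cached data can be filtered differently for each Perl version the user selects.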

Day Three

On the third day of the summit, I worked on the interface. I wanted the user to be able to tweak different settings for the graph: whether it would show build or test dependencies in addition to runtime, whether it would show recommended or suggested dependencies, and now what Perl version's core dependencies should be hidden. I also wanted the user to be able to choose different graph styles, as I started to play with distributions that had wildly different layouts of dependencies and found that different graph styles worked better for different distributions. I ended up defaulting to an 'auto' style which would choose the top-down style unless any distribution has more than 10 dependencies, in which case the concentric-circle layout seems to be more useful; I may tweak this 'auto' style further as needed. I added a form to change these settings, using Bootstrap so that I did not have to think about styling, as I have done in my other sites. As the JavaScript was already reading the distribution name from the URL query parameters, I simply had the form submit with the GET method, and had the JavaScript read the additional settings from the query.
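A minimal sketch of the 'auto' rule and of reading settings from a GET query string, written in Python for illustration (on the site this happens in JavaScript). The query parameter names and the style labels `topdown` and `concentric` are assumptions; only the threshold of 10 comes from the description above.

```python
from urllib.parse import parse_qs

AUTO_THRESHOLD = 10  # more than this many dependencies switches the style


def choose_style(requested, graph):
    """Resolve 'auto' to a concrete style; pass other choices through."""
    if requested != "auto":
        return requested
    if any(len(deps) > AUTO_THRESHOLD for deps in graph.values()):
        return "concentric"
    return "topdown"


def settings_from_query(query):
    """Read graph settings from a query string (hypothetical names)."""
    params = parse_qs(query)
    return {
        "dist": params.get("dist", [""])[0],
        "style": params.get("style", ["auto"])[0],
        "perl_version": params.get("perl_version", [None])[0],
    }
```

Because the form submits with GET, the whole state of a graph lives in the URL, which also makes any particular view easy to link to and share.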

I also tackled the fixed div problem: because the graph needed a fixed space, and the Bootstrap form would not necessarily be a fixed size, I used a viewport-percentage width and height to maximize the space the graph would render in, giving it 85% of the viewport height on load, and wrapping the form in a div using the rest of the height, allowing it to create a scrollbar if it overflows in smaller viewports. It is not a perfect layout solution, but it is good enough for now.

I was also inspired by Andreas König, author and maintainer of PAUSE and the core CPAN client, who suggested an output format like that of the CPAN Dependencies site at CPANTesters, set up by David Cantrell, which was unfortunately not working during the summit, so I could not use it for comparison. It showed dependencies in a tabular layout where each level of dependency is indented, and I found that adding such a view option would be a simple matter of adding another template to consume the data retrieved from the cache. I first set it up as a separate page from the graph, but as I found that I mostly wanted to share the same form as the graph page, I integrated it as another "style" that could be chosen on the main page. In this style the template populates the table directly, so no JavaScript is needed to display it. It is quite unwieldy for large dependency trees, but a nice option for smaller ones.
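The indented layout can be produced by a depth-first walk over the same dependency data. This is a sketch in Python under the assumption that the data is available as a mapping from each distribution to its direct dependencies; the function names are hypothetical and the real site renders this through a template.

```python
def indented_rows(graph, root):
    """Return (depth, distribution) rows for an indented dependency table."""
    rows = []

    def walk(dist, depth, path):
        rows.append((depth, dist))
        if dist in path:
            return  # cyclic reference: show it once, but do not expand again
        for dep in sorted(graph.get(dist, ())):
            walk(dep, depth + 1, path | {dist})

    walk(root, 0, frozenset())
    return rows


def render_text(rows, indent="  "):
    """Render the rows as plain indented text, one distribution per line."""
    return "\n".join(indent * depth + dist for depth, dist in rows)
```

A distribution that several others depend on appears once under each of them, which is why this view grows quickly for large dependency trees.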

Finally, although I had run the command to go through and cache all distributions currently known to MetaCPAN, which took some hours, continuous updates would be needed so the graphs would show correct data as things change on CPAN. With help from Mickey, the author of MetaCPAN::Client who was conveniently sitting nearby, I added an option to the caching command to cache all releases since a certain time, and set up a cron job to cache new releases every 3 hours.

Day Four

By the last day of the summit, I was mostly satisfied with the status of the CPAN Dependencies Graph site, which is now live at https://cpandeps.grinnz.com. I added a cron job that caches random distributions each day with the hope that it would eventually take care of situations where a dependency module changes distributions, though I may need a better solution for this. I also quickly added a feature where entering a name containing a double-colon (which would clearly be a module name) will redirect to the graph for the distribution providing that module. It may be that the project could be moved to MetaCPAN infrastructure in the future for increased reliability, but it is rather resource light and so my VPS is more than capable of hosting it for now.
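The module-name shortcut amounts to a small check before the graph lookup. Here is a hedged Python sketch; the function name is hypothetical and the index is passed in as a plain dictionary rather than queried from the packages index.

```python
def distribution_for(name, package_index):
    """Map user input to a distribution name, or None if it is unknown.

    A name containing '::' can only be a module, so it is looked up in the
    index; anything else is treated as a distribution name as-is.
    """
    if "::" in name:
        return package_index.get(name)
    return name
```

A single-word module name cannot be told apart from a distribution name this way, so only names with a double-colon trigger the lookup.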

I started work on another project I had considered for the summit: refactoring perldoc.pl's rendering to allow it to cache the HTML rendered from POD, so it does not have to be rendered on demand for each request. This is a problem for large pages like perltoc which can take an excessively long time to render from POD, and simply unnecessary work since each Perl version's POD will not change. I managed to organize the code by the end of the day such that I will be able to hook in for both storing the rendered HTML, and retrieving it to stuff into the template when requested.
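The store-and-retrieve hook described above follows a familiar pattern, sketched here in Python. This is an illustration, not perldoc.pl's actual code: the renderer is any function passed in by the caller, and a plain dictionary stands in for whatever storage is eventually used.

```python
def make_cached_renderer(render, cache):
    """Wrap `render` so each (perl version, page) is rendered only once."""

    def cached(perl_version, page, pod):
        key = (perl_version, page)
        if key not in cache:
            # Safe to keep indefinitely: a released version's POD is fixed.
            cache[key] = render(pod)
        return cache[key]

    return cached
```

Keying on the Perl version as well as the page name means the same page for two different Perl versions is cached separately, and no invalidation logic is needed.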

Closing

I am very glad for the opportunity for myself and others to work on the toolchain of Perl in this focused environment. The event ran quite smoothly thanks to the efforts of the organizers (Neil Bowers, Philippe Bruhat, and Laurent Boivin), and would not be possible without the sponsors.