OrePAN2 Processes MetaCPAN Lookups in Chunks

The issue

In the last month of 2016, I was assigned OrePAN2 in the CPAN Pull Request Challenge. When browsing its issues on GitHub, I discovered #47:

Right now we cannot easily rebuild a minicpan with a lot of modules because the MetaCPAN lookup fails.

The problem is here: OrePAN2::Indexer line 148

This code needs to break up the query after X releases have been pushed onto the @file_search stack. I don’t have number handy, but trying to rebuild the minicpan will yield it fairly quickly.

AC:

OrePAN2 can accommodate lookups for an arbitrary number of modules

For testing the MetaCPAN behaviour, see the use MetaCPAN subtest in t/06_inject_live.t.

The logic which needs to be tweaked is in OrePAN2::Indexer::do_metacpan_lookup(). We could create an accessor that sets a threshold on how many modules to search on in @search_by_archives. If the number of files we need to look up exceeds the threshold, then we need to loop over the MetaCPAN search logic in order to get everything we need. The accompanying test could inject 2 files into the $tmpdir and then use a very low threshold (like 1 archive) in order to force the looping behaviour.

If both releases are found in $orepan->_metacpan_lookup then we have a green light.

Cannot Reproduce

On one hand, the issue’s complexity seemed to be medium, exactly what I felt able to solve by the end of the month. On the other hand, the description smelled of micromanagement: the steps to fix the issue were explained in detail, but the issue itself wasn’t given much focus.

As usual, I wanted to reproduce the problem first, then write a failing test for it, and then make the test pass by fixing the code. I created several CPAN mirrors, but I wasn’t able to reproduce the problem. I thought 1000 would be the threshold, as both MetaCPAN::Client::Request and MetaCPAN::Client mention

size => 1000,

But MetaCPAN lookup worked for me even with 1020 distributions. So, after some hesitation, I decided to just follow the instructions.

The code that processed the lookup consisted of two consecutive loops. The first loop gathered information from all the releases, while the second one iterated over the modules corresponding to the releases. The chunked processing just wrapped both the loops with a simple

    while (@search_by_archives) {
        my @search_by_archives_chunk
            = splice @search_by_archives, 0, $self->metacpan_lookup_size;
        # ... the two original loops, now working on the chunk only ...
    }

As the issue requested, the test needed a way to set the threshold. Have you noticed the metacpan_lookup_size in the previous snippet? Yes, that’s it. The default is set to 200, but the test uses 1.

    my $orepan = OrePAN2::Indexer->new(
        directory => $tmpdir,
        metacpan => 1,
        metacpan_lookup_size => 1,
    );
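The chunking idea can also be tried in isolation, without touching MetaCPAN at all. The following is only a minimal sketch: Sketch::Indexer, chunked_lookup and the callback are hypothetical names made up for illustration, not the real OrePAN2::Indexer code, and the callback merely stands in for the actual MetaCPAN query. Only metacpan_lookup_size and its default of 200 come from the text above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the indexer; NOT the real OrePAN2::Indexer.
package Sketch::Indexer {
    sub new {
        my ($class, %args) = @_;
        # The default of 200 mirrors the value mentioned above.
        return bless { metacpan_lookup_size => 200, %args }, $class;
    }

    sub metacpan_lookup_size { $_[0]{metacpan_lookup_size} }

    # Calls $lookup once per chunk of archives and returns
    # the number of calls made. The threshold must be positive,
    # otherwise splice removes nothing and the loop never ends.
    sub chunked_lookup {
        my ($self, $lookup, @search_by_archives) = @_;
        my $calls = 0;
        while (@search_by_archives) {
            # splice removes at most metacpan_lookup_size elements
            # from the front, so the list shrinks on every pass.
            my @chunk = splice @search_by_archives, 0,
                $self->metacpan_lookup_size;
            $lookup->(@chunk);
            ++$calls;
        }
        return $calls;
    }
}

package main;

my @seen;
my $indexer = Sketch::Indexer->new(metacpan_lookup_size => 1);
my $calls   = $indexer->chunked_lookup(
    sub { push @seen, [@_] },    # stand-in for the MetaCPAN query
    'A-1.0.tar.gz', 'B-2.0.tar.gz',
);
print "$calls lookups\n";        # prints "2 lookups"
```

With a threshold of 1, two archives force two separate lookups, which is the looping behaviour the test is meant to exercise.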

The test passed (but it didn’t fail with the old code, either), so I created a pull request and asked for proper testing, including verification that the original issue was fixed wherever it was reproducible. The pull request was later merged, but I’m still not sure it really fixed the original problem. Is anyone able to reproduce the failure in the older version (0.45) and confirm that it is fixed in 0.46?

The Newton Tube Experiment

The situation reminds me of our physics teacher in high school: “Today, I’m going to show you the famous Newton Tube experiment. I have the tube here, it contains a ball and a feather. In the first part of the experiment, the tube is filled with air, and you can see that the ball falls faster than the feather. In the second part of the experiment, we should pump the air out of the tube and see the feather fall as fast as the ball. Unfortunately, my pump is broken, so we’ll skip this part. In the third part, the air is let back into the tube, and you can see the ball fall faster again. We’ve seen two thirds of the famous Newton Tube experiment.”

We had to believe, but we wanted to really see.

3 Comments

The reason the bug report is detailed about the fix is that it was created by the maintainer.

So, what you got was me dumping a story from an internal bug tracker into a public bug tracker (GitHub). The micromanagement you see was me detailing how to go about dealing with the issue so that other developers on the team at my $work who are not as familiar with the internals of OrePAN2 would quickly be able to find an appropriate starting point.

I did not add a reproducible test case because the easiest way for us to reproduce this was via our internal tooling. It was not a pressing issue, so I didn't spend a lot of time getting to the root of it, but I felt like there was enough detail for someone else to get a sense of what needed to be done.

Having said all that, this may or may not have been the actual bug which was causing the problems internally, but this part of the code still needed to be fixed. It was a naive assumption on my part that a MetaCPAN query could be arbitrarily large. The MetaCPAN API (or Elasticsearch itself) might impose limits at some point that could break this query, so it made sense as a form of future proofing to set an arbitrary limit on how many modules to query and then to loop over the results.

Thanks very much for your work on this. It's appreciated. It was as much a feature request as it was a potential bug fix.

About E. Choroba

I blog about Perl.