How can I grep all of BackPAN?

How can we make a Perl code search so I can grep all of CPAN? I would have done this with the now-dead Google Code Search, which used to make this part of the world's information universally accessible and useful.

Specifically, I want to look at every instance of META_MERGE in Makefile.PL in every distribution in BackPAN. I can easily program this task with my DPAN stuff since I already have a way to crawl CPAN and look in every distribution. That would take a couple of days to go through 250,000 files (although maybe I should try this with Archive::Extract::Libarchive, which speeds up the main bottleneck in this technique).
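To make the shape of that crawl concrete, here is a minimal sketch in Python rather than Perl. It is not the DPAN code. It assumes a local BackPAN mirror on disk, only handles `.tar.gz` archives (BackPAN also holds other formats), and the names `scan_archive` and `scan_mirror` are made up for illustration. It reads each Makefile.PL straight out of the archive instead of unpacking the whole distribution.

```python
import re
import tarfile
from pathlib import Path

# Byte pattern, since archive members are read as raw bytes.
PATTERN = re.compile(rb"META_MERGE")


def scan_archive(archive_path):
    """Yield (member name, line number, line) for each match in any Makefile.PL."""
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # Only regular files whose basename is exactly Makefile.PL.
            if not member.isfile() or Path(member.name).name != "Makefile.PL":
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            for lineno, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    text = line.decode("utf-8", "replace").rstrip()
                    yield member.name, lineno, text


def scan_mirror(root):
    """Walk a mirror directory (hypothetical layout) and scan every .tar.gz in it."""
    for archive in sorted(Path(root).rglob("*.tar.gz")):
        try:
            for name, lineno, text in scan_archive(archive):
                yield archive, name, lineno, text
        except (tarfile.TarError, OSError, EOFError):
            # Old uploads are sometimes truncated or corrupt; skip them.
            continue


if __name__ == "__main__":
    import sys

    for archive, name, lineno, text in scan_mirror(sys.argv[1]):
        print(f"{archive}:{name}:{lineno}: {text}")
```

Run it with the mirror directory as the only argument. Decompression is still the slow part, so this does not change the overall cost much; it only avoids writing every file to disk.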

I thought about GitPAN for a few seconds, but I want to search things that are not in HEAD, too.

The MetaCPAN API might be able to do it, but I don't think I can search on information that isn't already indexed. I can search most of the meta-things about Perl distributions, but I want to use a regex across all files in BackPAN.

Sometimes I've thought about a PPI-based search engine where you could search by what something is. For instance, I want to find all subroutines named log instead of just searching for the text "sub\s+log" and so on.

I'm sure someone in Perl-land already does something similar with all the right technologies that we could use to make this available to everyone.

I could just unpack all of BackPAN and use `find` and `grep`. That might actually be the easiest way to do it locally, where "easiest" is the least work for me to answer this single question. I'd have to wait a bit, but all that stuff is asynchronous.

6 Comments

Google Code Search "quietly" moved to http://code.google.com/codesearch

grep.cpan.me does not seem to include BackPAN at the moment. Of course, you could just clone the repo and run its code locally against BackPAN.

Tried it just now, looks like it doesn't index CPAN at all.

grep.cpan.me is awesome, but I think if you're going to run it for the backpan, you'll need a machine with insane amounts of RAM. Last I heard, the regular grep.cpan.me was using a good chunk of a 16GB RAM server. Backpan being many times larger, I think you'd want 96GB or more. Good luck.

Alternatively, do a full backpan extract on your large hard drive and install the released Google-code-search-alike software locally. Don't have the link handy, but if you can't find it, poke me and I'll get it from a co-worker who's a big fan.

I think, as Steffen says, using the released code that implements the trigram index that codesearch used (http://code.google.com/p/codesearch/) would be interesting. Depending on how well it works, I might be interested in adding that to grep.cpan.me.
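For readers who have not met the idea, a trigram index maps every three-character sequence to the set of files containing it, so a query only has to scan files that contain all of its trigrams. The following is a toy sketch in Python, not the codesearch implementation: it keeps everything in memory and handles only literal strings, whereas the real tool derives trigram queries from regular expressions. The class and function names are made up for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy in-memory trigram index for literal substring search."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> names of documents containing it
        self.documents = {}               # name -> full text, kept for the final check

    def add(self, name, text):
        self.documents[name] = text
        for gram in trigrams(text):
            self.postings[gram].add(name)

    def search(self, literal):
        grams = trigrams(literal)
        if grams:
            # A document can only match if it contains every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        else:
            # Queries shorter than three characters cannot use the index.
            candidates = set(self.documents)
        # Sharing all trigrams is necessary but not sufficient, so confirm each hit.
        return sorted(name for name in candidates if literal in self.documents[name])
```

The index only narrows the candidate set; the final substring check is what makes the result exact. The memory cost of the posting lists is also why an index over something the size of BackPAN needs a lot of RAM or an on-disk format.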

You might also be interested in this code that acme wrote and then I added threads to (yes I know, but it actually works quite well in this case): https://gist.github.com/1279181 – it can search a local CPAN fairly fast, should be possible to modify it to search a backpan.

About brian d foy

I'm the author of Mastering Perl, and the co-author of Learning Perl (6th Edition), Intermediate Perl, Programming Perl (4th Edition) and Effective Perl Programming (2nd Edition).