Recreating a Perl installation with MyCPAN

A goal of the MyCPAN work was to start with an existing Perl installation and work backward to the MiniCPAN that would re-install the same thing. I hadn't had time to work on that part of the project until this month.

I've had the first step for a while: a database of all the information I can collect about each file in the 150,000 distributions on BackPAN, which hold about 3,000,000 candidate Perl module or script files. That information includes basics such as the MD5 digest of the file, the file size, the Perl packages declared in the file, and the package versions.
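As a rough illustration of the per-file facts described above, here is a minimal sketch in Python (chosen only for illustration; it is not the MyCPAN code). The function name `file_facts` and the two regular expressions are assumptions, and they catch only the simplest `package` and `$VERSION` forms. Real Perl source needs far more careful handling.

```python
import hashlib
import re

# Simplified patterns (assumptions): they only catch the plainest forms,
# such as `package Foo::Bar;` and `our $VERSION = '1.02';`.
PACKAGE_RE = re.compile(r"^\s*package\s+([A-Za-z_]\w*(?:::\w+)*)", re.MULTILINE)
VERSION_RE = re.compile(r"\$VERSION\s*=\s*['\"]?([\d._]+)")


def file_facts(source: bytes) -> dict:
    """Collect the basic facts about one candidate Perl file."""
    text = source.decode("utf-8", errors="replace")
    version = VERSION_RE.search(text)
    return {
        "md5": hashlib.md5(source).hexdigest(),
        "size": len(source),
        "packages": PACKAGE_RE.findall(text),
        "version": version.group(1) if version else None,
    }


example = b"package Foo::Bar;\nour $VERSION = '1.02';\n1;\n"
facts = file_facts(example)
```

The same function can be run over both the BackPAN files and the files in an installation, so the two sets of records are directly comparable.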

The next step is what I've been doing this week: collect the same information on the files in a Perl installation, which is much easier to do. There's no wacky distribution stuff involved.

Putting those two together should find the distributions that could make up the installation. With that list of distros, it's just a matter of creating the right 02packages file that a CPAN client can use. Easy peasy, I thought.

But, it's not that easy. Each file in the existing installation might have come from several distributions. That is, between different versions of a distribution, it's likely that many of the modules didn't change. So, looking at a single file doesn't lead to a single distribution. It might list several possible distributions.
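To make the one-file, many-distributions problem concrete, here is a small sketch that indexes records by digest. The record layout (a digest paired with a distribution name), the placeholder digests, and the version numbers are all made up for illustration.

```python
from collections import defaultdict


def build_digest_index(records):
    """Map each file digest to every distribution that ships that exact file."""
    index = defaultdict(set)
    for digest, dist in records:
        index[digest].add(dist)
    return index


# Hypothetical records: (digest, distribution). The digests are placeholders.
records = [
    ("aaa111", "Foo-Bar-1.00"),
    ("aaa111", "Foo-Bar-1.01"),  # the file did not change between releases
    ("bbb222", "Foo-Bar-1.01"),  # this file is new or changed in 1.01
]

index = build_digest_index(records)
candidates = index["aaa111"]  # both Foo-Bar releases are candidates
```

Looking up a single installed file returns a set of candidates rather than one answer, which is why the files have to be considered together.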

But that's a start. Other files from that distribution should be present, and they each might come from several distributions even if one of them changed. If there's any file that only belongs to one distribution, that collapses everything for that distribution. If not, I have to find the overlap in possible distributions. There should be one distribution that overlaps more than all of the others, and that should be the right distribution.
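One way to picture the overlap idea is as a vote: every installed file votes for each distribution it could have come from, and a file that belongs to only one distribution settles the question outright. The sketch below assumes a digest-to-distributions mapping like the one described above. The function names and sample data are hypothetical, and this is not necessarily how MyCPAN does it.

```python
from collections import Counter


def rank_candidates(installed_digests, index):
    """Count how many installed files each candidate distribution explains."""
    votes = Counter()
    for digest in installed_digests:
        for dist in index.get(digest, ()):
            votes[dist] += 1
    return votes


def pinned_by_unique_file(installed_digests, index):
    """Distributions identified by a file that appears in no other candidate."""
    pinned = set()
    for digest in installed_digests:
        dists = index.get(digest, set())
        if len(dists) == 1:
            pinned.update(dists)
    return pinned


# Hypothetical index: digest -> distributions containing that file.
index = {
    "d1": {"Foo-Bar-1.00", "Foo-Bar-1.01"},
    "d2": {"Foo-Bar-1.01"},
    "d3": {"Foo-Bar-1.00", "Foo-Bar-1.01"},
}
installed = ["d1", "d2", "d3"]

votes = rank_candidates(installed, index)
pinned = pinned_by_unique_file(installed, index)
```

With this sample data, Foo-Bar-1.01 explains all three installed files while Foo-Bar-1.00 explains two, and the file with digest `d2` pins Foo-Bar-1.01 by itself.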

That's not quite right either though, because some distribution versions don't change the module files. They update a test or the build file or something besides whatever is in lib. You'd think that at least the $VERSION would change, but think of any exception and you'll probably find it on BackPAN. That's not as horrible as it seems though. If all of the module files are the same, it doesn't matter which distribution I use, does it?

But then, there are some files that not only might come from more than one version of a particular distribution, but might also be in a completely different distribution. Some distributions have lifted files from other distributions. Files from the URI and LWP modules show up in other distributions. How should I figure out which one should be the candidate distribution?

The database I was using was just an extract of all of the information I have on each distribution and it's oriented to individual files. I select records to match up MD5 digests. However, when I get records back with different distributions, which one might be installed? If an installed file might have come from both Foo-Bar and Baz-Quux, I have to remove one of the distributions somehow. In that case, I have to step back to look at what else either distribution might have installed. If the other files from Foo-Bar aren't there, it's probably not Foo-Bar.
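The "step back and look at the other files" check can be sketched as a coverage score: the fraction of a distribution's module files that actually appear in the installation. The manifests and digests below are invented for illustration, and `coverage` is a hypothetical helper name.

```python
def coverage(dist_digests, installed_digests):
    """Fraction of a distribution's files that are present in the installation."""
    if not dist_digests:
        return 0.0
    return len(dist_digests & installed_digests) / len(dist_digests)


# Hypothetical manifests: distribution -> digests of its module files.
manifests = {
    "Foo-Bar": {"shared", "foo1", "foo2"},
    "Baz-Quux": {"shared", "baz1"},
}
installed = {"shared", "baz1", "other"}

scores = {dist: coverage(files, installed) for dist, files in manifests.items()}
```

Here the shared file alone cannot decide between the two, but only a third of Foo-Bar is present while all of Baz-Quux is, so Baz-Quux is the likelier source.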

That might be the end of the story, but what if both Foo-Bar and Baz-Quux are installed? That part I haven't figured out, but it's likely that the previous step will be inconclusive since the files from both distributions will all be there. However, there's also the chance that an older version of Foo-Bar and a newer Baz-Quux are there. If they both install a file, the older version in Foo-Bar might have been overwritten by an updated version from Baz-Quux. So, every file except one from Foo-Bar is there. That means there's possibly some path dependence, so I would have to make sure I install modules in the right order to recreate the installation.
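The post leaves this part unsolved, so the following is only one possible way to think about the ordering constraint, not a worked-out answer. If the copy of a shared file on disk matches Baz-Quux, then Foo-Bar must have been installed first. Those "loser before winner" pairs form a graph that a topological sort can order. The version numbers are made up, and `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def install_order(dists, overwrites):
    """Order distributions so each overwritten one is installed before its overwriter.

    `overwrites` holds (loser, winner) pairs: the winner's copy of a shared
    file is the one found on disk, so the loser must be installed first.
    """
    graph = {dist: set() for dist in dists}
    for loser, winner in overwrites:
        graph[winner].add(loser)  # the winner depends on the loser going first
    return list(TopologicalSorter(graph).static_order())


order = install_order(
    ["Baz-Quux-2.00", "Foo-Bar-1.00"],
    [("Foo-Bar-1.00", "Baz-Quux-2.00")],
)
```

If two distributions each overwrote a file of the other, the constraints would be circular and `static_order` would raise `graphlib.CycleError`, which at least flags an installation that no single install order can reproduce.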

If the module installation order matters, I think that might rule out creating a Task::* distribution, which can't guarantee the installation order. A Bundle::* might be able to do it though.

So, you think that's the end of it? Think about configure_requires and build_requires. Anything those need has to be in the MiniCPAN too, even if it isn't in the installation. You have the option of not permanently installing those modules, so you might not see them in the analysis. Even when I get a list of distributions, I then have to check their dependencies to see if there's anything extra I need to add.
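The dependency check amounts to taking a closure over the prerequisites of everything already on the list. The sketch below assumes the prerequisites are available as a plain mapping from distribution to phase to dependencies. That layout and every distribution name except Foo-Bar are hypothetical. Runtime `requires` is included as well, since a build-time tool has runtime needs of its own.

```python
def with_prereqs(found_dists, prereqs,
                 phases=("configure_requires", "build_requires", "requires")):
    """Return found_dists plus everything needed to configure, build, and run them."""
    needed = set(found_dists)
    queue = list(found_dists)
    while queue:
        dist = queue.pop()
        for phase in phases:
            for dep in prereqs.get(dist, {}).get(phase, ()):
                if dep not in needed:
                    needed.add(dep)
                    queue.append(dep)  # a new dependency may have its own
    return needed


# Hypothetical prerequisite data: distribution -> phase -> dependencies.
prereqs = {
    "Foo-Bar": {"build_requires": ["Test-Thing"]},
    "Test-Thing": {"configure_requires": ["Builder-Thing"]},
}

minicpan_dists = with_prereqs(["Foo-Bar"], prereqs)
```

Neither Test-Thing nor Builder-Thing would appear in an analysis of the installed files, yet both have to be in the MiniCPAN for Foo-Bar to install.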

So, not so bad.


I'm guessing MyCPAN::Indexer and MyCPAN::App::DPAN are the modules this all refers to (just in case anyone else wants the links).

About brian d foy

I'm the author of Mastering Perl, and the co-author of Learning Perl (6th Edition), Intermediate Perl, Programming Perl (4th Edition) and Effective Perl Programming (2nd Edition).