gitPAN

schwern on 2009-12-03T23:55:03

If you're like me, and I know I am, you've often wondered things about other people's CPAN modules like: what changed in this release; when did this bug/feature get introduced; where's that old version that got deleted off CPAN?

search.cpan.org provides some web tools, which is very cool, but pointy-clickies only go so far. What you really want is a repository of releases.

Sometimes you can find the project's repository, usually by digging through the documentation. Projects are now starting to use the repository resource in their metadata, and search.cpan.org links to it, so things have gotten a little better. And maybe it's complete, or maybe the history cuts off where the last maintainer took over. And maybe they've tagged their releases in some sane way.

Wouldn't it be nice if every CPAN distribution had a repository of all their releases, all tagged the same way? The idea has been kicking around for a while. Eric Wilhelm took a stab at it with Subversion, but it's less than trivial to get a useful history out of a pile of tarballs with SVN. And then where do you host it? Turns out git makes this process trivial. You delete all the files, unpack the new release, and commit it all. Git figures out what moved, what got deleted, what got added, etc. So that's one part solved.
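To make the delete, unpack, commit cycle concrete, here is a minimal Python sketch. It is not the code gitPAN actually runs. The function names `replace_worktree` and `commit_release` are made up for illustration, and it assumes each tarball wraps its contents in a single top-level directory such as Dist-Name-1.23/, which is typical for CPAN releases but not guaranteed.

```python
import shutil
import subprocess
import tarfile
from pathlib import Path


def replace_worktree(repo_dir, tarball):
    """Delete everything in repo_dir except .git, then unpack the release there.

    Hypothetical helper: assumes the tarball has one top-level directory.
    """
    repo = Path(repo_dir)
    for entry in repo.iterdir():
        if entry.name == ".git":
            continue  # keep the history, drop everything else
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    with tarfile.open(tarball) as tar:
        for member in tar.getmembers():
            # Strip the leading "Dist-Name-1.23/" component.
            parts = Path(member.name).parts[1:]
            if not parts or ".." in parts:
                continue
            if not (member.isfile() or member.isdir()):
                continue  # skip links and device nodes in this sketch
            member.name = str(Path(*parts))
            tar.extract(member, repo)


def commit_release(repo_dir, version):
    """Stage every change (adds, deletes, moves), commit it, and tag it."""
    def git(*args):
        subprocess.run(["git", *args], cwd=repo_dir, check=True)

    git("add", "--all")
    git("commit", "--allow-empty", "-m", f"Import release {version}")
    git("tag", version)
```

Calling `replace_worktree` and then `commit_release` once per release, oldest first, would build up the history one commit per release. Because the working tree is wiped each time, files dropped from a release show up as deletions without any extra bookkeeping.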

Then brian d foy has been working on indexing BackPAN. Leon made a module to access this index, Parse::BACKPAN::Packages. That's another part solved.

Yanick developed a pile of tools to make turning a CPAN distribution into a git repository easy, including one to import all the releases from BackPAN. Put a loop around that and call it done.

Finally, hosting. I'm never one to DIY system administration, so plop it on github. Their APIs make creating repositories trivial and their web site provides far more functionality than I'd ever want to maintain. And, perversely, once I get tagging working you can download tarballs! Unfortunately BackPAN is about 20 gigs, and while the size of the resulting git repos is looking to be far smaller (projects with a lot of releases come out much smaller, projects with few releases come out a little larger), it still bleeds well over their 300M free account limit. Hopefully they'll be receptive to a little begging.

I give you gitPAN, a (soon to be) complete set of repositories for all of BackPAN. The process is fully automated, but I'm still tweaking things and the available repositories are sporadic. There are a lot of optimizations and small corrections that still need doing. My tweaked versions of Parse::BACKPAN::Packages and Git::CPAN::Import are available.

There are two open problems. First, I haven't even looked into how to keep the repositories up to date. There are some new indexes on BackPAN as part of the File::Rsync::Mirror::Recent mirroring optimization Andreas has been working on, which will probably prove useful. If code suddenly appeared to handle that, that would be great.

Second, I know of no historical index of authorized releases. This means gitPAN will just pull in everything on BackPAN, causing a slightly skewed history. If a solution to that appeared, that too would be great.

I don't have any clear idea of what this might be used for, nothing to justify its scale. But I figure make the data available and someone will do something awesome with it. "If you build it they will come." If nothing else it'll make patching easier: I've already started generating gitPAN repos for modules I'm about to patch and cloning them to work on. But hopefully this will be more than an extended yak shaving exercise.


You do now :)

barbie on 2009-12-04T00:03:51

Second, I know of no historical index of authorized releases

See the "CPAN Testers Upload Database Generator" section on the CPAN Testers Development site. You probably want the SQL database version :) (PS: it's updated hourly ;))

Re:You do now :)

schwern on 2009-12-06T00:11:17

Awesome! I'll see what I can glean from it.

Keeping up to date

drhyde on 2009-12-04T12:22:27

This is "easy". Simply mirror BackPAN using rsync, and then find all files newer than your last run that you haven't already gittified.
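As a rough sketch of that scan, assuming the rsync'd mirror lives in a local directory and that you keep a timestamp of the last run plus a set of paths already imported, something like the following would do. The function name `new_releases` and its parameters are hypothetical, and relying on file modification times assumes rsync preserves them on the mirror.

```python
import os


def new_releases(mirror_root, last_run, already_imported):
    """Return mirror-relative paths modified after last_run and not yet imported.

    last_run is a Unix timestamp; already_imported is a set of relative paths.
    """
    found = []
    for dirpath, _dirs, files in os.walk(mirror_root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, mirror_root)
            if rel in already_imported:
                continue  # already gittified
            if os.path.getmtime(path) > last_run:
                found.append(rel)
    return sorted(found)
```

Keeping the already-imported set alongside the timestamp guards against re-importing a file whose modification time is touched by a later sync.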

Re:Keeping up to date

schwern on 2009-12-04T23:41:43

The part of my brain still living in 1996 cringes at the thought. The part living in the future remembers "find" can do this in 20 seconds.