gitPAN and the PAUSE index

schwern on 2009-12-12T22:02:03

As you may or may not know, people on CPAN own modules (technically they own the namespace). Each Foo::Bar is owned by one or more CPAN accounts. Usually you gain ownership on a "first-come" basis, but it can also be transferred. Only the "official" tarball for a given namespace is indexed. So if the owner of Foo::Bar uploads Foo-Bar-1.23.tar.gz, Foo::Bar will point at Foo-Bar-1.23.tar.gz. If I (presumably unauthorized) upload Foo-Bar-1.24.tar.gz, the index will still point at Foo-Bar-1.23.tar.gz.
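To make the indexing rule concrete, here is a rough sketch in Python. It is not how PAUSE is actually implemented (the real indexer also compares version numbers, among other things); the function name, the data layout and the author IDs are all made up for illustration.

```python
def build_index(uploads, owners=None):
    """Toy model of the module index: module name -> tarball.

    `uploads` is a list of (author, tarball, modules) tuples in upload order.
    `owners` maps a module name to the set of authors who own it.
    Both structures are assumptions for this sketch, not PAUSE's real format.
    """
    # Copy so the caller's ownership table is not modified.
    owners = {module: set(authors) for module, authors in (owners or {}).items()}
    index = {}
    for author, tarball, modules in uploads:
        for module in modules:
            # First-come: an unclaimed namespace goes to whoever uploads first.
            module_owners = owners.setdefault(module, {author})
            # Only uploads from an owner move the index entry.
            if author in module_owners:
                index[module] = tarball
    return index


# Hypothetical author IDs; the tarball names are the ones from the text above.
uploads = [
    ("OWNER", "Foo-Bar-1.23.tar.gz", ["Foo::Bar"]),
    ("SCHWERN", "Foo-Bar-1.24.tar.gz", ["Foo::Bar"]),
]
index = build_index(uploads)
print(index["Foo::Bar"])  # Foo-Bar-1.23.tar.gz
```

The unauthorized 1.24 upload is accepted as an upload but never becomes the indexed tarball for Foo::Bar.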

Here's the rub. Not owning a module doesn't stop you from uploading. It also says nothing about who owns the distribution. gitpan is by distribution. Now it gets a little more difficult to figure out who owns what. For example, look at MQSeries-1.30. All but two modules are unauthorized. BUT notice that MQSeries.pm is authorized. The CPAN index does point MQSeries at M/MQ/MQSERIES/MQSeries-1.30.tar.gz (everything else is at 1.29). Likely what we have here is a botched ownership transfer.

How do you mark that? search.cpan.org seems to take the strict approach: if anything's unauthorized, it's out. The CPAN uploads database I have available is the opposite: if anything is authorized, it's in. What to do?
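The two policies boil down to "all" versus "any" over the modules in a tarball. A minimal sketch, assuming you already have an ownership table; the function names, the submodule name and the second author ID are placeholders, not real PAUSE data:

```python
def is_authorized(author, module, owners):
    """True if `author` owns `module` according to the `owners` table."""
    return author in owners.get(module, set())


def include_strict(author, modules, owners):
    """Strict policy: every module in the release must be authorized."""
    return all(is_authorized(author, m, owners) for m in modules)


def include_lenient(author, modules, owners):
    """Lenient policy: one authorized module is enough."""
    return any(is_authorized(author, m, owners) for m in modules)


# Illustrative ownership table shaped like the MQSeries case: the top-level
# module belongs to the uploader, a submodule still belongs to someone else.
owners = {
    "MQSeries": {"MQSERIES"},
    "MQSeries::Queue": {"SOMEONE_ELSE"},
}
release_modules = ["MQSeries", "MQSeries::Queue"]

print(include_strict("MQSERIES", release_modules, owners))   # False
print(include_lenient("MQSERIES", release_modules, owners))  # True
```

A release like MQSeries-1.30 is exactly the case where the two policies disagree.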

Then there's stuff like lcwa. Looks like junk, but here's the thing. CPAN has a global module index to worry about, gitpan doesn't. Each distribution is its own distinct unit. So lcwa does no harm on gitpan, it can be recorded.

What does matter? The continuity of a distribution's releases, and this is precisely what CPAN does not track. It doesn't even have a concept of a distribution, just modules inside tarballs. CPAN authors playing nice with tarball naming conventions gives the illusion of a continuous distribution.

So... for a given release of a distribution (i.e. a tarball), how does gitpan determine if the release should be included in the distribution's history? If we go strict, like search.cpan.org, we're going to lose legit releases and even entire distributions (like lcwa). If we let anything in, gitpan is not showing an accurate history.

Add the complication that authorization changes. For example, the MQSeries module ownership will eventually be fixed. What then?

First pass through, gitpan is ignoring this problem. It's just chucking everything from BackPAN in. Second pass will rebuild individual repos with collected improvements. This is the first thing I'm not sure what to do about.

Suggestions?


Historical PAUSE permission records?

Aristotle on 2009-12-13T10:37:43

It seems to me historical PAUSE permission records are what you need. Given such, along with BackPAN, you could reconstruct the permissions for any given package at any point, and go from there.
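As a sketch of what that reconstruction could look like, suppose the records were available as a log of timestamped grant/revoke events. That format is purely an assumption (I don't know what PAUSE actually keeps), and the names and dates below are invented:

```python
from datetime import datetime


def owners_at(events, module, when):
    """Replay a permission log and return the owners of `module` at `when`.

    `events` is a list of (timestamp, module, author, action) tuples, where
    action is "grant" or "revoke". This log format is hypothetical.
    """
    owners = set()
    for timestamp, event_module, author, action in sorted(events):
        if timestamp > when:
            break  # everything after this point is in the future
        if event_module != module:
            continue
        if action == "grant":
            owners.add(author)
        elif action == "revoke":
            owners.discard(author)
    return owners


# Invented history: ownership of Foo::Bar moves from ALICE to BOB in 2008.
events = [
    (datetime(2005, 1, 1), "Foo::Bar", "ALICE", "grant"),
    (datetime(2008, 6, 1), "Foo::Bar", "BOB", "grant"),
    (datetime(2008, 6, 1), "Foo::Bar", "ALICE", "revoke"),
]

# Was BOB authorized for a release he uploaded in 2007? In 2009?
print("BOB" in owners_at(events, "Foo::Bar", datetime(2007, 3, 1)))  # False
print("BOB" in owners_at(events, "Foo::Bar", datetime(2009, 3, 1)))  # True
```

Pair each BackPAN tarball's upload time with a lookup like this and you get the authorization status as it was at the time of release, rather than as it is today.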

It might be conceivable to map authors to branches in the repos in some way if that makes sense.

The question is: what does it mean if I upload an unauthorised release 1.2 of a module whose version 1.1 is on CPAN and authorised, and then that author releases 1.5? Did he use the changes from my 1.2, i.e. is it one continuous lineage? Or did he ignore me completely, meaning I’m on a side branch by myself? Basically the question boils down to: if there are parallel branches, where are the merge points?

If all authors kept proper changelogs then you could figure that out; otherwise, humans looking at diffs can probably infer the right thing in many cases as well. But neither method seems amenable to machines.