The Last Great Repository Problem

Alias on 2007-01-21T10:32:19

I get bored easily. To fight this boredom, I find myself attracted to grand problems. The sort of problems so wide in scope they can hold off boredom for years.

Hence my heavy involvement in the CPAN.

CPAN is a hugely interesting thingy to play with, and I've enjoyed fixing some of its problems, or at least coming up with solutions for them that work (not all of which have been implemented yet).

Of the remaining CPAN internal problems, most are now just optimisation and scaling issues. We have solutions, they just need to be pushed in different directions. For example, the index problem (which Debian has as well) where the index is starting to get annoyingly large and downloading it every time is getting more painful.

The core of the CPAN however, which is essentially an implementation of a "3-uniform complete infinite directed acyclic hypergraph with property rights" works well, and is as sophisticated as it needs to be.

PITA, toolchain boot-strapping, mirror auto-detection, getting rid of FTP transports and so on, are just rounding out the capability of this implementation to make it more robust and reliable.

But there remains one last great unsolved problem.

And this is what, from Perl's perspective, could be called the "external dependency problem".

That is, we can map dependencies on other Perl modules and fulfill them just fine, but the CPAN client can't install dependencies written in other languages.

With the increasing likelihood of large-scale cross-language dependencies in the future (and the existing problem we have already of dependencies on C libraries) this is going to become a bigger and bigger problem.

After a lot of discussions with various folks at Linux.Conf.Au from several languages and Linux distributions, I think we might finally have a first approximation of a solution to this problem.

Now, depending on the approach there are really only two viable solutions to the dependency problem.

Firstly, to make the installers of every language co-operate. I'd believed this to be nearly insurmountable, but happily Perl hacker and Debian packager Angus Lees from Google's Dublin admin group came up with an improvement on my method for attaining installer interoperability.

However, the concept in general still has problems, not least the vastly differing underlying methodologies for different installers, not to mention platform issues.

Some languages like Python can't handle platform-specific variations in dependencies very well, nor dependencies on language versions (although these are being planned for the third phase of their repository efforts, so who knows).

The second, and preferable, option (for now at least) is to instead attack the problem by integrating all source repositories more tightly with the downstream binary repositories of the various operating system distributions.

Specifically, we need to give people like Debian, Red Hat et al. the ability to automate the mass-production of binary packages.

Ironically, they will then still need to solve the problem of interoperating multiple installers, but at least they only have to make it work ONCE, at packaging time, and they have much better options for falling back on humans to deal with corner cases than if we had to integrate source installation.

To achieve this, the best option for starting to deal with this seems to be the creation of a universal grammar for describing dependencies between arbitrary software packages.

While not necessarily used natively by each source repository (although languages with weak packaging like OCaml might adopt it natively anyway) this would be a secondary format that the installers/packagers for each source repository could emit on request or as needed.

The downstream binary auto-packager could then parse the metadata grammar, establish if those packages exist already as binary packages, and map the deps to the native binary packages appropriately.

In the case they aren't available, the binary auto-packager could then abort the process, and recurse to try and binary-package the dependencies first.

This wouldn't necessarily let us package ALL of CPAN for the downstream repositories, but it should certainly let us take an application like Jifty or Plagger and bulk-create any packages that don't already exist, crossing language boundaries as needed.
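
As a rough sketch of that recursion, here is how an auto-packager might work out what to build, and in which order. Everything in it is an assumption made for illustration: the function name, the shape of the dependency map, and the package names are all made up, and a real packager would read the dependencies from the metadata grammar rather than from a hard-coded dictionary.

```python
# Hypothetical sketch: every name here is invented for illustration.

def package_order(target, deps, already_packaged):
    """Return the order in which missing binary packages would be built.

    deps maps a source package name to the names it depends on.
    already_packaged is the set of names the binary repository already has.
    """
    order = []
    in_progress = set()

    def visit(name):
        if name in already_packaged or name in order:
            return  # a binary package exists or is already planned
        if name in in_progress:
            raise ValueError(f"dependency cycle involving {name}")
        in_progress.add(name)
        for dep in deps.get(name, []):
            visit(dep)  # recurse: package the dependencies first
        in_progress.discard(name)
        order.append(name)

    visit(target)
    return order


# Made-up example: a Perl application that needs a Perl library and a C library.
deps = {
    "App-Example": ["Lib-Foo", "libbar"],
    "Lib-Foo": ["Lib-Baz"],
}
print(package_order("App-Example", deps, already_packaged={"libbar"}))
# prints ['Lib-Baz', 'Lib-Foo', 'App-Example']
```

Dependencies that already exist downstream (here the C library) are simply skipped, which is where the mapping to native binary packages would happen.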

And with all upstream repositories using a common format, we can supply not just Debian, but ALL downstream binary repositories far more easily, and ultimately, regular users will be installing from binary packages.

Except that now, they can actually get a useful module coverage, instead of the current deplorable situation.

As a side effect, in future we will have existing code available from these auto-packagers for doing multi-repository recursion, which we can then apply to building a more difficult multi-language source installer as well.

Andrae Muys from the Mulgara project (the most scalable RDF database currently available) was kind enough to stay behind after LCA and spend Saturday with me, fleshing out a first cut at an RDF grammar for this metadata format (initially intended as a companion vocabulary to the highly-adopted DOAP).

DOAP is kinda neat, but WAY too oversimplified. For something with such wide adoption that claims to be aimed at package metadata for open source projects, you'd think the DOAP people would have put in a call to CPAN at some point.

The first goal is going to be some sort of proof of concept joining the two richest and most developed source and binary repositories together. That is, to join CPAN to Debian.

The second goal will be to repeat the process with two vastly different repositories, probably something like OCaml or Erlang on the source side, and something like Fedora on the binary side. So two source and two binary repositories all connected together.

If we have something suitable for handling those, we can then take the format to the wider community. Linux Australia seems interested in some sort of "Packaging Summit" to achieve this.

And then with another year or so of work, we can hopefully round the language out into something suitable via a standards-like process, and come up with a final grammar.

If we're successful, this should mean 10,000+ additional modules for downstream distributions, and finally give us a real chance of having "all" (for some definition) of CPAN on your operating system of choice.


The Leibniz Problem

chromatic on 2007-01-21T18:15:45

the best option for starting to deal with this seems to be the creation of a universal grammar...

Wow. Good luck.

Different Degrees of Dependencies &Other Quest

DAxelrod on 2007-01-22T18:08:49

This is very, very shiny. Thank you.

How are you planning to deal with the fact that different repositories have different semantics for describing dependencies? (For example, not all CPAN modules make the distinction between dependencies needed at install time and runtime).

What assumptions are you making about a CPAN module distribution? (I started an effort to try to document exactly how a distribution was made up, but had to leave it for other things a few months ago, looks like I should start working on it again.)

What assumptions are you making about a module's build system?

How are you planning to deal with module tests?

Do you have a website/mailing list/hive mind where you're coordinating all of this?

Re:Different Degrees of Dependencies &Other Qu

Alias on 2007-01-23T11:24:40

> How are you planning to deal with the fact that different repositories
> have different semantics for describing dependencies? (For example, not
> all CPAN modules make the distinction between dependencies needed at
> install time and runtime).

Some of this is still unsolved.

Currently I think there will probably be a built-in vocabulary for dependencies that covers 5 phases (config, build, test, install, runtime), some concept of a resource with various subtypes (package/distribution, class/interface, etc.), and various version mappings (CPAN's "equal or greater" or "only", etc.).
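
To make that a little more concrete, here is one way such a vocabulary might be modelled. This is only a sketch: the class names, the two resource subtypes shown, and the tuple-based version comparison are all assumptions for illustration, not part of any actual draft of the grammar.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """The five phases a dependency can apply to."""
    CONFIG = "config"
    BUILD = "build"
    TEST = "test"
    INSTALL = "install"
    RUNTIME = "runtime"


class ResourceType(Enum):
    """Two illustrative resource subtypes (a real vocabulary would have more)."""
    PACKAGE = "package/distribution"
    CLASS = "class/interface"


class VersionRule(Enum):
    """Two illustrative version mappings."""
    AT_LEAST = "equal or greater"
    ONLY = "only"


@dataclass(frozen=True)
class Dependency:
    phase: Phase
    resource_type: ResourceType
    name: str
    rule: VersionRule
    version: tuple  # assumed encoding, e.g. (1, 2) for version 1.2

    def accepts(self, candidate):
        """Return True if the candidate version satisfies this dependency."""
        if self.rule is VersionRule.ONLY:
            return candidate == self.version
        return candidate >= self.version


def needed_for(dependencies, phase):
    """Select the dependencies that apply to one phase."""
    return [d for d in dependencies if d.phase is phase]


# Made-up example: a build-time need for "Example-Dist" 1.2 or later.
dep = Dependency(Phase.BUILD, ResourceType.PACKAGE, "Example-Dist",
                 VersionRule.AT_LEAST, (1, 2))
print(dep.accepts((1, 10)))  # prints True
print(dep.accepts((1, 1)))   # prints False
```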

> What assumptions are you making about a CPAN module distribution?
> (I started an effort to try to document exactly how a distribution
> was made up, but had to leave it for other things a few months ago,
> looks like I should start working on it again.)

As few as possible at the moment. Andrae understands the process of developing a vocabulary better than I do.

> What assumptions are you making about a module's build system?

For the moment we are assuming that for a single source package, you go through the following steps.

1. Configure
2. Build
3. Test
4. Install

This seems to be fairly consistent across most languages.
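
A minimal sketch of that assumption, with the four steps run strictly in order and the process stopping at the first one that fails. The function name and the idea of passing each step in as a callable are purely illustrative.

```python
# Hypothetical sketch: the step names come from the list above,
# everything else is invented for illustration.
STEPS = ("configure", "build", "test", "install")


def run_steps(actions):
    """Run each step in order and stop at the first failure.

    actions maps a step name to a callable returning True on success.
    Returns the list of steps that completed successfully.
    """
    completed = []
    for step in STEPS:
        if not actions[step]():
            break  # later steps are never attempted
        completed.append(step)
    return completed


# Made-up example: a package whose tests fail is never installed.
actions = {
    "configure": lambda: True,
    "build": lambda: True,
    "test": lambda: False,
    "install": lambda: True,
}
print(run_steps(actions))  # prints ['configure', 'build']
```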

> How are you planning to deal with module tests?

We plan to run them :)

More specifically, they aren't likely to be referred to in the grammar, as they aren't relevant beyond any test-time dependencies.

> Do you have a website/mailing list/hive mind where you're
> coordinating all of this?

Nope. This is very very early days at the moment, and I plan to move slowly.

There are, however, some half-finished doodlings sitting at

http://svn.phase-n.com/svn/cpan/trunk/PIG/

And there's probably some digital camera shots of whiteboards around somewhere.

I don't plan to put too high a priority on this project for now, as it's about 90% interpersonal communication and not really much to do with coding. And projects that involve communication tend to take a while.

I'd expect this to take in the 2-3 year range to get to a sufficiently acceptable 1.0 release of the grammar (assuming the concept holds).