CPAN::Index 0.01: From CPAN to JSAN and back again

Alias on 2006-09-26T06:29:45

CPAN (or rather the entire code ecosystem that sits above the Perl 5 language surface and just happens to include "CPAN", the group of services which happens to include the FTP site that is the actual "The CPAN") has some issues.

It always seems to be having issues. Nothing breeds issues like success.

These issues are often the basis of people wanting to do the Next Big Thing and throw the whole CPAN design away and start again. But apart from the new demands coming from the Perl 6 language surface (like multi-versioning) CPAN has more or less got things right.

There are two big issues that CPAN has that we don't have working solutions for yet. Namely large-scale automated testing, and cross-language external installer integration (i.e. being able to use apt or rpm or ports or PPM or whatever reliably).

Neither of these things requires a CPAN rewrite. I'm working on the first one myself, and others are starting to look more at the second (although I don't think we have any solid solutions yet).

There's also a host of small issues lingering here and there, that can be fixed, but just haven't seen enough attention yet.

Things like mirror bootstrapping and auto-select, inter-service data sharing (index accessibility, metrics, statistics, etc), and a number of smaller QA-related issues (which the perl-qa@perl.org list is gradually working its way through).

When we created the JavaScript Archive Network ( openjsan.org ) one of the things I really wanted to do was to take my experiences with the CPAN, and what I considered some of its small issues to be, and use JSAN as a "do over". In the process of making JSAN, I had a chance to try out some ideas I've wanted to apply to the CPAN, but without the legacy issues.

Some of these worked very well. Getting rid of FTP with its varying implementations and moving to only HTTP provided a much better transport layer.

Local storage was implemented as a mirror of a logical subset of the remote mirrors (much like a half-filled minicpan), with builds done in a separate directory.

The JSAN metadata was multi-indexed, so you could pull it as YAML or a SQLite database file (and potentially other formats). And the mirrors had a dedicated small mirror.conf file which had the sole job of allowing the client to reliably identify a JSAN mirror, tell its age, and then work out how fast it was (although that part still needs work).
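To make the mirror.conf idea a bit more concrete, here is a minimal sketch of the "identify the mirror and tell its age" half of the job. It is written in Python only so that it runs with nothing beyond a standard library, and the file format is an assumption: the key names (name, timestamp, master) and the simple "key: value" layout are made up for illustration and are not the real JSAN format.

```python
import time


def parse_mirror_conf(text):
    """Parse 'key: value' lines into a dict (hypothetical format)."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values such as URLs survive.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def mirror_age(conf, now=None):
    """Return seconds since the mirror last synced, or None if the
    file does not look like a mirror.conf at all.

    Assumes a 'name' key identifying the network and a 'timestamp'
    key holding the last sync time as Unix epoch seconds.
    """
    if conf.get("name") != "jsan" or "timestamp" not in conf:
        return None
    try:
        synced = int(conf["timestamp"])
    except ValueError:
        return None
    if now is None:
        now = time.time()
    return now - synced
```

A client could fetch this one small file from each candidate mirror, throw away anything where mirror_age() comes back as None, and time the fetch itself as a first rough guess at mirror speed.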

Of course, the JSAN metadata is statically resolvable and much simpler than the CPAN metadata. The problem on CPAN is that in order to do anything much you need a whole host of different functionality implemented, and it's very easy when doing one apparently simple task to end up needing a ton of logic.

This logic isn't always easy to access, because a lot of the pieces of the CPAN ecosystem have been custom-written for specific tasks, and so aren't easily reusable.

For example, minicpan started as a script for a single task, then evolved to be more and more formalised, and then subclasses appeared, and so forth. But it retains its "Verb" heritage. Minicpan doesn't use any generic CPAN objects, it just follows its own process.

I've just uploaded CPAN::Cache 0.02 and CPAN::Index 0.01.

CPAN::Cache is a backport from JSAN of a class to implement a locally-stored logical-subset of a remote mirror. It has some custom functionality to know which files will never change and which might, and encapsulates it all in a simple API.
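As a rough illustration of the idea (and not the actual CPAN::Cache API), here is a minimal Python sketch of a local partial mirror. The class name, the fetch callable and the is_static() rule are all hypothetical. In particular, the rule that anything under authors/id/ other than a CHECKSUMS file never changes is my simplifying assumption for the sketch.

```python
import os


class MirrorCache:
    """A local, partial copy of a remote mirror (illustrative only)."""

    def __init__(self, local_dir, fetch):
        self.local_dir = local_dir
        # fetch is any callable taking a mirror-relative path and
        # returning the file content as bytes (e.g. an HTTP GET).
        self.fetch = fetch

    @staticmethod
    def is_static(path):
        # Assumption: uploaded release files never change once they
        # exist, while index and CHECKSUMS files may change any time.
        return path.startswith("authors/id/") and not path.endswith("CHECKSUMS")

    def get(self, path):
        """Return the local filename for path, fetching it if needed."""
        local = os.path.join(self.local_dir, *path.split("/"))
        if self.is_static(path) and os.path.exists(local):
            return local  # immutable and already cached: no network
        data = self.fetch(path)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as handle:
            handle.write(data)
        return local
```

The point of the split is that a caller only ever asks for a path and gets back a local file. Whether that meant a download is the cache's business.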

CPAN::Index provides programmatic access to the CPAN metadata. It stores the data in a local SQLite database and wraps a DBIx::Class layer over the top of it, does inflation to various object forms, and so on.
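To give a feel for the shape of this, here is a small sketch using Python's sqlite3 module as a stand-in for the SQLite plus DBIx::Class stack. The table and column names are assumptions for illustration, not the real CPAN::Index schema, and the namedtuple is only a loose analogue of inflating rows into objects.

```python
import sqlite3
from collections import namedtuple

# A lightweight object form for one row of the package table.
Package = namedtuple("Package", "name version path")


def open_index(db_path=":memory:"):
    """Open (or create) an index database with a hypothetical schema
    covering only authors and packages."""
    db = sqlite3.connect(db_path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS author (
            id    TEXT PRIMARY KEY,
            name  TEXT,
            email TEXT
        );
        CREATE TABLE IF NOT EXISTS package (
            name    TEXT PRIMARY KEY,
            version TEXT,
            path    TEXT
        );
        """
    )
    return db


def find_package(db, name):
    """Look up one package by name, or return None if it is unknown."""
    row = db.execute(
        "SELECT name, version, path FROM package WHERE name = ?", (name,)
    ).fetchone()
    return Package(*row) if row else None
```

Because each lookup is a query against the database file, only the rows you ask for ever get pulled into the process.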

When the index needs files, it uses a CPAN::Cache to get them.

Now all this sounds pretty heavy, but the dependencies this stack uses are all platform-neutral, so installation should be a breeze. And because it doesn't keep the index in memory, on my box it only uses about 18meg of RAM, compared to over 50 for CPAN.pm and up to a hundred for CPANPLUS (although I'm told that has been reduced now).

Dependencies are only a problem when 1) they fail to install, or 2) they are part of the toolchain bootstrap sequence.

And let me be clear that CPAN::Index and friends are NOT intended to be replacements for CPAN or CPANPLUS. They are for use in all the other side tasks that could benefit from a richer implementation.

And more importantly, its memory use is constant with regard to the index size, so the problem of memory use just goes away.

This first release has very little actual data in the actual schema, only authors and packages. It's more to help me get into the release rhythm and let other people have a squiz at the code.

I plan to expand this schema first to modules (and replicating the module name -> dist path logic), and then outwards again to merge in things like the CPAN Testers data, CPANTS, RT, CPAN Ratings and so on.

The functionality in CPAN::Index::Loader pulls the data from wherever, parses it, and then repopulates the database.
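The pull, parse and repopulate cycle might look something like the sketch below. Again this is Python standing in for the Perl, and it is not the real CPAN::Index::Loader code. It assumes the two index files have already been fetched and decompressed into strings, that the author index uses the usual 01mailrc.txt "alias ID "Name <email>"" lines, and that the package index is the usual 02packages.details.txt layout of a header block, a blank line, then three whitespace-separated columns. The function and table names are made up.

```python
import re
import sqlite3

# One author per line: alias PAUSEID "Full Name <email>"
MAILRC_LINE = re.compile(r'^alias\s+(\S+)\s+"(.*?)\s*<(.*?)>"\s*$')


def parse_mailrc(text):
    """Yield (id, name, email) for each well-formed alias line."""
    for line in text.splitlines():
        match = MAILRC_LINE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def parse_packages(text):
    """Yield (package, version, path) from the body of the index."""
    # Everything before the first blank line is header metadata.
    _header, _sep, body = text.partition("\n\n")
    for line in body.splitlines():
        fields = line.split()
        if len(fields) == 3:
            yield tuple(fields)


def reload_index(db, mailrc_text, packages_text):
    """Throw away the old rows and repopulate from freshly parsed data."""
    with db:  # commit on success, roll back on any error
        db.execute("CREATE TABLE IF NOT EXISTS author "
                   "(id TEXT PRIMARY KEY, name TEXT, email TEXT)")
        db.execute("CREATE TABLE IF NOT EXISTS package "
                   "(name TEXT PRIMARY KEY, version TEXT, path TEXT)")
        db.execute("DELETE FROM author")
        db.execute("DELETE FROM package")
        db.executemany("INSERT INTO author VALUES (?, ?, ?)",
                       parse_mailrc(mailrc_text))
        db.executemany("INSERT INTO package VALUES (?, ?, ?)",
                       parse_packages(packages_text))
```

Feeding generators straight into executemany() means the parsed index never has to sit in memory as one big list, which fits the constant-memory goal.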

What I'm heading towards is one API that wraps over all the metadata from all the different services and places that hold fragments of the canonical data.

One API for accessing all the data of "CPAN".

This can then be reused any time it would be nice to have access to that data, in places like Module::Collection (my current need), for separating the CPAN dependencies from third-party ones in a random group of distributions, and so on.
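The dependency-separating case is about the simplest possible consumer of such an index. A minimal Python sketch, where the function name and arguments are hypothetical and the list of indexed package names is assumed to come from whatever index you have to hand:

```python
def split_dependencies(required, indexed_packages):
    """Split required module names into those the CPAN index knows
    about and those that must come from somewhere else."""
    indexed = set(indexed_packages)
    on_cpan = sorted(name for name in required if name in indexed)
    third_party = sorted(name for name in required if name not in indexed)
    return on_cpan, third_party
```

With a real index behind it, the second argument would be a query over the package table and not an in-memory list.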

Name for the CPAN ecosystem

DAxelrod on 2006-09-26T22:14:10

the entire code ecosystem that sits above the Perl 5 language surface and just happens to include "CPAN"

We need a name for this. Would CPAN Ecosystem do?

Re:Name for the CPAN ecosystem

Alias on 2006-09-26T23:39:31

CPAN Ecosystem probably doesn't cover it accurately enough. Perl 5 Ecosystem might cover it more accurately, but it's really Larry's call. I don't even know if he has a term for it.

I've generally found the surface concept to work well, even up to conversations with Audrey and Larry.

The problem with naming is that it would need to include things like the Perl core, or rather the non-essential modules like File::Spec that are dual-lifed.

It includes Bundles, and the CPAN client itself, and Inline, and a hundred other bits and pieces that are distinct works in their own right, and not necessarily part of the CPAN, but which form distinct elements of the richness.

But I think there is quite possibly a need for a distinction between The CPAN that people like aevil talk about (just the FTP and mirrors and such) and what the layman perceives as CPAN, which includes rt.cpan and search.cpan and the dozen other services.

Perhaps "CPAN Cloud" or some such? :/