The Zen of Comprehensive Archive Networks

hfb on 2002-11-12T16:13:55

jhi writes "It seems that there is a lot of interest in having similar archives for other languages like CPAN [1] is for Perl. I should know; over the years people from at least Python, Ruby, and Java communities have approached me or other core CPAN people to ask basically 'How did we do it?'. Very recently I've seen even more interest from some people in the Perl community wanting to actively reach out a helping hand to other communities. This 'missive' tries to describe my thinking and help people wanting to build their own CANs. Since I hope this message will somehow end up reaching the other language communities I will explicitly include URLs that are (hopefully) obvious to Perl people."

[1] http://www.cpan.org/

I'll start negatively and end with hopefully more constructive notes; the latter, however, will build on the denials.

In the following, Mumble and mumble stand for any language other than Perl, or any combination of languages other than Perl.

First, the negative statements.

CPAN shall not 'piggyback' other languages. (There shall not be a mumble/ top level directory.)
- Rationale: CPAN is CPAN is CPAN. CPAN carries Perl. This implies all kinds of different contracts, explicit and implicit.
- Some people in the Mumble community will take offense to CPAN carrying Mumble.
- Some people in the Perl community will take offense to CPAN carrying Mumble.
- Some CPAN mirrors will take offense to suddenly having to carry also Mumble.
- Some CPAN mirrors will become resource (bandwidth, disk) constrained after having to suddenly carry also Mumble.
CPAN cannot 'piggyback' other languages.
- The building blocks or 'plumbing' of CPAN (the basic directory structure, the PAUSE) is a reasonably good match for Perl. I'm not so certain that it is for all the other languages.

Now, on to the hopefully more constructive suggestions.

First and foremost-- I'm not against other language communities having a CPAN. I would love to have such archives. I'm willing to help the other language communities. I'm only against too straightforward "let's just slap it on to the CPAN" solutions to the problem. Other languages are not like Perl, they are different, to a smaller or larger degree. Let's allow them their own degree of dignity and careful thought.

Then on to the technical questions, a.k.a. "How did you do it?" Well, people always ask me that and I go speechless... "Errrr, ummm, I kind of pulled all this stuff together and organized it a bit, and put it on an FTP server". After this a brooding silence always falls... "And...?" ... "And what?" ... "That's it?" "That's it."

Well, that's not really it, of course. The above is how CPAN started. How it grew is another story. First, Larry designed Perl to grow by letting it have modules (in other words, namespaces). Then we had a couple of wise men (like Tim Bunce) who had the vision of good module naming guidelines. Finally, we had Andreas König who single-handedly wrote PAUSE [2], the module submission machinery, where Perl module authors can register, submit, and manage their submissions. This allowed for a rapid but still controlled growth of modules. Because of the growth, it finally became too arduous to know what was out there, and luckily Graham Barr's scratch to this itch became large enough to be published as search.cpan.org [3]. Later backPAN [4] was added by Andreas to hold all the old versions of submissions deleted by their authors; this ties back into simple basic things that the master server(s) must have, like good backups. Last but not least, feedback for the modules can be given through the RT ticketing system [5] set up by Jesse Vincent.

[2] http://pause.cpan.org/ (or https://pause.cpan.org/)
[3] http://search.cpan.org/
[4] http://history.perl.org/backpan/
[5] http://rt.cpan.org/

CPAN mirrors [6], then? How did they come about? The original ones, a dozen or so, were easy: I just asked the maintainers of the original FTP sites where I had found the seeds of CPAN whether they might be interested in carrying this slightly bigger amalgamated Perl archive. Well, they foolishly agreed... I have to remind people once again that CPAN was conceived as an FTP archive. Not a website. And it still is that way. search.cpan.org just gives a nice interface. I'm sorry but I'm a dry CS engineer, not a graphic designer. Information, not animation.

[6] http://mirrors.cpan.org/

Oh, back to the CPAN mirrors. After the original ones, we grew slowly for a while, by word of mouth in the Perl community. However, since this was before billions of dollars' worth of fiber had been dug into the ground, Internet connections were still a bit dodgy and spotty. Therefore I started doing two things: scanning FTP logs for sites that obviously were mirroring CPAN but were not registered mirrors, and looking for sites that were good representatives of their particular top-level domain, especially outside the big seven TLDs. This way I could track down where Perl was used and, by asking those sites to participate, push load away from the master site. Later I also filled in missing countries by going for sites like the sunsites, and other vendor/public funded sites that had a good chance of having good connectivity. Usually I could find a sympathetic soul, oftentimes a system administrator.

Summary of the mirror tirade: I went for sites that liked and/or used Perl. I have no way of knowing off-hand whether they would like Mumble. The mirrors are donating their network and storage capacity and some amount of their administrative time for the Perl community. If we would like to extend that in any way we would have to ask them, each of them individually.

You can learn more about CPAN's history from the Perl timeline [7]. Things didn't happen overnight.

[7] http://history.perl.org/PerlTimeline.html

A quite important thing for both the authors and the users is that the language must get the naming scheme of its modules right, or at least reasonably close. Perl's/CPAN's is far from perfect, but at least it was once designed, and it has been enhanced over the years as new needs have appeared. A good naming scheme allows hierarchical browsing, gives good hints for search engines (a good name is effectively a string of uniquely identifying keywords), and coordinates community efforts. Some sort of conflict resolution mechanism in case of competing and identically named implementations is important. Keeping all those guidelines well documented and all these processes public is important. One naming issue I think Perl 5 got wrong is that module namespaces are first-come-first-served: two or more different authors cannot have an identically named module. This may lead to unintentional or intentional squatting, which is not good for the community.

When designing your author/module/whatever hierarchy think scalability. We originally got it wrong by having all authors as subdirectories in one single directory which quickly became a bottleneck. (The solution to this was simply to 'hash' based on the leading two characters of the user ids.) Think also several different views to your data: by author, by module, by category, by date, by keywords, and so forth. Don't think only hierarchical views will be enough: you will need searching capabilities.
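The directory 'hashing' mentioned above could look roughly like the following. This is only a minimal sketch in Python, not the code CPAN runs: the function name author_dir and the default root "authors/id" are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def author_dir(user_id: str, root: str = "authors/id") -> PurePosixPath:
    """Return a sharded directory for an author id (hypothetical helper).

    The id is filed under its first character and then under its first
    two characters, so no single directory has to hold every author.
    """
    uid = user_id.strip().upper()
    if len(uid) < 2:
        raise ValueError(f"author id too short to shard: {user_id!r}")
    return PurePosixPath(root) / uid[0] / uid[:2] / uid


print(author_dir("jhi"))
```

The point of the design is that the fan-out grows with the number of distinct prefixes, not with the number of authors in one listing.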

Get your license policy clear from day one. No, day minus one. In this day and age it is very important that every piece of software gets clearly marked as to what license it carries. Build your module packaging tools so that they suggest, maybe even demand, that the author picks a license. This way both the users of modules and distributors of software wanting to include the module don't have to keep guessing.

Very much related to the licensing is of course commercial use: CPAN took the easy and clear policy of no commercial software of any kind, not even share/guilt/donateware would be allowed. We felt that any other policy would be open to nitpicking, or maybe even legal challenges, and as a volunteer ragtag group we had no time or other resources for any of that.

Security? Should you have PGP keys and triply-written-in-blood signatures? Maybe. Currently CPAN has only MD5 checksums-- but so far they have been enough. There are some ongoing projects that enable using PGP keys for verifying the origin of the software; but as always with PKI systems, bootstrapping the web of trust is hard, some say even not worth the trouble.
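For readers who have not done checksum checking by hand, here is a minimal sketch of the MD5 comparison in Python. The function names are made up for illustration and this is not how CPAN's own tools are written; where the expected checksum comes from is left to the caller.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """True if the file's MD5 equals the published hex digest."""
    return md5_of(path) == expected_hex.strip().lower()
```

A check like this catches a truncated or corrupted download; it says nothing about who produced the file, which is the job the PGP projects mentioned above are trying to take on.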

Code quality? Ratings/reviews? Moderation/metamoderation? "Approved" SDKs? These all are hotly debated subjects and will not be addressed here since the CPAN is and will stay an open and free forum, where the authors decide what they upload. Any further selection belongs to different fora.

The scripts that maintain the CPAN are dreadfully simple. They are just simple shell scripts that copy sites A, B, ..., Z to the CPAN master site at ftp.funet.fi, launched from cron. Many of them use Ye Olde Original mirror.pl, some of them are just rsyncs. No magic. I really don't have anything to give away, no magic bags full of powerful CPAN spells. The most complex script I think I have is the script that probes the mirror sites for uptodateness-- and even that is not rocket science, just multiplexing ftp and http downloads and comparing timestamps. If someone wants that code, they can have it.
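To show how little there is to it, the timestamp comparison at the heart of such a probe might be sketched as below. This is not the actual script: the function name, the two-day threshold, and the mirror host names are all invented, and the FTP/HTTP fetching of each mirror's timestamp is deliberately left out (the caller passes the timestamps in, or None for a site that could not be reached).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_mirrors(
    master_stamp: datetime,
    mirror_stamps: dict[str, Optional[datetime]],
    max_lag: timedelta = timedelta(days=2),  # assumed threshold
) -> dict[str, Optional[timedelta]]:
    """Map each problem mirror to its lag behind the master.

    A value of None means no timestamp could be fetched from that site.
    """
    stale: dict[str, Optional[timedelta]] = {}
    for site, stamp in mirror_stamps.items():
        if stamp is None:
            stale[site] = None
        elif master_stamp - stamp > max_lag:
            stale[site] = master_stamp - stamp
    return stale


# Hypothetical data: one fresh mirror, one lagging, one unreachable.
master = datetime(2002, 11, 12, tzinfo=timezone.utc)
report = stale_mirrors(master, {
    "ftp.mirror-a.example": master - timedelta(hours=6),
    "ftp.mirror-b.example": master - timedelta(days=5),
    "ftp.mirror-c.example": None,
})
for site, lag in report.items():
    print(site, "unreachable" if lag is None else f"behind by {lag}")
```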

Andreas has the webserver code for PAUSE available online. That code is slightly more complex than the core CPAN scripts, or the scripts supporting the PAUSE; but even here, the code is there. No tricks up our sleeves.

There is no magic. All it takes is a few people who sit down and first get something running, a rough cut. Then iteratively enhance it. Perhaps the most demanding thing is commitment: someone must keep things running. A slowly decaying and dusty archive is almost worse (and certainly more sad) than no archive at all.

Oook and out.

Jarkko Hietaniemi, the CPAN Master Librarian

CPAN.pm

Dom2 on 2002-11-12T17:58:54

Don't forget CPAN.pm. That's the other end of the stick that lets people access the stuff inside CPAN in the most incredibly useful and lazy fashion.

I honestly feel that without CPAN.pm, CPAN as a whole would not be as popular as it is today. Look at something like HTML::Mason, which has a half dozen dependencies. CPAN.pm Just Takes Care Of It[tm].

-Dom

P.S. I know that CPAN.pm has many flaws, but it makes up for them by 1) being useful and 2) being installed by default with perl.

Re:CPAN.pm

jhi on 2002-11-12T18:35:16
Oops, a very good point, thanks. But since Andreas himself didn't notice that omission in his proofreading of the article I don't feel that bad about forgetting it... :-) But yes, an automatic network-aware installation mechanism, with the dependency resolving and checksum checking, is a very important step.

I guess I'll start maintaining the master copy of this article somewhere at CPAN, once the feedback settles down.
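For anyone building such a client for another language, the dependency resolving step boils down to ordering modules so that prerequisites come first. A minimal sketch in Python, assuming the dependency table has already been collected somehow; the module names are made up and this is not how CPAN.pm itself works:

```python
from graphlib import CycleError, TopologicalSorter


def install_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order modules so that every prerequisite precedes its dependents.

    `dependencies` maps a module name to the names it requires.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        raise ValueError(f"circular dependency: {err.args[1]}") from err


# Hypothetical dependency table for illustration only.
deps = {
    "Web::App": {"Template::Tiny", "Cache::Simple"},
    "Template::Tiny": {"Text::Util"},
    "Cache::Simple": {"Text::Util"},
    "Text::Util": set(),
}
print(install_order(deps))
```

Fetching each distribution and checking its checksum would then happen module by module, in that order.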

Re:CPAN.pm

Elian on 2002-11-12T21:31:14
While CPAN.pm is important (as is search.cpan.org, and the DNS magic Ask has set up) they're really less important than many people think.

It's not been that long since I managed systems where CPAN.pm was completely unusable, and I had to do it all by hand--FTP, make, and the like. No automatic dependency checking, no fetching, no module lists, nothing. (Plus it was five miles uphill in the snow to the nearest mirror!) The only thing available was the base mirror functionality. And with that... CPAN was phenomenally useful. Yeah, sure, I had to track dependencies by hand, but that wasn't a big deal.

The CPAN mirror net and the naming conventions that were set up at the beginning are what provides the power, and drives the toys. CPAN.pm, CPANPLUS.pm, and all the rest are useful, certainly, but without the underpinnings they'd be useless.

That so few people have ever gotten around to building a CPAN says something about the other language communities. (Likely that too few of them are sysadmins) As Jarkko said, it's not rocket science, and it's not even all that tough to get set up. Pity it happens so rarely.

Re:CPAN.pm

hossman on 2002-11-12T22:47:55
CPAN.pm is what makes it possible for people who aren't admins, don't know how to become admins, and don't want to be admins to install perl modules.

There may be some modules that don't install cleanly, or have strange external dependencies that they don't make clear ... but those are special circumstances of the modules. The bottom line is: if I write a pure perl module, and someone wants to use it, they can install it without needing to know anything other than perl. As far as I'm concerned that's the most powerful part of CPAN.

The mirroring and archiving and PAUSE are great, but the fact that anybody can install CPAN modules, without even needing to know what "make" is, or how to run "ftp" is what makes CPAN really great -- it gives all of those modules users.

If someone wanted to try and duplicate the success of CPAN in another language, I think they would HAVE to have a "client" that worked just as well as CPAN.pm

Re:CPAN.pm

Elian on 2002-11-12T23:21:24
I'm not knocking CPAN clients like CPAN.pm or CPANPLUS. I like 'em both, and I'm glad they're around, but... they're not what's made CPAN a success. They built on the success that CPAN had. Yes, they moved CPAN to a new level, and that was good, but the base mirror net and its infrastructure is what's made this all possible. The rest are (nearly inevitable) conveniences, bells, and whistles--often massively useful, but ultimately optional.

Re:CPAN.pm

hfb on 2002-11-13T02:31:24

Actually, I'd say the one thing that really 'made' CPAN was search.cpan....once it caught on it made CPAN accessible to a much wider audience which is why it is so often confused for the archive itself. CPAN was a success just by existing at a time when you had to ftp to 15 different sites just to get the kit you wanted for your systems. CPAN.pm made it convenient and search.cpan made it navigable and less intimidating for those a lot less familiar with CPAN. WAIT and UWinnipeg had been around for at least 2 years before search.cpan but something about the interface and design made it take off in spite of some of the problems I had keeping the box running when the load increased well beyond capacity. It is a very interesting chapter in the annals of perl.

Re:CPAN.pm

rafael on 2002-11-12T23:29:29
Besides CPAN.pm, you should not forget MakeMaker, which has simplified the process of building, installing, testing and packaging a module, portably and consistently. People started to install modules from CPAN because they were easy to install; they started to upload modules to CPAN because it was easy to produce an installable tarball.

MakeMaker

Dom2 on 2002-11-13T11:35:01
Agreed. MakeMaker is an important part of this process. In fact, I hold high hopes for python modules now that they have had distutils for a year or more. It'll make building an archive for them much easier. I think ruby has something similar as well, but I'm not sure...
-Dom

It's easier than you think!

brev on 2002-11-12T23:53:45

When I was working for ActiveState, I got to observe other language communities try (and try, and try) to duplicate CPAN.

They failed with depressing regularity by making it overcomplicated, or centralizing the work too much.

Decentralize!

If you want a community-based system, make the community do as much of the work as possible. No bottlenecks. The one centralized thing in PAUSE->CPAN is a mailing list which approves some changes in the naming hierarchy. This usually works ok but even now some people are frustrated with it.

The proposal for Perl 6's CPAN is that authors should be allowed to write modules with the same name. Joe's "foo.mumble" shouldn't pre-empt Bill's "foo.mumble". This sounds frightening, but I've thought about it a lot and it actually isn't.

Keep it simple!

The Python folks wanted to make a Zope-based archive that was maintained by experts who precompiled modules for various platforms. So there would be no compilation or building step for the users. Installation would happen with web services trickery and so on.

These experts would also exercise their judgment about which modules were good, which modules to approve upgrades for, etc.

Too complicated! And too centralized!

Your first goal is to make it easy to submit code and redistribute it. Ease of use and quality control are not the central problems you are trying to solve here.

Standardize!

But, (you say) I really want my language's archive to surpass the ease of use of Perl's CPAN.

Here's how: build a stable, simple base that the rest of the community can write hooks for.

Standardize on ways of installing the module and testing it. And finally, there should be a standard way of obtaining the documentation from the module.

Perl has imperfect, but widely-adopted standards for all of these, and that is what makes tools like CPAN.pm or Search.CPAN.org or ActiveState's ppm possible.

But keep it extensible!

There are a lot of things we *ought* to have standards for in perl modules, but my goal here is to convince you to keep it simple.

So rather than mention them, I'll just advise you to keep a master file for each module, in some format that allows you to add extra fields later.

Namespaces

Dom2 on 2002-11-13T11:37:47
I actually think that Java got it right when they used domain names as part of the namespace. And really, I agree with Tim B-L when he says that we should be using URIs to identify such things.
Mind you, I don't want to go down the route of SGML catalog files. That's too much like hard work.
-Dom

Re:Namespaces

jhi on 2002-11-13T12:56:13

I think the problem with using domain names is using domain names... that is,

(a) you make an implicit assumption that everybody is in possession of a domain name that they want to use, and can use, for tagging their software

(b) tagging stuff with domain names may mean tagging stuff with trademarks and other legal stuff, which may turn out to be a burden later if the software needs to be renamed for any reason.

Another way of putting it is that using domain names works okay-ish for stable-ish organizations like big companies, but it might be less convenient for more mobile entities. A smaller nit is that the DNS structure might give you false trust in class hierarchies that don't exist (this, of course, being a Perl-specific feature of naming). Especially in larger organizations the hierarchy of DNS names might reflect the organizational (or software production) hierarchy very poorly.

Summary: I think DNS was a step in the right 90 degrees sector, but not in the right direction. DNS names are reasonably nice as opaque IDs, and their uniqueness is guaranteed by an outside entity, and they give you a reasonable rough idea who produced the software, but that's where it ends. I think any module naming hierarchy should be independent of DNS.

Using URIs would suffer from many of the same problems as using DNS. It would be a little bit better since the administration of the namespace can be divided more, but still...

In fact I think the key for naming is to kiss goodbye to hierarchies. I think the Rule One of Hierarchies is that you can never get hierarchies right. If you think you got it right, you probably didn't ask enough people for their opinions. One should think in rules. A module should be identified by a (primarily) unordered set of rules: name, author, version, and so forth.
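To make the idea a bit more concrete, here is a minimal sketch in Python of identifying modules by a flat set of attributes instead of a position in a hierarchy. The class name, the field names beyond name/author/version, and the sample data are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleId:
    """A module identified by attributes, none of which is 'the' key."""
    name: str
    author: str
    version: str


def find(catalog: list[ModuleId], **wanted: str) -> list[ModuleId]:
    """Return the modules whose attributes match every given constraint."""
    return [
        m for m in catalog
        if all(getattr(m, field) == value for field, value in wanted.items())
    ]


# Two authors publishing a module with the same name do not collide.
catalog = [
    ModuleId("foo", "JOE", "1.0"),
    ModuleId("foo", "BILL", "0.3"),
    ModuleId("bar", "JOE", "2.1"),
]
print(find(catalog, name="foo"))
print(find(catalog, name="foo", author="BILL"))
```

Any subset of the attributes works as a query, so "by author", "by name", and "by version" are just different filters over the same records.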

Easier than you think (easier to read)!

brev on 2002-11-13T00:10:21

When I was working for ActiveState, I got to observe other language communities try (and try, and try) to duplicate CPAN.

They failed with depressing regularity by making it overcomplicated, or centralizing the work too much.

Decentralize!

If you want a community-based system, make the community do as much of the work as possible. No bottlenecks. The one centralized thing in PAUSE->CPAN is a mailing list which approves some changes in the naming hierarchy. This usually works ok but even now some people are frustrated with it.

The proposal for Perl 6's CPAN is that authors should be allowed to write modules with the same name. Joe's "foo.mumble" shouldn't pre-empt Bill's "foo.mumble". This sounds frightening, but I've thought about it a lot and it actually isn't.

Keep it simple!

The Python folks wanted to make a Zope-based archive that was maintained by experts who precompiled modules for various platforms. So there would be no compilation or building step for the users. Installation would happen with web services trickery and so on.

These experts would also exercise their judgment about which modules were good, which modules to approve upgrades for, etc.

Too complicated! And too centralized!

Your first goal is to make it easy to submit code and redistribute it. Ease of use and quality control are not the central problems you are trying to solve here.

If your solution necessarily involves databases or web servers, I respectfully suggest you are making it too difficult. You can distribute CPAN on a CD-ROM.

Standardize!

But, (you say) I really want my language's archive to surpass the ease of use of Perl's CPAN.

Here's how: build a stable, simple base that the rest of the community can write hooks for.

Standardize on ways of installing dependencies, installing the module, and testing it. And finally, there should be a standard way of obtaining the documentation from the module.

Perl has imperfect, but widely-adopted standards for all of these, and that is what makes tools like CPAN.pm or Search.CPAN.org or KobeSearch possible.

But keep it extensible!

There are a lot of things we *ought* to have standards for in perl modules, but my goal here is to convince you to keep it simple.

So rather than mention them, I'll just advise you to keep a master file for each module, in some format that allows you to add extra fields later.
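As one possible reading of "a format that allows you to add extra fields later", here is a minimal sketch in Python using JSON. The choice of JSON, the required field names, and the sample record are all assumptions for illustration; the only point is that a reader checks the few fields it needs and carries everything else along untouched.

```python
import json

# Hypothetical minimum every master file must carry.
REQUIRED_FIELDS = ("name", "author", "version")


def load_master(text: str) -> dict:
    """Parse a module's master file, tolerating fields added later."""
    record = json.loads(text)
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"master file lacks: {', '.join(missing)}")
    return record  # unknown fields are kept, not rejected


sample = """
{
  "name": "foo.mumble",
  "author": "JOE",
  "version": "1.0",
  "license": "artistic",
  "requires": ["bar.mumble"]
}
"""
record = load_master(sample)
print(record["name"], record["version"])
```

An old front-end that only knows the three required fields keeps working when "license" or "requires" shows up, and a newer one can start using them without a format change.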

final note:

I helped write the current generation of ppm, ActiveState's tool to precompile modules server-side and make it easy to install them client-side.

Some people believe that it should be that easy, from day one, in their languages.

But! That tool is a going concern only because there were a lot of standardized modules to begin with, which made it worth ActiveState's time to devote the extra effort to write a layer on top of CPAN.

Obviously, one shouldn't have to rely on a for-profit company to write a tool like ppm. But the cost-benefit problem is the same. Make it easy for there to be multiple front-ends!

Re:Easier than you think (easier to read)!

jhi on 2002-11-13T14:54:49
Thanks, excellent points. When I update my document I'll underline the KISS principle with bold strokes, and also the issue of precompiling to byte/object code.

RT Ticketing

brockgr on 2002-11-13T01:27:28

Just thought I would add how much I like the RT system for CPAN. It's of great use to people like me on minority platforms (Solaris ;-).

Pity the search.cpan.org web interface changed about a week after RT was released. Still the blue and white is nostalgic.

One thing though - looks like the SSL certificate for rt.cpan.org expired on November 7th.

Gavin

Re:RT Ticketing

jesse on 2002-11-13T17:06:45
Sorry. Fixed.

Source Code

2shortplanks on 2002-11-13T10:46:49

At work we use CPAN.pm to install our own non-CPAN modules. Essentially this involves examining the modules and building compatible index files with a simple script...

I'd love to have a look at the source code for PAUSE and the other systems. Is it on the CPAN somewhere, or somewhere else online?

There's also CTAN

m2 on 2002-11-13T14:04:03

The Comprehensive TeX Archive Network. For some reason I've always thought that CTAN preceded CPAN, but I'm not really sure which one was there first. Like CPAN, CTAN was conceived as an FTP-based service and then the web came and people moved on and you know the rest. Since I use both CTAN and CPAN on a regular basis, sometimes I find myself wishing CTAN to be more CPAN-like. The CTAN Catalog is superb, but I think the killer CPAN feature is the ability to browse the documentation in a nice, easy-to-read format. (La)TeX packages have great documentation, but you sort of require a DVI or PS viewer, which aren't really documentation browsing tools (cutting and pasting code is hard or impossible).

Thank you for bringing CPAN to life, it's just wonderful.

Re:There's also CTAN

jhi on 2002-11-13T14:46:06
CTAN was there first, and we freely acknowledge our debt and inspiration in naming :-)
CPAN is "only" seven years old, while CTAN is, gee, older than that. I can't off-hand find out how old CTAN is.

Re:There's also CTAN

jschrod on 2002-11-15T17:07:28
We (George Greenwade in the US, Sebastian Rahtz in the UK, Rainer Schöpf and myself in Germany) built CTAN in 1992. It was "officially" announced at the EuroTeX conference in Aston, 1993.
CTAN was an effort to bring together the separating ftp servers with TeX material. I'm proud to say that it was triggered by a podium discussion I organized at the EuroTeX conference 1991, in Paris. George came up with the name CTAN, I think I have his email still somewhere in my archives. I got involved since I ran one of the largest ftp servers in Germany at this time (ftp.tu-darmstadt.de) and had heavily modified mirror.pl from Lee for this purpose.
The CTAN site structure was actually put together at the start of 1992 (Sebastian did the main work, as he did later with TeXLive), and synchronized at the start of 1993. The TeX Users Group provided a framework for this task's organization; there was a Technical Working Group for this purpose.
Cheers,
Joachim Schrod

Re:There's also CTAN

jhi on 2002-11-15T19:16:07
Thanks!

(This history somewhere in the CTAN website would be neat.)

> CTAN was an effort to bring together the separating ftp servers with TeX material.

Sounds so very familiar...

> and had heavily modified mirror.pl from Lee for this purpose.

:-)

If you CTAN guys have any comments and/or suggestions for the "ZCAN" article I would be more than happy to incorporate them.

Re:There's also CTAN

jschrod on 2002-11-15T17:12:37
Effort in CTAN package documentation actually goes into TeXlive. The new distribution has most of the documentation in PDF. HTML is not practical, since most of the documentation will demonstrate some layout example.
Maybe sometime we'll find the volunteers to carry this effort back to CTAN.
Actually, there's a lot in CPAN we'd like to have in CTAN as well, and never got around to it. Most important, something similar to PAUSE, and commonly agreed upon package structures.
Sigh, so much to do, so few time. (w/ apologies to jwz.)
Cheers,
Joachim