CPAN like it's 1995

bart on 2009-01-08T22:29:25

(title inspired by a blog post)

CPAN.pm used to ask a bunch of questions the first time it was run. One of the questions was which CPAN mirrors to use.

Now it doesn't any more: it comes preconfigured. But that comes at a price: a lot of distros simply assume use of http://cpan.perl.org/ or of http://www.cpan.org/ (while several Perl ports, such as Strawberry Perl for Windows and, I thought, ActiveState's ActivePerl, use their own private CPAN mirror by default).

Is the idea of using CPAN mirrors simply outdated? Or should the CPAN client be smarter and figure out for itself which mirrors to use? The latter feels like overkill to me. It presumes inclusion of a geolocator module and database, like Geo::IP (the free version of that database is far more than sufficient for this purpose, so the license price is no objection). But having that module and database on every Perl installation, just to get a list of mirrors once, or maybe a few times, in the lifetime of that installation, really is far too much.

I can remember how http://www.perl.com/CPAN, thanks to Tom Christiansen IIRC, used to have a built-in redirector that figured out where in the world you were and hence which (single) mirror to use. But if that one mirror was offline, you were out of luck: it didn't check the status of the mirror, it just redirected you there.

If we still wish to use mirrors, why not drag CPAN into the age of web services? (Actually we're already late for that, as the age of web services seems to have passed already... :)) Set up a page on a site, for example on www.cpan.org, where CPAN.pm can simply ask "Can you suggest me what mirrors to use?" (pun intended). Then only the central site needs to have the geolocation database, to check what part of the world the request comes from and compose a list of preferred mirrors. The output could be as simple as a text/plain page with one mirror URL per line, returning maybe 5 or 10 URLs in total. Easy to generate, and dead easy to parse.
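
To make the "easy to generate, dead easy to parse" claim concrete, here is a minimal sketch of both halves of such an exchange. It is written in Python purely for illustration. The region table, the example.* mirror URLs, and the function names are all made up, and the geolocation lookup that would turn an IP address into a region is assumed to happen elsewhere.

```python
# Hypothetical region-to-mirror table. A real service would derive the
# region from the client's IP address via its geolocation database.
MIRRORS_BY_REGION = {
    "eu": ["http://mirror.example.de/CPAN/", "http://mirror.example.nl/CPAN/"],
    "na": ["http://mirror.example.us/CPAN/"],
}
FALLBACK = ["http://www.cpan.org/"]


def render_mirror_list(region, limit=10):
    """Server side: build the text/plain body, one mirror URL per line."""
    urls = MIRRORS_BY_REGION.get(region, []) + FALLBACK
    return "\n".join(urls[:limit]) + "\n"


def parse_mirror_list(body):
    """Client side: keep the lines that look like http, https or ftp URLs."""
    urls = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://", "ftp://")):
            urls.append(line)
    return urls
```

The client ignores anything that does not look like a URL, so a blank line or an error message in the response cannot end up in the mirror list.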

(Note: the order of mirrors that are close to each other in level of preference could be randomly shuffled for each request, so that users in one area do not all hammer the same mirror.)
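
The shuffling could look roughly like this, assuming the service has already assigned each mirror a numeric preference tier (lower is better). The tier numbers and the function name are hypothetical; only the ordering logic matters here.

```python
import random
from itertools import groupby


def shuffle_within_tiers(ranked, rng=random):
    """ranked is a list of (tier, url) pairs; a lower tier is preferred.

    Returns the URLs ordered by tier, with the order randomised inside
    each tier so that equally good mirrors share the load.
    """
    by_tier = sorted(ranked, key=lambda pair: pair[0])
    result = []
    for _tier, group in groupby(by_tier, key=lambda pair: pair[0]):
        urls = [url for _, url in group]
        rng.shuffle(urls)
        result.extend(urls)
    return result
```

A better tier always stays ahead of a worse one; only mirrors within the same tier trade places between requests.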

CPAN.pm can still be made a bit smarter. For example, it could use ping to test the responsiveness of a mirror or, simpler still, time how long it takes to fetch a page from the currently chosen mirror and check whether that is fast enough. What counts as fast enough depends on your internet connection, so it should keep track of the responsiveness of the mirrors in order to compare them, and switch their order if that is likely to improve matters seriously.
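
As a rough illustration of the timing idea (this is not how CPAN.pm works, and every name below is made up), a client could wrap each fetch in a timer and promote another mirror only when it is clearly faster than the current first choice. The min_gain threshold is an assumption standing in for "seriously improve matters".

```python
import time


class MirrorTimer:
    """Records how long fetches take per mirror and reorders the list."""

    def __init__(self, mirrors, clock=time.monotonic):
        self.mirrors = list(mirrors)  # current order of preference
        self.samples = {}             # url -> list of elapsed seconds
        self.clock = clock            # injectable, so it can be faked

    def timed_fetch(self, url, fetch):
        """Call fetch(url), remember how long it took, return its result."""
        start = self.clock()
        result = fetch(url)
        self.samples.setdefault(url, []).append(self.clock() - start)
        return result

    def average(self, url):
        """Mean fetch time in seconds, or None if never measured."""
        times = self.samples.get(url)
        return sum(times) / len(times) if times else None

    def reorder(self, min_gain=2.0):
        """Move the fastest measured mirror to the front, but only if it is
        at least min_gain times faster than the current first choice."""
        if not self.mirrors:
            return self.mirrors
        current = self.mirrors[0]
        current_avg = self.average(current)
        if current_avg is None:
            return self.mirrors  # nothing to compare against yet
        measured = [m for m in self.mirrors if self.average(m) is not None]
        best = min(measured, key=self.average)
        if best != current and self.average(best) * min_gain <= current_avg:
            self.mirrors.remove(best)
            self.mirrors.insert(0, best)
        return self.mirrors
```

Here fetch is any callable that downloads the given URL. Comparing averages between mirrors, rather than against a fixed number of seconds, means the speed of the user's own connection largely cancels out.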

I like it

Mr. Muskrat on 2009-01-08T22:41:16

So when do you think you can have a proof of concept written? :D

Why is this a problem?

jj on 2009-01-09T08:38:33

Why do you think this is a problem?

Letting CPAN.pm configure itself with sensible defaults is a good thing - it creates a better user experience, and if for whatever reason you want to use a specific mirror it can easily be set using the "o conf urllist" command.

Re:Why is this a problem?

bart on 2009-01-18T21:16:51

CPAN appears to take real pride in the fact that it has a huge network of 222 mirrors. Is that pride justified? Or is having such a network of mirrors just an outdated (1995), almost ridiculous concept?

I doubt whether anybody still modifies the default settings for CPAN.pm once it works. I know I don't. That means that currently maybe, say, 70% of all installations use the same 3 or 4 repository servers, and that percentage can only go up.

Do we really have to maintain the mirror network? Or can we think of dismantling it? You could argue that, now that it works, keeping the mirror network running comes at no cost at all. Fine.

If it is still useful to have this huge network of mirrors, then maybe we ought to spread the load, automatically.

BTW I noticed the other day that http://cpan.strawberryperl.com/ automatically forwards to http://cpan.yahoo.com/, probably a server that is more up to the heavy load of supporting everybody on the planet. Running a repository for the whole planet may have proven not to be trivial, after all.

But my earlier idea of setting up a web service for the mirror list may actually be a far more complicated solution than necessary. Perhaps it would suffice if people's CPAN clients just round-robined through the list of mirrors in their (sub-)continent, or randomly picked a mirror from that same list each time.

Gathered download metrics could still be useful to adjust the weights for the random choice, so that a large server automatically gets used more than a small one.

P.S. Oh, there are only 220 mirrors now; it seems that 2 have dropped off in just a few days.
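
For what it's worth, both the round-robin and the weighted-random variants are only a few lines each. The sketch below is in Python for illustration only; the function names are made up, and the weights are assumed to come from whatever download metrics are available.

```python
import itertools
import random


def round_robin(mirrors):
    """Endless iterator that walks through the regional list in turn."""
    return itertools.cycle(mirrors)


def weighted_pick(weights, rng=random):
    """Pick one mirror at random; weights maps url -> relative capacity.

    A mirror with weight 9 is chosen about nine times as often as one
    with weight 1. At least one weight must be greater than zero.
    """
    urls = list(weights)
    return rng.choices(urls, weights=[weights[u] for u in urls], k=1)[0]
```

Round-robin needs the client to remember its position between runs, while the random pick is stateless, which is probably the simpler fit for a client that runs only now and then.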

Media type

Aristotle on 2009-01-09T12:41:39

The output could be as simple as a text/plain page with one URL of a mirror per line

Did you mean: text/uri-list

FTPstats.yml

srezic on 2009-01-09T19:47:50

CPAN.pm already writes some statistics about downloaded files into FTPstats.yml. This can be used to calculate the download speed and maybe re-configure the urllist in CPAN/Config.pm. For a quick start:

perl -MYAML::Syck=LoadFile -e '$x=LoadFile shift; for (@{$x->{history}}) { my $t = $_->{end}-$_->{start} or next; warn((-s "sources/$_->{file}") / $t, "\n") }' FTPstats.yml