Suggestions for merlyn's mirror CPAN program

jdavidb on 2002-10-16T16:27:32

I've been running my CPAN.pm shell sessions with a file:// URL for a while now, thanks to merlyn's recent CPAN mirror program (which pulls the bare minimum to create a usable CPAN repository: only the most recent modules). I noticed today though that CPAN.pm is still going to my previous first choice repository to download and compare checksums.

So, I added the following:

  • Declare an %authors hash just before the while gzreadline loop
  • After calling my_mirror on the module distribution file, get just the directory name with dirname, and add it to the hash as a key
  • For each key in the hash, call mirror (not my_mirror) on "authors/id/$key/CHECKSUMS"

Here's the actual patch, but you might prefer just the description:

--- mirrorcpan.perl.old	2002-10-16 11:03:32.000000000 -0500
+++ mirrorcpan.perl	2002-10-16 11:19:58.000000000 -0500
@@ -47,6 +47,7 @@
 +"rb")
   or die "Cannot open details: $gzerrno";
 my $state = 1;
+my %authors;
 while ($gz->gzreadline($_) > 0) {
   if ($state == 1) {        # in header
     $state = 2 unless /\S/;
@@ -59,6 +60,18 @@
 
   my ($module, $version, $path) = split;
   my_mirror("authors/id/$path");
+  my $authordir = dirname $path;
+  $authors{$authordir} = 1;
+}
+
+foreach my $authordir (keys %authors) {
+  my $path = "authors/id/$authordir/CHECKSUMS";
+  my $source = URI->new_abs($path, $REMOTE)->as_string;
+  my $dest = catfile($LOCAL, $path);
+  mirror($source, $dest);
+  # we use mirror instead of my_mirror because my_mirror presumes a
+  # file is up to date if it exists, but CHECKSUMS will change
+  # contents but not names
 }
 
 ## finally, clean the files we didn't stick there
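
To see what the %authors bookkeeping in the patch boils down to, here is a small standalone sketch. It only computes the CHECKSUMS paths and does no mirroring. The sample index lines, the author directories in them, and the helper name checksums_paths are made up for illustration; they are not part of mirrorcpan.perl.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical sample lines in the "module version path" layout
# that the patch splits on (not real index data).
my @details = (
  "Foo::Bar  1.02  A/AB/ABC/Foo-Bar-1.02.tar.gz",
  "Foo::Baz  1.02  A/AB/ABC/Foo-Bar-1.02.tar.gz",
  "Quux      0.10  X/XY/XYZ/Quux-0.10.tar.gz",
);

# Hypothetical helper: collect each author directory once (hash keys
# deduplicate), then build one CHECKSUMS path per directory.
sub checksums_paths {
  my @lines = @_;
  my %authors;
  for my $line (@lines) {
    my ($module, $version, $path) = split ' ', $line;
    $authors{ dirname($path) } = 1;
  }
  return map { "authors/id/$_/CHECKSUMS" } sort keys %authors;
}

print "$_\n" for checksums_paths(@details);
```

Two distributions share the first directory, so only two CHECKSUMS paths come out for the three index lines.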

This is being tested even as I speak. It may not work for you. It may not work for me. It may only work the first time I run it. :)

Update: Actually, I screwed up and tested a version that called my_mirror instead of mirror. I had misunderstood the call semantics of my_mirror, which made the version above not work (fixed). Of course, now you're making a connection for every CHECKSUMS file just to see if it changed, which is pretty lame. Maybe someone can come up with a better idea.

Question: why are the my_mirror and clean_unmirrored subroutines wrapped in a BEGIN block? I understand why they are in a block, but why does it have to be compiled first?
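
One plausible reason, and this is a guess rather than anything confirmed by the script's author: if the subroutines share a lexical that needs an initial value, and the main code calls them before the block is reached at run time, then only a BEGIN block guarantees the initialization has already happened. A minimal sketch of that general Perl behavior, using a made-up counter and a hypothetical next_id sub rather than anything from the mirror script:

```perl
use strict;
use warnings;

# Main code runs first and calls the sub before the block below
# would be reached in normal top-to-bottom execution.
my $first  = next_id();
my $second = next_id();
print "$first $second\n";

BEGIN {
  # Because of BEGIN, this assignment runs at compile time, so the
  # private variable already holds 100 when the calls above execute.
  my $counter = 100;
  sub next_id { return $counter++ }
}
```

With a bare block instead of BEGIN, the sub would still be defined at compile time, but the assignment to $counter would not have run yet when the calls happen, so the counter would not start at 100.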


comment from the author

merlyn on 2002-10-16T19:27:38

"I noticed today though that CPAN.pm is still going to my previous first choice repository to download and compare checksums."
Are you sure you have the latest version? There's code specifically in there to download the CHECKSUMS file to prevent exactly such an action:
    if ($path =~ m{^authors/id}) { # maybe fetch CHECKSUMS
      my $checksum_path =
        URI->new_abs("CHECKSUMS", $remote_uri)->rel($REMOTE);
      if ($path ne $checksum_path) {
        my_mirror($checksum_path, $checksum_might_be_up_to_date);
      }
    }
I added this code because I noticed the same behavior you reported when I was offline (at 30,000 feet, actually {grin}). I do recall some preliminary version of minicpan being circulated about... maybe you picked that up from somewhere.

Yes, you must have the old version

merlyn on 2002-10-16T19:40:13

I just noticed that the Perlmonks version is the buggy preliminary version. Please use the final version instead.

Re:Yes, you must have the old version

jdavidb on 2002-10-16T20:03:54

Thank you! Turns out my solution didn't work, anyway. It went and downloaded all the CHECKSUMS files ... then deleted them!

Re:Yes, you must have the old version

merlyn on 2002-10-16T20:37:28

"It went and downloaded all the CHECKSUMS files ... then deleted them!"
Heh! That's exactly what the very next version did for me.

At least you were on the right track. Another 42 minutes or so, and you'd have ended up with my final version.

The key was not running mirror needlessly. I ended up with a multi-stage algorithm, described in the accompanying text. The result is that I don't try to mirror any CHECKSUMS for which I already have a local version and none of its associated files have been updated either, and that I don't ever mirror the CHECKSUMS more than once in a session. That keeps the fetches to a minimum. It's really rather slick in operation. I can update my mini-cpan over a 28.8 connection from a hotel with just a minute or two of overhead, plus the download times of the actual tar.gz files. (Now if only some people would avoid putting entire MP3s in their test suite... {grin}.)
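
A rough sketch of that decision logic, written from the description above and not copied from the actual script. The sub name maybe_fetch_checksums, its arguments, and the @fetched list (which stands in for real network fetches so the sketch runs offline) are all assumptions made for illustration.

```perl
use strict;
use warnings;

my %checksums_done;  # author directories whose CHECKSUMS were fetched this session
my @fetched;         # stand-in for real fetches: records what would be mirrored

# $dir:              author directory, e.g. "A/AB/ABC" (made-up example)
# $dist_was_updated: true if a file in $dir was just downloaded
# $have_local:       true if a local CHECKSUMS copy already exists
sub maybe_fetch_checksums {
  my ($dir, $dist_was_updated, $have_local) = @_;
  return 0 if $checksums_done{$dir};              # never more than once per session
  return 0 if $have_local && !$dist_was_updated;  # local copy assumed still good
  $checksums_done{$dir} = 1;
  push @fetched, "authors/id/$dir/CHECKSUMS";
  return 1;
}

maybe_fetch_checksums("A/AB/ABC", 0, 1);  # nothing changed, local copy exists: skipped
maybe_fetch_checksums("A/AB/ABC", 1, 1);  # a distribution changed: fetched
maybe_fetch_checksums("A/AB/ABC", 1, 1);  # already fetched this session: skipped
maybe_fetch_checksums("X/XY/XYZ", 0, 0);  # no local copy yet: fetched
print "$_\n" for @fetched;
```

Of the four calls, only two would touch the network, which is the point of the scheme: a CHECKSUMS fetch happens only when something in that directory changed or no local copy exists.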

Re:Yes, you must have the old version

brian_d_foy on 2002-10-16T21:21:13

There isn't any way to avoid putting an entire MP3 in Mac::iTunes because there's no way to put half of one there. :)

Re:Yes, you must have the old version

merlyn on 2002-10-16T21:38:28

It could be a *tiny* MP3 though. 10 seconds of silence or something. {grin}

Re:Yes, you must have the old version

brian_d_foy on 2002-10-16T22:39:10

It could be smaller, but it can't be silence. Besides not adequately testing things, I don't have permission to use silence. John Cage's lawyers recently settled a case over a one-minute track of silence.

Re:Yes, you must have the old version

jdavidb on 2002-10-17T13:23:40

Maybe you can come up with a GPL-ed alternative to silence.