Cataloging BackPAN: MiniCPAN done in 9 hours

brian_d_foy on 2008-09-06T18:21:36

My BackPAN indexer (YAPC::EU 2008 slides) made its first complete pass through my MiniCPAN yesterday:

  • Distributions processed: 16039
  • Indexing failures: 782 (4.9%)
  • Run time: 9 hours (0.49 dists / sec)


The total size of BackPAN is about 100,000 distributions, so I think this means that I could index all of BackPAN in less than a week.
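As a rough check on that estimate, here is a minimal sketch of the arithmetic using only the figures above. It assumes the indexing rate stays constant across the older BackPAN distributions, which may not hold, and the function names are made up for illustration.

```python
# Figures reported in the post.
DISTS_PROCESSED = 16039
RUN_HOURS = 9
BACKPAN_DISTS = 100_000  # "about 100,000", so this is an approximation


def throughput(dists, hours):
    """Distributions indexed per second over a run of the given length."""
    return dists / (hours * 3600)


def projected_days(total_dists, rate_per_sec):
    """Days needed to index total_dists at a constant rate (an assumption)."""
    return total_dists / rate_per_sec / 86400


rate = throughput(DISTS_PROCESSED, RUN_HOURS)
days = projected_days(BACKPAN_DISTS, rate)
print(f"{rate:.3f} dists/sec, about {days:.1f} days for all of BackPAN")
```

At the measured rate the projection comes out between two and three days of continuous running, which leaves plenty of slack inside the "less than a week" figure.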

Right now I output everything as YAML, one YAML file per distribution. The data organization is sloppy and sometimes redundant because I haven't paid attention to it. You can get the tarball of all 16,000 files. Take a look to see if there might be anything else you'd want the indexer to record about a distribution. If you're interested in making some sort of CPAN service, let me do the work of cataloging the information you need.

If you want to play with this, get MyCPAN::Indexer from CPAN Search, or if you want to play with everything, check out the sources from Github. You probably can't install it from CPAN since it depends on a couple of modules which only have developer releases right now.

The thing you'd want to play with is examples/backpan_indexer.pl. It's a little messy right now because I bolted on a Tk interface (see video one or two) that lives in examples/tk.pl and a dispatcher that lives in examples/steak.pl. My next step is to make those pluggable modules so you can note in the configuration file which interface and dispatcher you want, and as long as they have the right interface, they'll do whatever they do.

After a little bit more work on the indexing stuff, the next step is to take all of those YAML files and distill them into something that is easier to search, then hook up some sort of search interface to them. I'll probably first write a command-line tool (although with wonderful MVCness). I want to feed the index any file in @INC and get a report:

$ cpan_index `perldoc -l Foo`
Foo.pm's fingerprint found in Foo-Bar-0.05.tgz
	Author: Joe Snuffy (SNUFFY@cpan.org)
	Release date: Nov 11, 1998, 23:59:59
	Version: 0.05
	Latest version on CPAN: Foo-Bar-0.06.tgz
	Current maintainers: 
		Joe Snuffy (SNUFFY@cpan.org)  (first come)
		Joe Cool (CAMEL@cpan.org)     (co-maintainer)
	Also came with:
		!!!Bar.pm, installed version 0.08 (does not match Bar.pm from Foo-Bar-0.05.tgz)
		ABC.pm, installed version 0.05 (matches ABC.pm in Foo-Bar-0.05.tgz)
	Depends on:
		Baz.pm from Baz-0.67.tgz
		Quux.pm from Quux-0.01.tgz
	CPAN Testers Matrix: ...
	Release history:
		0.01  Dec 31, 1969, 23:59:59  SNUFFY  (BackPAN)
		0.02  Jan 31, 1995, 23:59:59  SNUFFY  (BackPAN)
		0.03  Jun 6,  1996, 23:59:59  SNUFFY  (BackPAN)
		0.04  Oct 31, 1997, 23:59:59  SNUFFY  (BackPAN)
	****0.05  Nov 11, 1998, 23:59:59  SNUFFY  (CPAN)
		0.06  Sep  5, 2008, 23:59:59  CAMEL   (CPAN)	
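The lookup step behind a report like that could be sketched roughly as follows. This is not how MyCPAN::Indexer works internally: the SHA-1 digest as the "fingerprint", the dictionary used as the distilled index, and the names `fingerprint` and `lookup` are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Hex SHA-1 digest of a file's bytes (the digest choice is an assumption)."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def lookup(index, path):
    """Return the distributions whose indexed fingerprint matches the file, or []."""
    return index.get(fingerprint(path), [])


# Demo with a throwaway file standing in for an installed Foo.pm.
with tempfile.TemporaryDirectory() as tmp:
    module = os.path.join(tmp, "Foo.pm")
    with open(module, "w") as fh:
        fh.write("package Foo;\n1;\n")
    # Hypothetical distilled index: fingerprint -> list of distribution names.
    index = {fingerprint(module): ["Foo-Bar-0.05.tgz"]}
    matches = lookup(index, module)
    print(matches)
```

Keying the index by digest keeps the lookup to a single dictionary access; the rest of the report (maintainers, release history, dependencies) would come from whatever else the per-distribution YAML records.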


Vimeo Links

Ovid on 2008-09-06T20:02:43

The Vimeo links don't work. I'm getting "Video not found".

That being said, this is really great work.

Re:Vimeo Links

brian_d_foy on 2008-09-07T16:07:59

I posted as soon as I uploaded, and Vimeo needs a chance to process them. They are a bit weird about when they actually make them available. The same links should work now.

video

andy.sh on 2008-09-07T05:58:33

By the way, yapc.tv's video went online this (European) night.

Re:video

brian_d_foy on 2008-09-07T16:10:23

Woo hoo! Thanks!

Now YAPC.tv needs a little badge that I can put on my web page next to the other details for the talk. I know it would be a lot of work, but a YAPC.tv logo in the corner of the video would be sweet too. :)

Re:video

andy.sh on 2008-10-25T10:35:35

Each page now contains example HTML code that can be put on a person's website to link to the talk.

Re:video

brian_d_foy on 2008-11-25T15:25:32

Ah, I meant a cool "YAPC.tv" logo that says "YAPC.tv". I was also thinking that having an overlay on the actual video that says "YAPC.tv" would be nice. Maybe people wouldn't like that on their videos, but I don't mind promoting the project since you did the work. :)

Re:video

andy.sh on 2008-11-25T15:43:13

In fact, I have not decided yet whether we need the logo or not :-)