PAUSE on CPAN, indexing $stuff

tsee on 2008-09-05T14:19:52

Most likely, everybody who reads this directly or indirectly depends on the operation of the PAUSE indexer (aka mldistwatch). The PAUSE indexer scans new distributions on the CPAN (really on PAUSE at that point) for the packages/namespaces and associated versions they contain, sends the uploader a friendly message with the results, and adds the information to the metadata that's used by our toolchain when people install modules from the CPAN.
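To make the scanning step concrete, here is a much-simplified Python sketch of pulling package names and versions out of Perl source text. It is not the mldistwatch code: the regular expressions and the `scan_source` name are assumptions made for illustration, and the real indexer handles many more cases than this.

```python
import re

# Hypothetical, simplified patterns. The real PAUSE indexer covers far
# more cases (and does not simply regex-match version assignments).
PACKAGE_RE = re.compile(r"^\s*package\s+([A-Za-z_][\w:]*)\s*(?:v?([\d._]+)\s*)?[;{]")
VERSION_RE = re.compile(r"""\$(?:[\w:]+::)?VERSION\s*=\s*['"]?v?([\d._]+)['"]?""")


def scan_source(text):
    """Return {package_name: version_or_None} for one Perl source string."""
    found = {}
    current = None
    for line in text.splitlines():
        # Stop at the end-of-code markers; anything after them is data or POD.
        if line.strip() in ("__END__", "__DATA__"):
            break
        match = PACKAGE_RE.match(line)
        if match:
            current = match.group(1)
            # Keep the first version seen for a package (may be None for now).
            found.setdefault(current, match.group(2))
            continue
        match = VERSION_RE.search(line)
        if match and current is not None and found[current] is None:
            found[current] = match.group(1)
    return found
```

Calling `scan_source` on the text of each `.pm` file in an unpacked distribution and merging the results would give a rough package-to-version map, which is the kind of information the indexer records.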

The PAUSE code was written and is still maintained by Andreas König. It's a rather large and unquestionably rather complex piece of software.

I don't think I'm giving anybody a big surprise if I say that being able to run this same indexer on a given tarball or zip offline may be useful for some toolchain modules. One example would be generating the META.yml provides section.

At the social event of YAPC::EU 2008, Andreas and a posse of other PAUSE admins, including me, sat down to talk about the directions our tools are heading as well as policies. I don't think anybody disagreed that it'd be great to have components of PAUSE available individually from CPAN. But that's one ambitious goal!

A long time ago, I had spent a significant amount of time on porting the PAUSE indexer code in order to be able to index PAR distributions for injection into a PAR::Repository. I could do all sorts of simplifications for that purpose. For example, .par files are always in ZIP format, no tarballs, etc.
Last night, I decided to take a shot at making the PAUSE indexer its own CPAN module.

But I failed.

It turned out to be very, very tightly woven into the whole PAUSE code. I'm really not sure how I got the PAR file scanner to work on the basis of the PAUSE indexer. So I switched to a less ambitious goal: split the PAR indexer out of the PAR::Repository code into the PAR::Indexer module (and distribution) for general consumption.

That's where it stands today. For the future, I figure adding some code back into the mix and making a more generic indexer distribution would get it up to producing 98% of the same results as the real PAUSE indexer. I can do this, but first I'd like to know: would you consider this useful?
And a challenge for all the testing gurus: How would you try to exhaustively test this thing against the PAUSE indexer?
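One possible starting point, sketched in Python under the assumption that both indexers can be made to emit a plain package-to-version mapping for the same distribution: diff the two mappings and report every disagreement. The `compare_indexes` name and the shape of the report are hypothetical.

```python
def compare_indexes(reference, candidate):
    """Compare two {package: version} mappings for the same distribution.

    `reference` stands for the output of the real PAUSE indexer and
    `candidate` for the standalone one (both hypothetical inputs here).
    """
    ref_names = set(reference)
    cand_names = set(candidate)
    shared = sorted(ref_names & cand_names)
    return {
        # Packages the reference found but the candidate did not.
        "missing": sorted(ref_names - cand_names),
        # Packages only the candidate reported.
        "extra": sorted(cand_names - ref_names),
        # Packages both found, but with different versions.
        "mismatched": {
            name: (reference[name], candidate[name])
            for name in shared
            if reference[name] != candidate[name]
        },
    }
```

Run over a large sample of existing distributions, an empty report for each one would be the goal, and every non-empty report points at a case the standalone indexer gets wrong.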

Cheers,
Steffen


I want to help

brian_d_foy on 2008-09-05T19:34:05

Some of my BackPAN stuff is pulling out bits of PAUSE to index things. Ultimately I want my BackPAN work to do the same job as PAUSE but for a different purpose: make index files for whatever people want to do. (But PAUSE also does all the user management stuff.)

Besides that, I'd eventually like to get a shadow PAUSE running just so there's an extra one lying around ready-to-go.

And, once I do that, I should be able to branch the code and see if there are ways that we can uncouple parts of it. Since we'll have shadow PAUSE running the mainline code, we should be able to do regression testing pretty easily.

If you're interested in pulling out more stuff on PAUSE, maybe we should organize a bit of a hack-a-thon for it. Maybe not in person, but we pick out a feature to see if we can abstract it. Once we do that, maybe we can slowly roll things back into PAUSE. I've looked at the code too, and I think I know how it works, but it's not something that's easy to uncouple.

Re: I want to help

tsee on 2008-09-06T00:16:51

That was precisely my impression: not easy to uncouple. Having a second copy of PAUSE running for our hacking pleasure and potentially as a fallback would be great.

I'd be all for hacking on this together, but you're right: Having a development copy of PAUSE would be indispensable.

Re: how about this

Eric Wilhelm on 2008-09-06T03:52:56

I did some ugly things which run the actual mldistwatch code using some monkeypatches and a mock DBI object, etc. Mostly it is a matter of emulating the environment. Perhaps the code could be refactored to make this easier (so you don't need all the uglies.)

And wouldn't it be cool if the mock DBI object could actually be a frontend to query the actual PAUSE via a web service?! Then you could know what is really going to happen if you upload your code right now, etc.

Re: how about this

tsee on 2008-09-06T10:15:21

Or how about having the data in a couple of SQLite files locally for testing?
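As a rough idea of what that could look like, here is a Python sketch that keeps a small package table in SQLite. The table layout and the function names are assumptions for illustration only and do not reflect the actual PAUSE database schema.

```python
import sqlite3


def make_fixture(path=":memory:"):
    """Open (or create) a local SQLite file holding a toy package index."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per package, not the real PAUSE layout.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        "package TEXT PRIMARY KEY, version TEXT, dist TEXT)"
    )
    return conn


def register(conn, package, version, dist):
    """Record (or overwrite) which distribution provides a package."""
    conn.execute(
        "INSERT OR REPLACE INTO packages (package, version, dist) VALUES (?, ?, ?)",
        (package, version, dist),
    )
    conn.commit()


def lookup(conn, package):
    """Return (version, dist) for a package, or None if it is not indexed."""
    return conn.execute(
        "SELECT version, dist FROM packages WHERE package = ?", (package,)
    ).fetchone()
```

A test could call `make_fixture()` to get a throwaway in-memory database, `register` a few known packages, and then point the indexer code under test at it instead of the live database.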