My Oslo QA Hackathon Wrapup

brian_d_foy on 2008-04-11T17:32:32

I'm back from Oslo and back to normal life. One of the biggest benefits of the hackathon was simply setting aside that time to work on something and not think about anything else. Linpro provided the space, network (and network support), food, tram passes, and other things. I just had to show up each day. The Perl Foundation covered my plane flight, and LogicLab took care of my hotel expenses.

I laid out my goals in an earlier use.perl post: I wanted to work on Module::Release so it could test multiple perls in one run, and work on my index of BackPAN.

I finished the Module::Release stuff the first day by getting up early. While working on that, I was mostly sitting across from Adam Kennedy, who has his own release.pl. He does things very differently: I start in a working directory and release from there, while he starts anywhere he likes, creates a working directory, and generates all his module internals on the fly. I like some of his features, so I pulled them into Module::Release for the next version. H. Merijn Brand started using Module::Release to test his Text::CSV_XS modules, and he provided a lot of feedback on things he'd like to have. Most of these were cleanups to make the process work more smoothly for everyone. Since my release script mostly reflects how I do things, I hadn't covered the cases that didn't bother me, such as leaving files behind after a run. On the plane ride home I started moving some of the functionality around to make the major functional groups into modules. A lot of the core stuff deals with CVS and Sourceforge, which not many people use anymore. I'm working on a lightweight mixin process to load plugins. I've actually had that for a while, but I still have to make it releasable.

My big project was indexing BackPAN. I got quite a bit of code under control, but only got about 10% of the way through BackPAN. My goal is for anyone to be able to go from an installed module file and find which distribution it came from. I'll collect file meta-data and code signatures (e.g. $VERSION) to connect the installed files to the distributions. I released Module::Extract::Namespaces and Module::Extract::VERSION as functional units for the process. The rest of the code is ugly and cobbled together at the moment, but I should be able to clean that up soon. I was able to index about 2,000 distributions for dual-lived modules and only ran into about 200 errors. That sounds like a lot of errors, but usually an entire chain of distros will fail for the same reason. Most of the CPAN.pm distros don't like my indexer, for instance, so every CPAN.pm distro fails. The other failures are related to operating system dependencies or perl compilation options such as thread support.
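As a rough illustration of what a per-file "signature" could look like, here is a minimal sketch using only core modules. The `file_signature` helper and the fields it records are assumptions made for this example, not the indexer's actual code, and the regex only catches simple literal `$VERSION` assignments. Module::Extract::VERSION exists to handle the harder cases.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: collect a few identifying facts about one module file.
sub file_signature {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;

    my $md5 = Digest::MD5->new;
    my $version;
    while (my $line = <$fh>) {
        $md5->add($line);

        # Simplification: only matches literal assignments such as
        #   our $VERSION = '1.23';
        # Computed versions (sprintf, qw, and so on) are left undefined.
        if (!defined $version
            and $line =~ /\$(?:\w+::)*VERSION\s*=\s*['"]?([\w.]+)['"]?\s*;/) {
            $version = $1;
        }
    }
    close $fh;

    return {
        size    => -s $path,        # file size in bytes
        md5     => $md5->hexdigest, # digest of the file contents
        version => $version,        # undef if no simple assignment was found
    };
}
```

Matching an installed file against the index would then be a lookup on the digest, with the size and version as a fallback when the file has been modified locally.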

My BackPAN indexer tries each distro in a separate process. Adam Kennedy passed on a lot of wisdom about using PPI for some of this. For instance, PPI can sometimes spin out of control in very odd situations. I was already thinking about forking to handle the indexing in parallel, so I use the same forking to isolate each PPI run too. I can use alarm to shut down any runs that take too long. Initially I thought a 15 second alarm would be generous, but the Encode modules needed more time. We also talked about caching PPI's Perl-DOMs. That's already a feature in PPI, but I'd like to make those caches part of the index: no need to parse stuff yourself, just load the stored cache. Adam thinks this might not be portable.
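The fork-plus-alarm idea can be sketched roughly like this. The `run_isolated` helper and its return values are made up for this example and are not the indexer's real interface. It assumes a Unix-like system where fork and SIGALRM behave normally.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code in a child process and give up after
# $timeout seconds. Returns 'ok', 'error', or 'timeout'.
sub run_isolated {
    my ($code, $timeout) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: with the default SIGALRM action, the kernel kills this
        # process when the alarm fires, even if the code is stuck in a loop.
        $SIG{ALRM} = 'DEFAULT';
        alarm $timeout;
        my $ok = eval { $code->(); 1 };
        alarm 0;
        POSIX::_exit($ok ? 0 : 1);   # skip END blocks and buffer flushing
    }

    # Parent: wait for the child and decode how it ended.
    waitpid($pid, 0);
    my $status = $?;
    return 'timeout' if ($status & 127) == POSIX::SIGALRM();
    return ($status >> 8) == 0 ? 'ok' : 'error';
}
```

A caller would wrap the work for one distribution in the code reference, for example `run_isolated(sub { index_distro($file) }, 15)`, where `index_distro` is a stand-in name. Since the work happens in a child process, any results have to come back through a file or a pipe rather than a return value.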

The output of the indexer is just a YAML dump of everything it recorded. I'll worry about a database server later, but even then it will probably just be sqlite.

I got most of the coding for the indexing done, although not the actual full indexing. I still have to deal with a couple of common situations, such as distros using Module::Install. I'm working on a module, Distribution::Guess::BuildSystem, which will help the indexer figure out what might happen when it tries a build so it can delegate the indexing to something that knows how to handle that system, including any oddities it might have.
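To give a feel for the kind of guessing involved, here is a minimal sketch that looks only at which files a distribution ships. The `guess_build_system` function is invented for this example and is not the interface of Distribution::Guess::BuildSystem. The order of the checks is also an assumption, since a distro can ship both a Build.PL and a Makefile.PL.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: guess the build system from files in an unpacked distro.
sub guess_build_system {
    my ($dist_dir) = @_;

    # True if the given path (relative to the distro directory) exists
    my $has = sub { -e File::Spec->catfile($dist_dir, @_) };

    # Module::Install bundles itself under inc/, so check for that first
    return 'Module::Install'     if $has->('inc', 'Module', 'Install.pm');
    return 'Module::Build'       if $has->('Build.PL');
    return 'ExtUtils::MakeMaker' if $has->('Makefile.PL');
    return 'unknown';
}
```

The indexer could then dispatch on the returned name to a handler that knows that build system's oddities. A fuller version would probably also need to read the build file itself, since file names alone cannot tell, for example, a hand-written Makefile.PL from one generated for compatibility.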

I'll clean up the code and turn it loose on the full BackPAN. I also want to look at running it in some virtual machines so I can pick up the operating system specific modules too.

David Golden and rjbs were talking about a CPAN Metabase as a store of any sort of data that we can collect, so I figure the BackPAN data will get in there too.

Finally, I'm going to try to pull all of this together for a Perl Mongers talk, so I should have slides and a perlcast soon. :)

distribution guessing stuff

Alias on 2008-04-11T18:31:24

You might want to take a look in my repository at Module-Inspector, which does some of the sorts of things you are trying to do.