Ego mining CPAN data

autarch on 2007-09-21T19:20:27

The other day, I was wondering what percentage of CPAN I have sent patches for. I was kind of hoping for a nice impressive number like 1%.

I wrote a little script that takes a local CPAN mirror (courtesy of CPAN::Mini), extracts the latest version of every module, and looks for my name or email address in files that look like changelogs. This obviously catches more than patches, since in some cases I just submitted a bug report or suggestion.

It's not quite perfect since some CPAN authors will say something like "applied patch from RT12345" without a name. I didn't want to fetch all those different tickets, since that'd take a long time.
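For illustration only, a scan along these lines might look roughly like the Python sketch below. It is not the original script. It assumes the mirror holds only the latest release of each distribution (which is what CPAN::Mini is meant to keep), and the changelog file names and the function names `looks_like_changelog` and `tarball_mentions` are assumptions made up for this example.

```python
import re
import tarfile

# File names that commonly hold a changelog in a distribution.
# This list is an assumption for the sketch, not an exhaustive rule.
CHANGELOG_RE = re.compile(
    r"(^|/)(changes|changelog|history|news)(\.\w+)?$", re.IGNORECASE
)


def looks_like_changelog(member_name):
    """Return True if an archive member's path looks like a changelog file."""
    return CHANGELOG_RE.search(member_name) is not None


def tarball_mentions(tarball_path, needles):
    """Return True if any changelog-like file in the tarball contains a needle.

    Matching is a case-insensitive substring search, so it will also catch
    bug reports and suggestions, not only patches.
    """
    lowered = [needle.lower() for needle in needles]
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar:
            if not member.isfile() or not looks_like_changelog(member.name):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            # latin-1 never fails to decode, which is good enough for a search.
            text = handle.read().decode("latin-1").lower()
            if any(needle in text for needle in lowered):
                return True
    return False
```

Running `tarball_mentions` over every tarball under the mirror's `authors/id` directory, with a name and an email address as the needles, would give a list like the one that follows.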

So the list I came up with was this:

Apache::Compress, Apache::Filter, Apache::Session, App::Info, Catalyst::Action::REST, Class::Container, Class::Validating, CPAN, CPANPLUS, Data::Structure::Util, DateTime::Calendar::Chinese, DateTime::Calendar::Discordian, DateTime::Calendar::Hebrew, DateTime::Calendar::Julian, DateTime::Event::Recurrence, DateTime::Format::Duration, DateTime::Format::Natural, DateTime::Format::Strptime, DateTime::Incomplete, DateTime::Set, DateTime::Span::Birthdate, DBD::mysql, DBD::Pg, DBI, Devel::Cover, Email::Address, Exception::Class::TryCatch, ExtUtils::ModuleMaker, GD::SecurityImage, GraphViz, HTML::FillInForm, HTML::Tidy, IO::All, IPC::Shareable, Kwiki, Lingua::ZH::PinyinConvert, Log::Dispatch::Config, Log::Log4perl, mod_perl, Module::Build, Module::Signature, Net::SFTP::Foreign, Pod::Coverage, Set::Infinite, Spiffy, Spoon, Storable, SVK, SVN::Web, TAP::Parser, Test::Simple, Test::Taint, Thread::Pool, XML::Atom, XML::SAX::Expat, XML::SAX::Writer

It was fun to do this because I found a few cases where I'd totally forgotten having been involved.

This isn't quite 1%; it's closer to 0.5% (57 modules out of 12208, or about 0.47%). Of course, if you count the modules I've personally released, I end up with 97 modules, or about 0.79%, which is closer to 1% but still not quite there.
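The arithmetic behind those percentages is simple enough to check by hand, but here it is as a tiny Python sketch. The helper name `share_percent` is made up for this example; the counts are the ones quoted above.

```python
def share_percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


patched_only = share_percent(57, 12208)
with_own_releases = share_percent(97, 12208)
print(f"{patched_only:.2f}% {with_own_releases:.2f}%")  # prints "0.47% 0.79%"
```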

BTW, my original goal was to build a database of who patched what, but parsing the bazillion ways people say "patch from so-and-so" is really hard, and the RT thing is still a big problem. Identity is another problem, since people get referred to in many ways: by full name, first name ("patch from Stas"), email address, or nicknames like CPAN ids or IRC nicks.
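To make the two problems concrete, here is a rough Python sketch of what a first attempt might look like. It handles only two phrasings out of the many that exist, and it resolves identities through a hand-maintained alias table. The patterns, the function names, and the example aliases (`jane`, `JEXAMPLE`) are all hypothetical.

```python
import re

# Stop a credited name at a comma, semicolon, parenthesis, a sentence-ending
# period, or the end of the line. A period inside an email address is kept.
_NAME = r"(?P<who>.+?)(?=[,;()\n]|\.(?:\s|$)|$)"

# Two illustrative phrasings; real changelogs use far more.
CREDIT_PATTERNS = [
    re.compile(r"patch(?:es)?\s+(?:from|by)\s+" + _NAME, re.IGNORECASE),
    re.compile(r"thanks\s+to\s+" + _NAME, re.IGNORECASE),
]


def extract_credits(changelog_text):
    """Return the raw credited strings found in a changelog, one per line at most."""
    found = []
    for line in changelog_text.splitlines():
        for pattern in CREDIT_PATTERNS:
            match = pattern.search(line)
            if match:
                found.append(match.group("who").strip())
                break
    return found


def canonical_identity(raw, aliases):
    """Map a raw credit to a canonical id via an alias table.

    Returns None when the credit cannot be resolved, for example when the
    changelog names only a ticket such as "RT12345".
    """
    return aliases.get(raw.strip().lower())
```

With an alias table like `{"jane": "JEXAMPLE", "jane@example.org": "JEXAMPLE"}`, both "patch from Jane" and "thanks to jane@example.org" resolve to the same person, while a bare ticket reference stays unresolved. Building and maintaining that table is the hard part.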

It'd be pretty cool to get that data, though, since we could see things like modules with the most patchers, most patches, most frequent patchers, etc. Maybe I'll get back to this sometime.