WE CAN HAZ (CPAN) DATA!!!

Alias on 2009-01-15T13:54:30

I've been chasing an idea for a few years now without a huge amount of success.

The idea is to take all the data from all the different bits of CPAN and tie them together effectively, to have CPAN namespaces that fundamentally represent a data source rather than code.

My not-particularly-successful Data::Package class was an attempt to move in this direction, but never really got very far. And my repository has a couple of stillborn attempts at something like CPAN::Dataset in it.

The beginning of a new solution arrived with the idea that you can fairly easily post a copy of your data in the form of a compressed SQLite database.

ORLite::Mirror solved the actual functionality needed to create a class for a remote SQLite database.
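
For anyone who hasn't looked at what "a class for a remote SQLite database" involves, the fetch-and-unpack half of the idea can be sketched in a few lines. This is a rough Python illustration of the concept only, not ORLite::Mirror's API or implementation; the helper name `mirror_sqlite`, the cache directory argument, and the extension-to-decompressor table are all assumptions made for the example.

```python
import bz2
import gzip
import shutil
import sqlite3
import urllib.request
from pathlib import Path

# Hypothetical mapping from archive extension to a decompressing opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def mirror_sqlite(url, cache_dir):
    """Fetch a compressed SQLite file and return a connection to the local copy."""
    name = url.rsplit("/", 1)[-1]
    suffix = Path(name).suffix
    if suffix not in OPENERS:
        raise ValueError(f"unsupported compression: {name}")

    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    archive = cache / name
    database = cache / Path(name).stem  # e.g. data.db.bz2 -> data.db

    # Download the compressed archive unchanged.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    # Decompress it next to the archive.
    with OPENERS[suffix](archive, "rb") as src, open(database, "wb") as out:
        shutil.copyfileobj(src, out)

    return sqlite3.connect(str(database))
```

A real mirror would also want to skip the download when the remote file has not changed; that part is left out of this sketch.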

But building an actual distribution for CPANTS or CPAN Testers was still not particularly economical because you don't CONTROL that data, and it's hard to justify the work of writing tests and documentation when the dataset might change underneath you and force you to rewrite the documentation as well.

The still-forming ORLite::Pod enhancement to ORLite seems to add the extra automation needed here. It reduces the cost of producing and maintaining a client-side ORM for someone else's data to the point where it becomes quite easy to just throw together a distribution for any dataset you can get a URL for.

To kick the process off, I've created a couple of new distributions in a new ORDB namespace (to hold these ORLite-based remote SQLite databases).

ORDB::CPANTesters is a simple single-table ORDB that lets you search the CPAN Testers database.

ORDB::CPANTS is a multi-table ORDB that lets you work with the CPANTS data.

I've uploaded these distributions, despite ORLite::Pod not being fully completed yet, so I can get a better idea of how this all works in practice.

You are welcome to download and try these distributions out. Just be warned that because the classes are code-generated from the SQLite database itself, the modules need to pull down the databases (the CPANTesters one is almost 100 MB) in order to compile.
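
To make "code-generated from the SQLite database itself" a little more concrete, here is a minimal Python sketch of the general technique: read the table and column names out of the schema, then build one class per table at load time. It is an illustration under assumptions, not how ORLite generates its Perl classes; `generate_classes`, `_make_class`, and the `select` method signature are names invented for this example.

```python
import sqlite3


def generate_classes(conn):
    """Build one class per table, with columns discovered from the schema."""
    classes = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns = [row[1] for row in info]
        classes[table] = _make_class(conn, table, columns)
    return classes


def _make_class(conn, table, columns):
    def __init__(self, *values):
        # One attribute per column, in schema order.
        for column, value in zip(columns, values):
            setattr(self, column, value)

    @classmethod
    def select(cls, where="", params=()):
        # "where" is raw SQL appended after the table name, e.g. "WHERE x = ?".
        sql = f'SELECT * FROM "{table}" {where}'
        return [cls(*row) for row in conn.execute(sql, params)]

    members = {"__init__": __init__, "select": select, "columns": tuple(columns)}
    return type(table, (object,), members)
```

Because the classes only exist once the schema has been read, nothing can be loaded until the database file is on disk, which is the same reason the download has to happen before the module compiles.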

In the meantime, what else can you think of that I can wrap a module around? I've got to get ahead of ZOFFIX on the leaderboard again somehow :)


The CPAN Testers DB

barbie on 2009-01-15T16:13:20

.. is at http://devel.cpantesters.org/cpanstats.db.bz2 (gzip version also available).

You're using the old database which contains LOTS of missing data. :)

Any better formats than bz2?

Alias on 2009-01-16T00:04:43

If we're going to do something other than gzip, you might be better off looking at an .ls archive. I see 30% size reductions over .gz for SQLite databases.
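
A size comparison like this is easy to reproduce on your own database file. The following is a minimal Python sketch, assuming the whole file fits in memory; `compressed_sizes` and `report` are names made up for the example. Note that Python's `lzma` module writes the .xz container by default rather than an lzip archive, so the numbers are only indicative of what an LZMA-family compressor achieves.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data):
    """Return the compressed size in bytes of `data` under each format."""
    return {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }


def report(path):
    """Print each format's size and its saving relative to gzip."""
    with open(path, "rb") as handle:
        data = handle.read()
    sizes = compressed_sizes(data)
    for name, size in sizes.items():
        saving = 100.0 * (1 - size / sizes["gzip"])
        print(f"{name:6} {size:12d} bytes  {saving:6.1f}% smaller than gzip")
```

Calling `report` on the path of an uncompressed SQLite file prints one line per format.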

Re:Any better formats than bz2?

Alias on 2009-01-16T00:43:44

That should be .lz

Re:Any better formats than bz2?

Alias on 2009-01-16T00:45:01

Of course, ORLite::Mirror doesn't support LZMA yet, but I'll work on that.

Re:Any better formats than bz2?

Aristotle on 2009-01-16T07:43:16

The difference is that both Linux distros and MacOS X systems ship with bzip2 tools. There is no OS at all that ships with LZMA tools.