So a (working) week has passed since the PPI 1.000 release, we've done a couple of point releases to fix some small issues, and I'm starting to get some idea of where the most developer interest lies, and which areas need to be addressed next.
The most immediate activity seems to be in the area of caching and indexing of various things. This makes quite a bit of sense. It's hard to build anything that usefully scans large numbers of documents when you can't manage or cache the resulting data.
So in my uncomfortably new benevolent dictator clothing, here's how we are going to do it.
Firstly, unless you have a compelling reason otherwise, I want to see these various indexing stores done using SQLite with Class::DBI wrappers. This is simple, works for pretty much everyone, and can easily be embedded inside things like editors.
This should also provide enough power to get things done, while preventing the dependency bloat that comes from everyone using their own favourite way to store things.
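To make this concrete, here's a rough sketch of the sort of wrapper I mean. Every name in it (the store class, the database file) is illustrative only, not a final API.

  package My::Metric::Store;

  use strict;
  use base 'Class::DBI';

  # One SQLite file holds the whole store, so it can live next to a
  # project or inside an editor's configuration directory.
  My::Metric::Store->connection(
      'dbi:SQLite:dbname=metrics.sqlite', '', '',
  );

  1;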
As an example of this, it's probably time to look properly at some form of PPI::Metric system.
Currently, I'm thinking this should take the form of one of the aforementioned SQLite/Class::DBI wrappers, with the primary table containing an MD5 hash of the document, a metric identifier, and the value of the metric.
A metric identifier is just a globally unique descriptor of the particular thing you are testing. To help keep this namespace manageable, we're going to base the metric identifier on the name of any CPAN (or otherwise) class you own, followed by a dot-separated metric name.
So for example, the total number of tokens with length less than 5 in a document would be represented with a metric entry like this:
md5sum: (MD5 hash for the document)
metric: PPI::Metric::Silly.short_tokens
value: 135
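In Class::DBI terms, that primary table maps to a small subclass of the store sketched above. Again, everything here is a first guess at the shape, not a finished schema.

  package My::Metric::Store::Metric;

  use strict;
  use base 'My::Metric::Store';

  # The underlying SQLite table would be something like:
  #
  #   CREATE TABLE metric (
  #       md5sum CHAR(32)     NOT NULL,
  #       metric VARCHAR(255) NOT NULL,
  #       value  TEXT,
  #       PRIMARY KEY (md5sum, metric)
  #   );
  My::Metric::Store::Metric->table('metric');
  My::Metric::Store::Metric->columns(Primary => qw{md5sum metric});
  My::Metric::Store::Metric->columns(Others  => qw{value});

  1;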
A second table will be used to link the actual files (on disk or elsewhere) to the metric data, like so:
file: /home/foo/example.pm
changed: (epoch time)
md5sum: (MD5 hash for the document)
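The matching Class::DBI class is just as small. Storing the file's change time alongside the hash means a reindexer can skip the MD5 work entirely when the timestamp hasn't moved. As before, the names are only a sketch.

  package My::Metric::Store::File;

  use strict;
  use base 'My::Metric::Store';

  # One row per file: where it is, when it last changed, and the MD5
  # of its content at that point.
  My::Metric::Store::File->table('file');
  My::Metric::Store::File->columns(Primary => qw{file});
  My::Metric::Store::File->columns(Others  => qw{changed md5sum});

  1;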
Doing it this way should allow several different tools to read and write entries, keep background and/or parallel indexers relatively sane, and allow a vast number of different metrics to be collected and cached in the same store.
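To show how a consumer might use all this, here's a get-or-calculate sketch for the silly short_tokens metric above. The helper name is made up, and I'm assuming the usual PPI::Document and Digest::MD5 interfaces.

  use PPI;
  use Digest::MD5 'md5_hex';

  sub short_tokens_for {
      my $file     = shift;
      my $document = PPI::Document->new($file)
          or die PPI::Document->errstr;
      my $md5sum   = md5_hex($document->serialize);

      # Cache hit: some other indexer already did the work.
      my ($row) = My::Metric::Store::Metric->search(
          md5sum => $md5sum,
          metric => 'PPI::Metric::Silly.short_tokens',
      );
      return $row->value if $row;

      # Cache miss: calculate the metric and store it for everyone else.
      my $value = scalar grep { length($_->content) < 5 } $document->tokens;
      My::Metric::Store::Metric->create({
          md5sum => $md5sum,
          metric => 'PPI::Metric::Silly.short_tokens',
          value  => $value,
      });
      return $value;
  }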
Thoughts? Comments? Volunteers? :)