Interesting discoveries while working on Top 100

Alias on 2009-02-16T05:13:10

As I try to stretch Algorithm::Dependency further in the Top 100 codebase, one of the more interesting discoveries has been that you can generalise just about all of the lists into one expression.

Alg::Dep works by taking a list of one or more selected vertices, and then walking the (extremely lightweight) dependency graph to find all of their dependencies. Doing that for each vertex as a single-element list, and ranking by the number of elements in the result, gives you the Heavy 100 list.
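
In Algorithm::Dependency terms, a minimal sketch of that single-vertex walk might look like the following. The hash-of-arrays source and the distribution names here are made up for illustration, not taken from the real Top 100 code.

    #!/usr/bin/perl
    # A minimal sketch, assuming a hypothetical %dependencies hash of
    # distribution => [ dists it depends on ].
    use strict;
    use warnings;
    use Algorithm::Dependency;
    use Algorithm::Dependency::Source::HoA;

    my %dependencies = (
        'Foo-Bar'     => [ 'Params-Util', 'Task-Weaken' ],
        'Params-Util' => [ 'Task-Weaken' ],
        'Task-Weaken' => [],
    );

    my $source = Algorithm::Dependency::Source::HoA->new( \%dependencies );
    my $dep    = Algorithm::Dependency->new( source => $source )
        or die "Failed to create the dependency object";

    # Walk the graph from each vertex as a single-element list, then rank
    # by the number of elements in the result to get the Heavy ordering.
    my %weight = map { $_ => scalar @{ $dep->depends($_) } } keys %dependencies;
    my @heavy  = sort { $weight{$b} <=> $weight{$a} } keys %weight;

    printf "%s (%d)\n", $_, $weight{$_} for @heavy;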

If you invert the graph, you get the Volatile 100 list.
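
Inverting a hash-of-arrays graph is just a matter of flipping every edge. Continuing the sketch above:

    # Flip every edge so that each dist maps to its direct dependants.
    my %dependants;
    $dependants{$_} ||= [] for keys %dependencies;
    for my $dist ( keys %dependencies ) {
        push @{ $dependants{$_} }, $dist for @{ $dependencies{$dist} };
    }

    # The same single-vertex walk over the inverted graph now counts
    # downstream dependants, which is the Volatile ordering.
    my $volatile = Algorithm::Dependency->new(
        source => Algorithm::Dependency::Source::HoA->new( \%dependants ),
    );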

The Debian list is quite a crude filter on the Volatile 100 list, and not really of any interest algorithmically.

What DOES get interesting, though, is when you try to do metrics that depend not only on the number of dependants, but also on their quality.

For example, let's imagine a notional "Above the Law" list. This would be a ranking of distributions which don't meet all the conditions for legal clarity (in Kwalitee this would be the expression "has_humanreadable_license && has_meta_license").

While you COULD derive the metric based on the raw number of downstream dependants, what would be more interesting is the number of OTHERWISE LEGAL modules that the parent is obstructing.

So a module with 50 dependants all written by the same author with no license scores 0, because fixing that module doesn't fix the legality of any of the dependants.

A module that has 10 or 20 modules below it that are all legal scores higher, because fixing this module clears up the legal ambiguity for many more modules below it.
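
Continuing the same sketch, and assuming a hypothetical %is_legal hash built from those two Kwalitee metrics, the scoring might look something like this:

    # Hypothetical Kwalitee results: true when a dist passes
    # has_humanreadable_license && has_meta_license.
    my %is_legal = (
        'Foo-Bar'     => 0,
        'Params-Util' => 1,
        'Task-Weaken' => 1,
    );

    # An illegal dist scores one point for every OTHERWISE LEGAL dist
    # downstream of it, so fixing it clears up the most ambiguity.
    my %above_the_law;
    for my $dist ( grep { ! $is_legal{$_} } keys %dependants ) {
        my $downstream = $volatile->depends($dist) || [];
        $above_the_law{$dist} = scalar grep { $is_legal{$_} } @$downstream;
    }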

Similar calculations are interesting for things like minimum Perl version as well, because they let us identify which modules are standing in the way of backwards compatibility the most.

And for author sets, we might also want to filter the dependants by "not the same author" to factor out author-specific clusters.

The neat thing is you can use the same tool to solve all these, and it turns out to be MapReduce.

Once you've walked the dependencies, you get a simple identifier set. You can then map the set by an arbitrary expression (such as "not author XXX" or "minimum version of Perl is lower than Y.ZZ") and then do an addition-reduction to get the dependant count.
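
As a rough sketch of that general shape, still following on from the example above (the %author_of and %min_perl lookups are hypothetical stand-ins for real metadata):

    # Hypothetical metadata lookups for the example dists.
    my %author_of = (
        'Foo-Bar'     => 'ADAMK',
        'Params-Util' => 'ADAMK',
        'Task-Weaken' => 'ADAMK',
    );
    my %min_perl = (
        'Foo-Bar'     => 5.006,
        'Params-Util' => 5.005,
        'Task-Weaken' => 5.004,
    );

    # Map each dependant through an arbitrary predicate, then reduce by
    # addition. In Perl, grep in scalar context is the idiomatic way to
    # write "map to 0/1, then sum".
    sub weighted_dependants {
        my ( $dep, $dist, $predicate ) = @_;
        return scalar grep { $predicate->($_) } @{ $dep->depends($dist) || [] };
    }

    # Dependants written by a different author than the parent dist.
    my $other_authors = weighted_dependants( $volatile, 'Foo-Bar',
        sub { $author_of{ $_[0] } ne $author_of{'Foo-Bar'} },
    );

    # Dependants whose declared minimum Perl is below some threshold.
    my $old_perl = weighted_dependants( $volatile, 'Foo-Bar',
        sub { ( $min_perl{ $_[0] } || 5 ) < 5.008 },
    );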

You can do similar things by applying a vertex metric for, say, the total number of significant PPI tokens in a distribution.

This would let us refine the Volatile and Heavy lists by factoring the actual volume of code into the calculation, and help us further prioritise the dependencies of big things like bioperl ahead of the dependencies of minor conveniences.
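
A sketch of one such per-vertex metric, counting significant PPI tokens in a single file (a real version would sum this across every file in a distribution):

    use PPI;

    # Count the significant tokens in one Perl file; whitespace, comments
    # and POD are excluded by the significant() test.
    sub significant_tokens {
        my $file     = shift;
        my $document = PPI::Document->new($file) or return 0;
        return scalar grep { $_->significant } $document->tokens;
    }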

That ALL of these problems generalise to Dependencies + Map + Reduce not only makes the code abstraction simpler, but also means there are some great opportunities here for parallelism, caching (at the map operation level) and other interesting strategies.