Announcing the "CPANTS Heavy 100" index

Alias on 2009-02-11T05:21:01

With the success of ORLite and ORDB::CPANTS I've finally managed to achieve something I've wanted for years: a cheap and well-encapsulated way to screw around with CPAN graph data.

This has been possible for a long time, but I think I've finally found the solution that can do it in a Closed Problem way, and with each piece separately being a working, completed and published module.

This means I don't have to look after some random script on a website somewhere, and it lets everyone else take my work and maintain it for me :)

By combining ORDB::CPANTS with Algorithm::Dependency, this also means that now I can finally achieve a dependency-weighting engine for the CPAN dataset that is self-updating and requires basically no maintenance.

This in turn gives me the opportunity to fix one of the CPAN artifacts that I've disliked for a long time, the Phalanx 100. What I dislike about it the most is that it is just so arbitrary.

It's in the right solution area, but it is ultimately edited by humans, and it isn't updated in real-time (so it doesn't respond to CPAN usage trends).

So my plan is to "upgrade" the Phalanx 100 into a range of "Top 100" indexes that are automatically generated, updated daily, and can be used as the basis for optimising and prioritising QA work.

I hope to release one of these new indexes every few days, with supporting code released to CPAN shortly after. As this list of lists starts to grow, I'd like to create a dedicated website (which I'll notionally call http://top100.cpan.org/) to hold all the indexes.

To kick off the indexes, I'll start with the "CPANTS Heavy 100".

This is an index containing the 100 CPAN distributions with the largest dependency chains. These represent excellent sample cases for testing scenarios relating to typical large scale Perl applications in the wild.

This index, however, makes no judgement whatsoever about any of the members of the index being good, bad, or otherwise. It is purely a naive graph calculation (which is why this list is dominated by plugins for other things that are themselves heavy).

CPANTS Heavy 100

748 Task-POE-All
276 MojoMojo-Formatter-RSS
271 MojoMojo-Formatter-Amazon
269 MojoMojo-Formatter-Emote
266 MojoMojo
216 Task-Padre-Plugin-Deps
211 Task-BeLike-RJBS
206 Parley
205 Foorum
203 Angerwhale
200 Task-Padre-Plugins
198 Task-Catalyst-Tutorial
196 Task-Email-PEP-All
191 Jifty-Plugin-ModelMap
189 Jifty-Plugin-Authentication-Bitcard
188 Jifty-Plugin-GoogleAnalytics
188 CommitBit
187 JiftyX-ModelHelpers
187 Jifty-Plugin-JapaneseNotification
186 Jifty
185 Reaction
181 Buscador
179 Rose-DBx-Garden-Catalyst
171 Module-CPANTS-Site
170 Task-CatInABox
168 Egg-Release-Authorize
164 Egg-Plugin-SessionKit
162 Catalyst-Controller-HTML-FormFu
159 Osgood-Server
158 Catalyst-Example-InstantCRUD
158 App-CamelPKI
157 Egg-Release-DBIC
151 Catalyst-Controller-Atompub
149 Task-SOSA
144 Padre-Plugin-CSS
142 Padre-Plugin-Perl6
142 Padre-Plugin-AcmePlayCode
142 CatalystX-CRUD-YUI
140 App-HistHub
139 Egg-Plugin-Crypt-CBC
138 Apache-SWIT-Security
138 Handel-Storage-RDBO
137 Egg-Release-DBI
137 Apache-SWIT
137 Test-Apocalypse
136 Egg-Release-XML-FeedPP
136 Egg-Plugin-Cache-UA
136 Egg-Release-JSON
136 Egg-Release-Mail
135 ShipIt-Step-Manifest
135 ShipIt-Step-DistClean
135 ShipIt-Step-ApplyYAMLChangeLogVersion
135 Egg-Plugin-Authen-Captcha
135 Egg-Plugin-Net-Ping
134 Catalyst-Helper-AuthDBIC
134 Dist-Joseki
134 Egg-Plugin-LWP
134 Egg-Plugin-Net-Scan
134 Egg-View-TT
134 Egg-Model-Cache
134 Egg-Model-FsaveDate
134 Egg-Plugin-Log-Syslog
133 Egg-Release
133 Padre-Plugin-HTML
133 Padre-Plugin-PerlCritic
132 Task-Email-PEP-NoStore
132 Padre-Plugin-InstallPARDist
131 CatalystX-ListFramework-Builder
131 Padre-Plugin-XML
130 HTML-FormFu-Model-DBIC
129 Catalyst-Authentication-Credential-OpenID
129 DBIx-Class-HTML-FormFu
129 Padre-Plugin-PAR
128 Padre-Plugin-Encrypt
128 Devel-ebug-HTTP
128 Padre-Plugin-JavaScript
127 Padre-Plugin-HTMLExport
127 Padre-Plugin-SpellCheck
127 Padre-Plugin-ViewInBrowser
127 Padre-Plugin-PerlTidy
127 Padre-Plugin-Alarm
126 DBIx-Class-FromValidators
126 Padre-Plugin-Parrot
126 Padre-Plugin-CommandLine
126 Padre-Plugin-Vi
126 Padre-Plugin-Encode
125 Padre
124 Catalyst-Controller-LeakTracker
124 cnutt-feed
124 Catalyst-Model-HTML-FormFu
124 Catalyst-Controller-DBIC-API
123 DBIx-Class-Schema-PopulateMore
123 Catalyst-Authentication-Store-KiokuDB
122 Catalyst-Plugin-Session-Store-KiokuDB
121 KiokuX-User
121 Pod-Browser
121 Titanium
121 Data-Conveyor
120 Rubric-Entry-Formatter-Markdown
119 Handel
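
For readers who want a feel for the kind of naive graph calculation involved, here is a minimal sketch in Python. It is not the code behind the index, which is built on ORDB::CPANTS and Algorithm::Dependency. The distribution names and the DEPENDS table are made up for illustration, and the "weight" here is assumed to be simply the number of distinct distributions reachable through dependencies.

```python
from collections import deque

# Hypothetical toy data: distribution -> direct dependencies.
# Real data would come from CPANTS; these names are invented.
DEPENDS = {
    "App-Blog": ["Web-Framework", "Template-Engine"],
    "App-Blog-Plugin-RSS": ["App-Blog", "XML-Feed"],
    "Web-Framework": ["HTTP-Core", "Template-Engine"],
    "Template-Engine": ["Text-Util"],
    "XML-Feed": ["Text-Util"],
    "HTTP-Core": [],
    "Text-Util": [],
}


def dependency_closure(dist, depends):
    """Return every distribution reachable from dist, excluding dist itself."""
    seen = set()
    queue = deque(depends.get(dist, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(depends.get(current, []))
    seen.discard(dist)  # in case of a circular dependency back to dist
    return seen


def heaviest(depends, limit=100):
    """Rank distributions by the size of their dependency closure."""
    weights = {d: len(dependency_closure(d, depends)) for d in depends}
    # Sort by weight descending, then by name for a stable order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


for dist, weight in heaviest(DEPENDS, limit=3):
    print(weight, dist)
```

Even in this toy data the plugin outranks the application it plugs into, which is the same effect that pushes plugins to the top of the real list.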


History lessons

petdance on 2009-02-11T06:33:14

First, I'm glad to see you doing this sort of thing. Automated CPAN analysis is good to have.

I'd like to correct a few notes on the Phalanx 100, though.

First, consider why the Phalanx 100 was created. The Phalanx project was an attempt to increase test coverage in the most-used modules on CPAN, so that Ponie would have a good test base to work with.

The Phalanx 100 was created by analysis of CPAN download logs for a one-month period from one mirror. We figured that would be a good enough estimate of "most-used."

The only human editing was creating the "special testing squad" of modules, since this was ultimately a testing project, and to remove two or three very specific modules that we judged to be too out there for people to work on.

I'm glad that people have found use from the Phalanx 100, although it hasn't been updated in years. But I'm also not surprised that it doesn't suit your purposes.

It's kinda like Frank says to Janet about Rocky, "I didn't make him... for YOU," and the audience says "But she gets him anywaaaaay."

Re:History lessons

Alias on 2009-02-11T11:00:36

At the time that the Phalanx 100 was created, my specific beef was that it didn't appear to factor in dependencies.

So while we got a list of 100 modules, they weren't ACTUALLY the 100 most used, just the top 100 in some other sense.

I do, however, appreciate that they were based on usage data, as opposed to dependency data. And I totally plan to start factoring that into some of the indexes, once I've got the basic naive ones working.

Re:History lessons

petdance on 2009-02-11T13:52:59

I guess I take issue with your "beef" because it was never intended for your use. We didn't make any assertions as to how the data should be used, so it's not fair for you to say it's not what you want.

Our feeling on dependencies was that dependencies would have to get downloaded, too, and so those downloads would show that traffic. So you get dependencies in that data, but not weighted by the number of other modules that use the dependency. A single-use dependency would get as much weight as, say, HTML::Parser.

Yay

acme on 2009-02-11T08:47:34

Now that would be a perfect list for testing a CPAN packaging system...

Can you publish the dependency chains?

jesse on 2009-02-13T22:06:16

I have some equivalent tools that crawl the packages themselves. I just ran my dependency chain tool for MojoMojo and came up with 239 deps rather than your 266. I'd be very curious to see what the discrepancy is.

most depended on

joeaguy on 2009-02-19T01:07:57

It would be good to have a few different measures of module popularity. Personally, I think a list of the "most depended on" modules would be really useful. That some other module author would use a module is, I think, a pretty good vote of confidence in the usefulness and quality of that module. Such rankings would especially help when trying to choose between roughly equivalent modules for a project.
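
A "most depended on" ranking could be sketched by turning the dependency graph around and counting how many distributions list each one as a direct dependency. The Python below is only an illustration under that assumption. The DEPENDS table and its names are hypothetical, and a real index might well count indirect dependents too.

```python
# Hypothetical toy data: distribution -> direct dependencies.
DEPENDS = {
    "App-Blog": ["Web-Framework", "Template-Engine"],
    "App-Blog-Plugin-RSS": ["App-Blog", "XML-Feed"],
    "Web-Framework": ["HTTP-Core", "Template-Engine"],
    "Template-Engine": ["Text-Util"],
    "XML-Feed": ["Text-Util"],
    "HTTP-Core": [],
    "Text-Util": [],
}


def direct_dependents(depends):
    """Count, for each distribution, how many others depend on it directly."""
    counts = {dist: 0 for dist in depends}
    for deps in depends.values():
        for dep in set(deps):  # ignore duplicate listings of one dependency
            counts[dep] = counts.get(dep, 0) + 1
    return counts


def most_depended_on(depends, limit=100):
    """Rank distributions by number of direct dependents, highest first."""
    counts = direct_dependents(depends)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


for dist, count in most_depended_on(DEPENDS, limit=3):
    print(count, dist)
```

Each dependent author counts as one "vote" here, so a single-use dependency scores 1 no matter how often it is downloaded.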