CPAN Size (Take two)

acme on 2003-06-01T13:50:30

A while ago I commented that CPAN was about the same size uncompressed as it was compressed. I guessed that this was because most distributions were small so that compression didn't help much. Inspired by some detecting-charsets-using-compression talk on IRC, I added packed and unpacked size to CPANTS.

For example, Acme-Buffy-1.3.tar.gz takes 2,381 bytes compressed, 5,170 bytes uncompressed (the total of all the file sizes, assuming directories are free).
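For the curious, here is a rough sketch of how those two numbers could be measured for a single tarball. It is not the CPANTS code; it is Python using only the standard library, and the function names are made up for illustration. It assumes "unpacked" means the sum of the regular file sizes in the archive, with directories counted as zero bytes.

```python
import os
import tarfile


def packed_and_unpacked(path):
    """Return (packed_bytes, unpacked_bytes) for a .tar.gz file."""
    # Packed size: the size of the archive file on disk.
    packed = os.path.getsize(path)
    with tarfile.open(path, "r:gz") as tar:
        # Unpacked size: regular files only, so directories are free.
        unpacked = sum(m.size for m in tar.getmembers() if m.isfile())
    return packed, unpacked


def compression_ratio(packed, unpacked):
    """Packed size divided by unpacked size (smaller means it compresses better)."""
    return packed / unpacked
```

With the Acme-Buffy figures above, compression_ratio(2381, 5170) comes out at roughly 0.46.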

The top 10 biggest compressed packages on CPAN are (ignoring Perl and Parrot): Unicode-Unihan-0.02.tar.gz (4,513,673), Chart-2.2.tar.gz (4,405,514), Harvey-1.02.1.tar.gz (4,358,848), Tk-800.024.tar.gz (3,489,636), bioperl-1.2.1.tar.gz (3,488,040), bioperl-1.2.tar.gz (3,425,575), Tk800.015.tar.gz (3,330,861), Lingua-ZH-CCDICT-0.02.tar.gz (2,704,573), bioperl-1.0.2.tar.gz (2,645,781), bioperl-1.0.tar.gz (2,547,171).

The top 10 biggest uncompressed packages on CPAN are (ignoring Perl and Parrot): Harvey-1.02.1.tar.gz (93,673,151), Net-SCP-Expect-0.09.tar.gz (43,253,713), Chart-2.2.tar.gz (15,322,596), Tk-800.024.tar.gz (14,959,413), Tk800.015.tar.gz (14,217,659), Unicode-Unihan-0.02.tar.gz (12,924,673), bioperl-1.2.1.tar.gz (12,555,654), Lingua-ZH-CCDICT-0.02.tar.gz (12,508,332), bioperl-1.2.tar.gz (12,245,838), DBIx-DBStag-0.01.tar.gz (11,189,211).

Compression does help a lot in some cases. For example, Net::SCP::Expect has a 40M test file which consists of the words "This is the small file For use in testing Net::SCP::Expect only Delete at your convenience" over and over. The distribution itself compresses down to a mere 159,500 bytes.
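To get a feel for why that test file squashes so well, here is a small Python sketch that gzips the same sentence repeated many times. The line breaks inside the phrase and the one-megabyte default are guesses for illustration, not the real test file, and the function name is invented.

```python
import gzip

# The sentence quoted in the post; where the line breaks fall is a guess.
PHRASE = (
    b"This is the small file\n"
    b"For use in testing Net::SCP::Expect only\n"
    b"Delete at your convenience\n"
)


def repeated_text_ratio(total_bytes=1_000_000):
    """Gzip roughly total_bytes of the repeated phrase; return packed/unpacked."""
    data = PHRASE * (total_bytes // len(PHRASE))
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    print(f"{repeated_text_ratio():.4f}")
```

The result should be a tiny fraction, in the same spirit as the 159,500 / 43,253,713 (about 0.0037) that the whole distribution manages.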

Only one distribution is actually bigger packed than unpacked: Bundle-Tk_OS2src-1.00.tar.gz is 576 bytes packed, 548 bytes unpacked.

Does high compressibility have any relation to how "good" a module is? Make up your own mind. The top 10 distributions that compress badly (the figure is packed size divided by unpacked size): Bundle-Tk_OS2src-1.00.tar.gz (1.0511), Image-Magick-Thumbnail-0.01.tar.gz (0.9620), Wx-Sample-XS-0.01.tar.gz (0.9583), File-Find-Rule-MMagic-0.02.tar.gz (0.9461), StatsView-1.4.tar.gz (0.9290), File-Find-Rule-ImageSize-0.03.tar.gz (0.9283), Image-Maps-Plot-FromLatLong-0.1.tar.gz (0.9264), Bundle-Expect-1.09.tar.gz (0.9187), Image-Density-0.1.tar.gz (0.9088), HTTP-Lite-2.1.4.tar.gz (0.9054).

And the 11 distributions that compress best: Net-SCP-Expect-0.09.tar.gz (0.0037), Class-Skin-0.05.tar.gz (0.0200), Acme-Ook-0.10.tar.gz (0.0315), Parse-Nibbler-1.10.tar.gz (0.0382), DBIx-DBStag-0.01.tar.gz (0.0426), Harvey-1.02.1.tar.gz (0.0465), CSS-1.01.tar.gz (0.0470), SuperPython-0.91.tar.gz (0.0499), Config-ApacheFormat-1.1.tar.gz (0.0600), EasyArgs-1.00.tar.gz (0.0633), Cisco-ShowIPRoute-Parser-1.01.tar.gz (0.0650).

What does this all mean? Dunno, don't ask me, I just make the numbers up ;-)