CPAN Size

acme on 2003-03-07T10:52:52

CPAN is 1.5G. That's how much disk space it takes to host a CPAN mirror. This morning I uncompressed every .tar.gz and .zip on CPAN for a laugh. The resulting uncompressed mass of directories takes up 1.7G. Looks like compression isn't helping a great deal, and tar and zip are mostly only good for packaging purposes.
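For the curious, this kind of measurement is easy to reproduce. Here's a rough sketch in Python (the helper name `archive_vs_extracted` is made up for illustration; a real run would walk every archive on a mirror and sum the two columns):

```python
import os
import tarfile

def archive_vs_extracted(path):
    """Compare one .tar.gz archive's on-disk size with the total
    size of the files packed inside it."""
    compressed = os.path.getsize(path)
    with tarfile.open(path, "r:gz") as tf:
        extracted = sum(member.size for member in tf)
    return compressed, extracted
```

Summing both numbers over a whole mirror gives you the 1.5G-vs-1.7G comparison above.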

Mmm, more stats...


Fascinating, anyone for bz2?

ajt on 2003-03-07T11:28:46

I'm amazed by your finding that an uncompressed CPAN is only 13% larger than the compressed version. I would have thought that anything text-based like a Perl module would compress very well, even with zip or tar.gz.

I wonder what is taking up all the space and refusing to compress?

I know bzip2 is very popular in the Cygwin world, and I've wondered whether, going forward, it would be useful for CPAN (or a future CPAN) to support it as well, to squeeze out a little more compression.
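The gzip-vs-bzip2 difference is easy to measure. A quick sketch using Python's stdlib (Python just as a convenient test bench here, not anything CPAN itself uses; the sample text is invented module-ish boilerplate):

```python
import bz2
import zlib

# A chunk of repetitive Perl-flavoured text, made large enough for
# both algorithms to get going (tiny inputs barely compress at all).
text = (b"use strict;\nuse warnings;\n"
        b"sub new { my ($class, %args) = @_; bless { %args }, $class }\n") * 500

gz = zlib.compress(text, 9)   # DEFLATE, the same algorithm gzip uses
bz = bz2.compress(text, 9)

print(len(text), len(gz), len(bz))
```

On big, redundant inputs bzip2 usually edges out gzip by a noticeable margin; on a few-kilobyte module the difference is noise.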

Re:Fascinating, anyone for bz2?

hfb on 2003-03-07T14:53:12

thousands of tiny little files...

Re:Fascinating, anyone for bz2?

TorgoX on 2003-03-08T04:07:09

But a .tar.gz doesn't feed the gzip algorithm thousands of tiny little files; it's just one file by that time.

Re:Fascinating, anyone for bz2?

hfb on 2003-03-08T07:06:15

No, but there isn't much compression to be had in a 3k file, no matter what algorithm you use. I recall that the average file size is around 50k... gz, bz2, Z... it'll all be much the same result.

Re:Fascinating, anyone for bz2?

Elian on 2003-03-07T16:12:19

Elaine's right, and bz2 won't help here. Maybe you'd squeeze things down by another 1%. Maybe. There's a lot of overhead to small files: modern compression programs work better the larger their input, and Perl modules just aren't that big. There's also a lot of incompressible overhead in tar's file structure information.

If you wanted to compress Perl modules better, you'd want a denser file-packing scheme than tar, and a compression scheme prepopulated with a lot of the common Perl substrings, so they didn't need to be discovered at compression time. You might see 20-25% better compression at that point, if you got really lucky.
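The preset-dictionary idea Elian describes is something DEFLATE actually supports. A sketch using zlib's `zdict` parameter (the dictionary contents below are just guessed Perl boilerplate, not a tuned dictionary, and the sample "module" is invented):

```python
import zlib

# Hypothetical preset dictionary of common Perl substrings,
# known to both the compressor and the decompressor up front.
PERL_DICT = (b"use strict;\nuse warnings;\npackage "
             b"our $VERSION = '"
             b"sub new {\n    my ($class, %args) = @_;\n")

module = (b"use strict;\nuse warnings;\npackage Acme::Example;\n"
          b"our $VERSION = '0.01';\n"
          b"sub new {\n    my ($class, %args) = @_;\n"
          b"    return bless { %args }, $class;\n}\n1;\n")

# Ordinary compression discovers all repeated substrings itself.
plain = zlib.compress(module, 9)

# Primed compression can reference the dictionary immediately.
co = zlib.compressobj(level=9, zdict=PERL_DICT)
primed = co.compress(module) + co.flush()

print(len(module), len(plain), len(primed))
```

The catch, of course, is that every client would need the exact same dictionary to decompress anything, which is why tar.gz's self-contained streams won out.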

Re:Fascinating, anyone for bz2?

ajt on 2003-03-07T18:46:54

While I agree that bz2 or some other compressor isn't going to fix the problem, I do find that on a tar of text files it's quite a bit more than 1% more efficient than gzip.

I can't comment on replacing the tar structure, but I've seen comments on its weaknesses in other places too.

I'm still amazed at how little compression there is in CPAN. The latest module I've uploaded, for example, shrank from 90kb to 24kb with gzip (22kb with bz2). What is in there that doesn't compress?

Re:Fascinating, anyone for bz2?

Elian on 2003-03-07T19:12:25

Look at the size of many of the files on CPAN. I don't have the space to slurp the whole thing down for analysis, but a quick scan through shows that a huge number of the archives are tiny: less than 15K. Lots of them are less than 10K. That's a size where compressors just don't have enough to work with to make much of a difference, so it doesn't matter what compressor you're using; there isn't enough there to compress at all usefully.

It's not that the data on CPAN is oddly uncompressible. It's that the data's in such small chunks that the compressors run out of data before they can actually do much with it.
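Elian's point is easy to demonstrate: the same text compresses far better per byte once there's bulk to work with. A small sketch (the sample text is invented, and "small" here is well under the 3k hfb mentions):

```python
import zlib

sample = (b"sub import {\n    my $class = shift;\n"
          b"    no strict 'refs';\n}\n")

small = sample          # a tiny "module", a few dozen bytes
large = sample * 1000   # the same text, with enough bulk to exploit

# Compressed-size-to-input-size ratio: lower means better compression.
ratio_small = len(zlib.compress(small, 9)) / len(small)
ratio_large = len(zlib.compress(large, 9)) / len(large)

print(ratio_small, ratio_large)
```

The small input comes out barely smaller (header and bookkeeping overhead dominate), while the large one shrinks dramatically; same data, same algorithm, wildly different ratios.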

To be fair...

belg4mit on 2003-03-07T17:19:31

A module's source is not the whole of a module's size.
In particular, the compiled binary of a moderately complicated XS module is significantly larger than its source.

Also, what if you only take the latest (or latest two) versions of any given module? A lot of authors haven't heard that BackPAN exists, and that the Master Librarian would like to see things under 700 MB.

Blame

pudge on 2003-03-11T16:33:50

Top 10:
124516  G/GR/GRAHAMC
82364   J/JH/JHI
69932   G/GS/GSAR
63344   C/CN/CNANDOR
31616   N/NI/NI-S
31588   I/IL/ILYAZ
28928   K/KR/KRISHPL
25244   T/TI/TIMB
24788   L/LD/LDS
20228   B/BI/BIRNEY
All of the above, though, have perl distributions (or documentation distributions).