I've talked about fast compression before, but how about slow compression? Enter the Lempel-Ziv-Markov chain-Algorithm:
$ gunzip perl-5.10.0.tar.gz
$ cp perl-5.10.0.tar ..
$ time gzip -9 perl-5.10.0.tar
real 0m11.490s
user 0m11.405s
sys 0m0.088s
$ cp ../perl-5.10.0.tar .
$ time bzip2 -9 perl-5.10.0.tar
real 0m17.501s
user 0m16.857s
sys 0m0.300s
$ cp ../perl-5.10.0.tar .
$ time lzma -9 perl-5.10.0.tar
real 2m0.121s
user 1m58.735s
sys 0m0.468s
So it's slow. So what?
$ ls -lh perl-5.10.0.tar*
-rw-r--r-- 1 acme acme 12M 2008-01-08 13:18 perl-5.10.0.tar.bz2
-rw-r--r-- 1 acme acme 15M 2007-12-18 17:41 perl-5.10.0.tar.gz
-rw-r--r-- 1 acme acme 9.4M 2008-01-08 13:19 perl-5.10.0.tar.lzma
Ahhh, it compresses better. How about decompression?
$ time gunzip perl-5.10.0.tar.gz
real 0m2.014s
user 0m0.792s
sys 0m0.192s
$ rm perl-5.10.0.tar
$ time bunzip2 perl-5.10.0.tar.bz2
real 0m6.231s
user 0m4.916s
sys 0m0.252s
$ rm perl-5.10.0.tar
$ time unlzma perl-5.10.0.tar.lzma
real 0m2.093s
user 0m1.752s
sys 0m0.216s
LZMA compresses well and is pretty fast at decompression. Add another tool to your compression toolbox...
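If you would rather repeat this kind of comparison from a script than from the shell, here is a minimal sketch using Python's standard gzip, bz2 and lzma modules. It is not what produced the timings above: the function name benchmark and its level parameter are made up for illustration, everything runs in memory, and the numbers will not match the command-line tools exactly.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data: bytes, level: int = 9) -> dict:
    """Return {name: (compressed_size, compress_seconds, decompress_seconds)}.

    `level` is passed to each codec as its compression level / preset,
    roughly mirroring the -9 flag used on the command line.
    """
    codecs = {
        "gzip": (lambda d: gzip.compress(d, compresslevel=level),
                 gzip.decompress),
        "bzip2": (lambda d: bz2.compress(d, compresslevel=level),
                  bz2.decompress),
        # FORMAT_ALONE is the legacy .lzma container written by the lzma tool.
        "lzma": (lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE,
                                         preset=level),
                 lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_seconds = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_seconds = time.perf_counter() - start

        # Sanity check: the round trip must give back the original bytes.
        assert unpacked == data
        results[name] = (len(packed), compress_seconds, decompress_seconds)
    return results
```

Reading the tarball with open("perl-5.10.0.tar", "rb").read() and passing the bytes to benchmark() gives one (size, compress time, decompress time) tuple per format. Note that LZMA at preset 9 wants a lot of memory for compression, so a lower level may be kinder on a small machine.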
"Development of this algorithm was sponsored by Intel"?
Re:oooh, smaller!
acme on 2008-01-09T07:31:10
It gets worse! Using PAQ, specifically paq8o8, I get:
$ time ./paq8o8 -5 perl-5.10.0.tar
real 171m42.046s
user 169m41.464s
sys 0m48.303s
$ ls -lh perl-5.10.0.tar.paq8o8
-rw-r--r-- 1 acme acme 6.2M 2008-01-08 16:44 perl-5.10.0.tar.paq8o8
And 30 minutes to decompress. Very small, but very very slow.
For comparison, the complete history of Perl
mugwumpjism on 2008-01-10T04:10:29
So, I started with over a hundred megabytes of tarballs from history.perl.org, and got those down to 6MB of git pack. Once into the Perforce history, I was looking at reducing the ~400MB of Perforce repository even further. After my initial export, it was already something like 250MB of Git pack (I wrote the exporter to make best use of on-the-fly delta compression). I left a fairly aggressive repack on it going, and it took about 30 minutes and left me with these packs, which are MUCH smaller. The decompression is slower, so some people would probably like to "unroll" their pack to be slightly looser if they were doing a lot of history mining.
Git's compression does a much better job of finding string matches than a straightforward stream compressor. For this reason, I often refer to stream compression as premature compression: once you have two of these archives laid side by side, together they might be representable in 52% of the size they take up as separately compressed archives.
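To illustrate the "premature compression" point, here is a small sketch in Python. It is not Git's delta algorithm and does not reproduce the 52% figure. It only shows that two nearly identical inputs compress far better when the compressor sees them together than when each is compressed on its own. The helper names make_versions and separate_vs_joint are invented for this example, and the inputs are seeded random bytes, chosen so that each one is incompressible by itself.

```python
import lzma
import random


def make_versions(size=100_000, edits=50, seed=0):
    """Build two byte strings that differ in only a few positions."""
    rng = random.Random(seed)
    first = bytearray(rng.getrandbits(8) for _ in range(size))
    second = bytearray(first)
    for _ in range(edits):
        # Flip every bit of one randomly chosen byte.
        second[rng.randrange(size)] ^= 0xFF
    return bytes(first), bytes(second)


def separate_vs_joint(first: bytes, second: bytes):
    """Return (bytes when compressed separately, bytes when compressed together)."""
    separate = len(lzma.compress(first)) + len(lzma.compress(second))
    joint = len(lzma.compress(first + second))
    return separate, joint


if __name__ == "__main__":
    separate, joint = separate_vs_joint(*make_versions())
    print(f"separate: {separate} bytes, joint: {joint} bytes")
```

Compressed separately, the shared content is paid for twice. Compressed together, the second copy turns into back-references to the first, which is roughly the redundancy a delta-based store like a Git pack can exploit across versions.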
I've long used 7-Zip when I'm forced to use a Windows system, but I've never used its native 7z format (LZMA).
From a quick scan of Wikipedia it seems that the 7z format is LZMA compression with a 64-bit header and optional extras, and the plain lzma tool as described by you here is a raw LZMA compression stream. They are incompatible in that the two tools can't yet process each other's files, which is a shame.
I can see lzma files replacing bzip2 files in my archives now. How much smaller could CPAN be made if we switched to