LZMA

acme on 2008-01-08T13:32:58

I've talked about fast compression before, but how about slow compression? Enter the Lempel-Ziv-Markov chain algorithm (LZMA):

$ gunzip perl-5.10.0.tar.gz
$ cp perl-5.10.0.tar ..
$ time gzip -9 perl-5.10.0.tar 
real    0m11.490s
user    0m11.405s
sys     0m0.088s
$ cp ../perl-5.10.0.tar .
$ time bzip2 -9 perl-5.10.0.tar
real    0m17.501s
user    0m16.857s
sys     0m0.300s
$ cp ../perl-5.10.0.tar .
$ time lzma -9 perl-5.10.0.tar
real    2m0.121s
user    1m58.735s
sys     0m0.468s
So it's slow. So what?
$ ls -lh perl-5.10.0.tar*
-rw-r--r-- 1 acme acme  12M 2008-01-08 13:18 perl-5.10.0.tar.bz2
-rw-r--r-- 1 acme acme  15M 2007-12-18 17:41 perl-5.10.0.tar.gz
-rw-r--r-- 1 acme acme 9.4M 2008-01-08 13:19 perl-5.10.0.tar.lzma
Ahhh, it compresses better. How about decompression?
$ time gunzip perl-5.10.0.tar.gz 
real    0m2.014s
user    0m0.792s
sys     0m0.192s
$ rm perl-5.10.0.tar
$ time bunzip2 perl-5.10.0.tar.bz2 
real    0m6.231s
user    0m4.916s
sys     0m0.252s
$ rm perl-5.10.0.tar
$ time unlzma perl-5.10.0.tar.lzma 
real    0m2.093s
user    0m1.752s
sys     0m0.216s
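LZMA compresses well (9.4M here, against 12M for bzip2 and 15M for gzip) and is pretty fast at decompression (about 2 seconds of wall-clock time, on par with gunzip and roughly three times faster than bunzip2). Add another tool to your compression toolbox...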
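If you would rather script the comparison than retype the commands, here is a minimal sketch using Python's standard-library `gzip`, `bz2` and `lzma` modules. It is not what produced the numbers above: those came from the command-line tools, and the libraries may behave differently. The function name `compare` and its `level` parameter are made up for illustration, and `FORMAT_ALONE` is used on the assumption that the legacy `.lzma` container is the closest match to what the `lzma` tool writes.

```python
import bz2
import gzip
import lzma
import time


def compare(data: bytes, level: int = 9) -> dict:
    """Return {name: (compressed_size, compress_secs, decompress_secs)}."""
    codecs = {
        "gzip": (lambda d: gzip.compress(d, compresslevel=level), gzip.decompress),
        "bzip2": (lambda d: bz2.compress(d, compresslevel=level), bz2.decompress),
        # FORMAT_ALONE selects the legacy .lzma container (an assumption about
        # what best mirrors the lzma command-line tool).
        "lzma": (
            lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE, preset=level),
            lzma.decompress,
        ),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name} did not round-trip the input")
        results[name] = (len(packed), middle - start, end - middle)
    return results
```

To try it, read any large file into memory (for example the uncompressed tarball) and pass the bytes to `compare`. Note that LZMA at preset 9 wants a lot of memory for compression, so a lower `level` may be kinder on a small machine.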


oooh, smaller!

nicholas on 2008-01-08T13:42:33

"Development of this algorithm was sponsored by Intel"? :-)

Re:oooh, smaller!

acme on 2008-01-09T07:31:10

It gets worse! Using PAQ, specifically paq8o8, I get:

$ time ./paq8o8 -5 perl-5.10.0.tar
real    171m42.046s
user    169m41.464s
sys     0m48.303s
$ ls -lh perl-5.10.0.tar.paq8o8
-rw-r--r-- 1 acme acme 6.2M 2008-01-08 16:44 perl-5.10.0.tar.paq8o8
And 30 minutes to decompress. Very small, but very very slow.

For comparison, the complete history of Perl

mugwumpjism on 2008-01-10T04:10:29

So, I started with over a hundred megabytes of tarballs from history.perl.org, and got those down to 6MB of git pack. Once into the Perforce history, I was looking at reducing the ~400MB Perforce repository even further. After my initial export, it was already something like 250MB of Git pack (I wrote the exporter to make best use of on-the-fly delta compression). I left a fairly aggressive repack running on it; it took about 30 minutes and left me with these packs, which are MUCH smaller. The decompression is slower, so people doing a lot of history mining would probably like to "unroll" their pack to be slightly looser.

Git's compression does a much better job of finding string matches than a straightforward stream compressor. For this reason, I often refer to stream compression as premature compression: once you have two of these archives laid side by side, together they might be representable in 52% of the size they take up as separately compressed archives.
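The "side by side" effect can be shown with a toy example. This is only a sketch and not Git's delta algorithm: it builds two made-up, nearly identical blobs of pseudo-random bytes (so neither compresses on its own), then compares compressing them one at a time with compressing them together using Python's `lzma` module. The helper names and sizes are arbitrary choices for illustration.

```python
import lzma
import random


def make_versions(size: int = 200_000, seed: int = 0):
    """Build two nearly identical, individually incompressible blobs."""
    rng = random.Random(seed)
    first = rng.getrandbits(8 * size).to_bytes(size, "big")
    # The "next release": identical except for a 100-byte edit near the start.
    second = first[:1000] + b"x" * 100 + first[1100:]
    return first, second


def separate_vs_joint(first: bytes, second: bytes):
    """Total size when compressed one by one, and when compressed together."""
    separate = len(lzma.compress(first)) + len(lzma.compress(second))
    joint = len(lzma.compress(first + second))
    return separate, joint


first, second = make_versions()
separate, joint = separate_vs_joint(first, second)
print(f"separately: {separate} bytes, side by side: {joint} bytes")
```

Compressed together, the second blob can be encoded mostly as back-references into the first, which is the redundancy that is lost once each release has been compressed on its own.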

7Zip

ajt on 2008-01-09T09:23:10

I've long used 7-Zip when I'm forced to use a Windows system, but I've never used its native 7z format (LZMA).

From a quick scan of Wikipedia, it seems that the 7z format is LZMA compression with a 64-bit header and optional extras, while the plain lzma tool described here produces a raw LZMA compression stream. They are incompatible in that the two tools can't yet process each other's files, which is a shame.

I can see lzma files replacing bzip2 files in my archives now. How much smaller could CPAN be made if we switched to .tar.lzma from .tar.gz files then?
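The wrapper-versus-stream point can be poked at from Python, with the caveat that the standard-library `lzma` module does not read 7z archives at all. This sketch only shows that the same LZMA-family compression can sit in different wrappers (the legacy `.lzma` one and the newer `.xz` one) and that a decoder told to expect one will refuse the other. The sample data and the helper `can_decode` are invented for the example.

```python
import lzma

data = b"perl-5.10.0 " * 5000

# Same input, two different containers around the compressed stream.
legacy = lzma.compress(data, format=lzma.FORMAT_ALONE)  # legacy .lzma
xz = lzma.compress(data, format=lzma.FORMAT_XZ)         # .xz


def can_decode(blob: bytes, fmt: int) -> bool:
    """True if blob decodes back to data when the decoder expects fmt."""
    try:
        return lzma.decompress(blob, format=fmt) == data
    except lzma.LZMAError:
        return False


for label, blob in (("legacy .lzma", legacy), ("xz", xz)):
    print(
        label,
        "as .lzma:", can_decode(blob, lzma.FORMAT_ALONE),
        "as .xz:", can_decode(blob, lzma.FORMAT_XZ),
        "auto-detect:", can_decode(blob, lzma.FORMAT_AUTO),
    )
```

A tool that auto-detects the wrapper can read both, which is presumably what it would take for the two command-line tools to handle each other's files.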

lrzip

Ed Avis on 2008-01-25T11:06:17

Have a look at <a href="http://ck.kolivas.org/apps/lrzip/">lrzip</a> which is a combination of LZMA and rzip. That is, it has an rzip preprocessing stage that removes long-distance redundancy from the data and then does LZMA compression.

It doesn't always compress tighter than LZMA but it's usually much faster.

<pre><tt>
% time lzma perl-5.10.0.tar

real 3m33.665s
user 3m31.538s
sys 0m0.530s
% ls -l perl-5.10.0.tar.lzma
-rw------- 1 eda eda 10100884 2008-01-25 10:50 perl-5.10.0.tar.lzma
% time lzma -d perl-5.10.0.tar.lzma

real 0m3.247s
user 0m2.957s
sys 0m0.250s

% time lrzip -q perl-5.10.0.tar

real 1m29.689s
user 1m28.823s
sys 0m0.410s
% ls -l perl-5.10.0.tar.lrz
-rw------- 1 eda eda 10771148 2008-01-25 10:53 perl-5.10.0.tar.lrz
</tt></pre>

So, compression nearly as good (the .lrz file is about 7% larger), but more than twice as fast (roughly 1m30s against 3m34s). Decompression is fast but not as fast as plain lzma:

<pre><tt>
% rm perl-5.10.0.tar
% time lrzip -d perl-5.10.0.tar.lrz

real 0m7.519s
user 0m4.274s
sys 0m3.169s
</tt></pre>

You can tweak the settings of lrzip to get different space/speed/memory usage tradeoffs.