Finding an extra 10% tarball compression

Alias on 2010-02-16T04:26:44

One of the downsides of Strawberry Perl's move from the InnoSetup .exe installer to the Microsoft native .msi installer was that we had to switch from LZMA compression to the rather less spectacular MSI native compression (which appears to just be deflate or something similar).

Our headline installer went from 17meg to 32meg overnight.

If you were paying (stupidly) close attention to the latest release, you might have noticed that Curtis managed to shrink the installer by 3meg, without changing the compression mechanism at all, and while adding slightly more content to the package.

How?

Via the curious method of simply changing the order in which he added the files to the archive: sorting by file extension instead of by file name.

The grouping (even at a naive level) of similar types of content into the same area of the resulting file improved dictionary efficiency so much that it resulted in nearly a 10% improvement over plain deflate (which is almost as good as switching to bzip2).
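As a rough illustration of the idea (a sketch, not the actual Strawberry Perl build code), here's a minimal bit of Perl that sorts a file list by extension before it gets fed to an archiver:

    use strict;
    use warnings;
    use File::Basename qw(fileparse);

    my @files = @ARGV;

    # Pull out each file's extension (empty string if it has none) and
    # sort on that first, then on the full name to keep the order
    # deterministic within each extension group.
    my @sorted = map  { $_->[1] }
                 sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
                 map  { my (undef, undef, $ext) = fileparse($_, qr/\.[^.]*/);
                        [ lc $ext, $_ ] }
                 @files;

    print "$_\n" for @sorted;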

What would be even more awesome would be combining this change with LZMA as well (which builds dictionaries across much bigger areas of the file).

And if you could do it in something less than O(n^2) time, it might also be interesting to test pairs of files directly, to discover by brute force which file order was most efficient to feed into the compression routine.
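Purely as a sketch of that brute-force idea (the pair scoring and the greedy chaining here are my own assumptions, not anything any existing tool does), you could score how well each pair of files compresses together and then chain the best matches:

    use strict;
    use warnings;
    use Compress::Zlib qw(compress);

    sub slurp {
        my ($file) = @_;
        open my $fh, '<:raw', $file or die "open $file: $!";
        local $/;
        return scalar <$fh>;
    }

    my @files = @ARGV;
    my %data  = map { $_ => slurp($_) } @files;

    # Bytes saved by compressing two files together rather than
    # separately. A crude proxy for shared dictionary; higher is better.
    sub affinity {
        my ($x, $y) = @_;
        my $separate = length(compress($data{$x}))
                     + length(compress($data{$y}));
        return $separate - length(compress($data{$x} . $data{$y}));
    }

    # Greedy chain: always append the remaining file with the best
    # affinity to the current tail. This needs O(n^2) compressions,
    # so it is only practical for toy-sized file sets.
    my @order = (shift @files);
    while (@files) {
        my ($best) = sort {
            affinity($order[-1], $b) <=> affinity($order[-1], $a)
        } @files;
        push @order, $best;
        @files = grep { $_ ne $best } @files;
    }

    print "$_\n" for @order;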

Archive::Tar::Optimize anyone?


It wasn't even sorting by the filename.

DiamondInTheRough on 2010-02-16T21:27:19

Windows Installer XML was sorting by the ID I gave the file. Previously, that was a GUID, with results you can imagine. Now I put the extension, followed by a CRC32, into that ID, so the files sort by extension.
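Something along these lines, assuming the CPAN module String::CRC32 for the checksum (a sketch of the scheme, not the actual build code):

    use strict;
    use warnings;
    use String::CRC32 ();   # CPAN module providing crc32()

    sub file_id {
        my ($path) = @_;
        my ($ext) = $path =~ /\.([^.\\\/]+)\z/;
        $ext = defined $ext ? uc $ext : 'NONE';
        # Extension first, so that sorting the IDs groups files by
        # type; a CRC32 of the full path keeps each ID unique enough.
        return sprintf 'F_%s_%08X', $ext, String::CRC32::crc32($path);
    }

    print file_id('perl/bin/perl.exe'), "\n";   # prints F_EXE_xxxxxxxx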

flashbacks

petdance on 2010-02-26T21:11:08

This reminds me of the Bad Old Days when you'd fret over whether LZH was better than ARJ or ZIP.