Now that my blog is listed on the Ironman planet, Matt Trout has loaned me his chainsaw and suggested that asking for help here is likely to garner a response, because he says so.
So consider this an official request for assistance to help some overloaded developers fix what I consider to be the most important bug in the toolchain right now.
It's a bug in Archive::Extract, and it's probably not that much work, but neither Jos Boumans nor I have the free time right now to fix it.
The bug in question is that when Archive::Extract uses Archive::Tar to unroll a tarball, it uses the wrong API. Instead of using the memory-efficient streamed extraction API to roll the tarball out to disk directly, it loads the whole thing into memory and unpacks it from there.
It should probably use code similar to the implementation of Ivor's Archive::Tar::Streamed instead.
http://search.cpan.org/user/ivorw/Archive-Tar-Streamed-0.03/
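To make the difference concrete, here is a rough sketch of the two styles, not the actual Archive::Extract code. It assumes an Archive::Tar recent enough to provide the iter class method and Archive::Tar::File's extract method; the function names extract_in_memory and extract_streamed are made up for illustration.

```perl
use strict;
use warnings;
use Archive::Tar;

# In-memory style: read() parses the entire archive into memory
# before a single file is written to disk.
sub extract_in_memory {
    my ($tarball) = @_;
    my $tar = Archive::Tar->new;
    $tar->read($tarball) or die "Cannot read $tarball: " . $tar->error;
    my @files = $tar->extract or die "Extraction failed: " . $tar->error;
    return map { $_->full_path } @files;
}

# Streamed style: iter() returns a closure that hands back one
# Archive::Tar::File at a time, so only the current entry is held.
# (Assumption: the installed Archive::Tar provides iter.)
sub extract_streamed {
    my ($tarball) = @_;
    my $next = Archive::Tar->iter($tarball)
        or die "Cannot open $tarball: " . Archive::Tar->error;
    my @names;
    while ( my $file = $next->() ) {
        $file->extract
            or die "Failed to extract " . $file->full_path;
        push @names, $file->full_path;
    }
    return @names;
}
```

Both functions extract relative to the current directory and return the list of extracted paths, so a caller could swap one for the other.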
This is a big problem because, once all the memory inflation and memory copying needed for this loading has happened, a couple of big pathological distributions on CPAN consume almost the entire 2 GB memory limit of the (32-bit) process.
This bug makes CPAN on Win32 much slower and more memory-bloaty, but worse still, it takes CPAN::Mini::Visit over the process limit and crashes it. That means this bug is currently blocking work on the GreyPAN scanner experiments (Perl::Metrics2), the META.yml database ORDB::CPANMeta, the permissions-aware replacement for the rather unreliable CPANTS dependency graph (CPANTS::Weight), my unified CPANDB SQLite index, and the sorely-needed accuracy fixes for the Top 100 website.
Improving almost all of these things requires both accurate and 100% complete coverage of minicpan in order to give answers that are good enough to swap out the original first-generation implementations, and this one relatively approachable bug is making it impossible to reliably reach 100% coverage.
Because this bug also disproportionately impacts Win32, and Archive::Extract is a core module, it is also very important for the July release of Strawberry Perl, as well as the Perl 5.10.1 release.
If anyone out there has a few hours to attack this bug and fix it, your efforts will have a huge knock-on effect on the quality of many other parts of the CPAN ecosystem.
If you are able to help us out, you can find Jos (kane), myself (Alias), or others who can point you in the right direction in #toolchain on irc.perl.org.
I don't do IRC, so I'm asking here instead. Besides, I gather that this will be interesting information for other people who are interested in this project.
How do you propose we fix this?
Use Archive::Tar::Streamed. This can be done by changing the code in CPAN.pm.
Don't use Archive::Tar::Streamed, but instead, incorporate its code into CPAN.pm. (The module code is quite short.)
Modify Archive::Tar so it includes the functionality of Archive::Tar::Streamed.
Fix Archive::Tar so it transparently includes the desired functionality: so when it extracts the files, it simply copies the bytes from the original tar archive to the new file, and they never get fully loaded into memory.
CPAN.pm could remain unchanged if this is the default in Archive::Tar, which I prefer, or if the default behaviour in Archive::Tar remains unchanged (for backward compatibility), simply add a new option flag in the parameters in some method call, so it avoids loading the entire archive into memory.
I personally prefer the latter.
Re:Your vision
Alias on 2009-06-17T13:50:14
Archive::Tar itself ALREADY contains the required functionality according to Jos, in the ->iter interface.
What's missing is that Archive::Extract isn't actually using the streaming API, it's using the in-memory API.
So Archive::Extract needs to recognise when the Archive::Tar version is high enough to support ->iter and then preferentially use that.
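As a rough sketch of that idea (not a patch against Archive::Extract itself), the check could be done by capability rather than by version number, falling back to the in-memory API on older installs. The function names tar_strategy and untar are hypothetical, and the sketch assumes that any Archive::Tar offering iter also offers Archive::Tar::File's extract method.

```perl
use strict;
use warnings;
use Archive::Tar;

# Decide which API to use by asking Archive::Tar what it can do,
# instead of hard-coding a version number.
sub tar_strategy {
    return Archive::Tar->can('iter') ? 'streamed' : 'in-memory';
}

# Extract a tarball into the current directory, preferring the
# streaming iterator when the installed Archive::Tar has one.
sub untar {
    my ($tarball) = @_;
    my @names;

    if ( tar_strategy() eq 'streamed' ) {
        my $next = Archive::Tar->iter($tarball)
            or die "Cannot open $tarball: " . Archive::Tar->error;
        while ( my $file = $next->() ) {
            $file->extract
                or die "Failed to extract " . $file->full_path;
            push @names, $file->full_path;
        }
    }
    else {
        # Older Archive::Tar: fall back to loading the whole archive.
        my $tar = Archive::Tar->new;
        $tar->read($tarball)
            or die "Cannot read $tarball: " . $tar->error;
        my @files = $tar->extract
            or die "Extraction failed: " . $tar->error;
        @names = map { $_->full_path } @files;
    }

    return @names;
}
```

Checking with can('iter') keeps the fallback path alive for people on an older Archive::Tar, which matters for a core module that has to run against whatever happens to be installed.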
Re:Your vision
bart on 2009-06-17T19:28:24
Urm, yeah, apparently I succeeded in skipping over one of the modules, Archive::Extract. I immediately forgot about it after I read your post, likely because I wouldn't have done it that way.
Well, anyway, it's clear now what you expect to be done.