I got this idea from a thread on comp.lang.perl.misc (and notably some encouragement from Janek Schleicher, who considered the idea very cool): creating a module that allows data to be stored zlib-compressed.
I initially thought this could be done via tie(), but Perl's tying interface is too limited to do that effectively. For scalars you only have FETCH() and STORE(). This defeats the purpose of compression, for example in the following code:
$string = "string" x 1_000_000;
print substr $string, 1, 1;
Obviously, with tie() this would uncompress the whole string in memory just to return a single character. It would also be very slow.
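To make the problem concrete, here is a minimal sketch of what a tied, zlib-backed scalar might look like. The class name Tie::ZlibScalar and its fetches counter are made up for this illustration, and it assumes Compress::Zlib is installed. Every read, even a one-character substr, has to go through FETCH, which inflates everything:

```perl
use strict;
use warnings;

# Hypothetical tied-scalar class, for illustration only.
package Tie::ZlibScalar;
use Compress::Zlib qw(compress uncompress);

sub TIESCALAR {
    my ($class) = @_;
    return bless { buf => compress(''), fetches => 0 }, $class;
}

# Every assignment compresses the complete new value.
sub STORE {
    my ($self, $value) = @_;
    $self->{buf} = compress($value);
}

# Every read inflates the complete value; tie offers no partial access.
sub FETCH {
    my ($self) = @_;
    $self->{fetches}++;
    return uncompress($self->{buf});
}

package main;

my $obj = tie my $string, 'Tie::ZlibScalar';
$string = "string" x 1_000_000;

# substr only needs one byte, but FETCH hands back all 6,000,000 of them.
my $char = substr $string, 1, 1;

printf "char=%s fetches=%d compressed=%d bytes\n",
    $char, $obj->{fetches}, length $obj->{buf};
```

The compressed buffer is small, but the full string is materialised again on each access, so the memory saving disappears exactly when the data is used.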
The obvious solution therefore is (apart from adding SUBSTR and all the other string operators to the tie interface) a class of its own with a little bit of overloading of "", .= etc.
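As a rough sketch of that class-plus-overloading approach (not the actual module): the package name Sketch::Overloaded and its methods are invented for this example, and it keeps a single compressed buffer for simplicity, assuming Compress::Zlib is available.

```perl
use strict;
use warnings;

# Hypothetical class, for illustration only.
package Sketch::Overloaded;
use Compress::Zlib qw(compress uncompress);

use overload
    # Stringification inflates the data on demand.
    '""' => sub { $_[0]->get },
    # .= appends and hands back the same object, so it stays compressed.
    '.=' => sub { my ($self, $other) = @_; $self->append("$other"); return $self },
    fallback => 1;

sub new {
    my ($class, $data) = @_;
    my $self = bless { buf => undef }, $class;
    return $self->store(defined $data ? $data : '');
}

sub store {
    my ($self, $data) = @_;
    $self->{buf} = compress($data);
    return $self;
}

sub get {
    my ($self) = @_;
    return uncompress($self->{buf});
}

# Naive append: inflate, concatenate, compress again.
sub append {
    my ($self, $more) = @_;
    return $self->store($self->get . $more);
}

package main;

my $s = Sketch::Overloaded->new("hallo");
$s .= " welt";
print "$s\n";          # goes through the overloaded ""
print ref($s), "\n";   # still an object after .=
```

The point is only the interface: operations such as substr would become explicit methods that can decide for themselves how much to inflate, which tie() cannot express.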
It sounds far more trivial than it is, as I came to realize. I started hacking away at the XS part until I could at least store and get the data. The string becomes a linked list of buffers: the original large string is divided into pieces of CHUNK_SIZE bytes, and each piece is compressed into one of those buffers. After that I was eager to do a little benchmark:
use Benchmark qw(cmpthese);

my $uncompressed;
my $compressed = String::Compress->new;
cmpthese(-2, {
    compressed => sub {
        $compressed->store("hallo" x 1023);
        my $d = $compressed->get;
    },
    uncompressed => sub {
        $uncompressed = "hallo" x 1023;
        my $d = $uncompressed;
    },
});
Urmmh, here's the embarrassing part now:
compressed: 5 wallclock secs ( 1.02 usr + 1.18 sys = 2.20 CPU) @ 509.09/s (n=1120)
uncompressed: 4 wallclock secs ( 2.05 usr + 0.00 sys = 2.05 CPU) @ 40707.32/s (n=83450)
                Rate   compressed uncompressed
compressed     509/s           --         -99%
uncompressed 40707/s        7896%           --
So it's slightly slower.
On the other hand, "hallo" x 1_000_000 stored this way eats about half the memory an ordinary Perl scalar would need. Increasing CHUNK_SIZE to a really large value such as 500_000 (it's just 4096 right now) could probably drop that further to less than 10 KB (for a repetitive string like the above only, of course).
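The chunked layout described earlier can be sketched in pure Perl. This is not the XS code: it uses a plain array instead of a linked list, the package name Sketch::ChunkedString and the slice method are invented for illustration, and it assumes Compress::Zlib is installed. CHUNK_SIZE is the 4096 mentioned above.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for the chunked storage idea.
package Sketch::ChunkedString;
use Compress::Zlib qw(compress uncompress);

use constant CHUNK_SIZE => 4096;

sub new {
    my ($class) = @_;
    return bless { chunks => [], length => 0 }, $class;
}

# Split the data into CHUNK_SIZE pieces and compress each one separately.
sub store {
    my ($self, $data) = @_;
    $self->{chunks} = [];
    $self->{length} = length($data);
    for (my $pos = 0; $pos < length($data); $pos += CHUNK_SIZE) {
        my $piece = substr($data, $pos, CHUNK_SIZE);
        push @{ $self->{chunks} }, compress($piece);
    }
    return $self;
}

# Full read: inflate every chunk and join the results.
sub get {
    my ($self) = @_;
    return join '', map { uncompress($_) } @{ $self->{chunks} };
}

# Read-only substr replacement: inflate only the chunks the range touches.
sub slice {
    my ($self, $offset, $len) = @_;
    return '' if $len <= 0 || $offset < 0 || $offset >= $self->{length};
    my $end = $offset + $len;
    $end = $self->{length} if $end > $self->{length};
    my $first = int($offset / CHUNK_SIZE);
    my $last  = int(($end - 1) / CHUNK_SIZE);
    my $part  = join '', map { uncompress($_) } @{ $self->{chunks} }[$first .. $last];
    return substr($part, $offset - $first * CHUNK_SIZE, $end - $offset);
}

package main;

my $s = Sketch::ChunkedString->new->store("hallo" x 1_000_000);

my $stored = 0;
$stored += length($_) for @{ $s->{chunks} };
printf "%d bytes stored for %d bytes of data\n", $stored, $s->{length};

print $s->slice(1, 1), "\n";   # inflates a single 4096-byte chunk
```

A larger CHUNK_SIZE gives zlib more context per chunk, at the price of inflating more data for each small read, so the constant is a trade-off between memory use and access cost.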
But my actual concern is something else: I reimplement the string operators as methods, which is at least feasible for things like chomp, substr etc. But what about regular expressions? I'd need to reimplement Perl's RE engine (working on small, segmented, compressed strings which together form one large string!). I think I'll leave that to someone else (Janek perhaps :-).
Re:Magic
ethan on 2003-02-19T14:49:35
I wonder whether it's possible to implement this using magic. Look up PERL_MAGIC_uvar in the perlguts manpage to see what I mean.
I am not sure whether the U magic is powerful enough. The ufuncs struct simply contains a pointer to a get and a set function. The third member, uf_index, is just an IV that doesn't seem to be used for anything other than as an identifier (that is what grepping through the 5.8.0 sources suggests).
There is a whole lot more magic available, but I am not sure whether I am supposed to diddle with it. For instance, there is vtbl_substr. So am I allowed to take an SV and decorate it with my own custom MAGIC structures? And then add more through MAGIC->mg_moremagic if I so wish?
I am rather reluctant to do that because Perl's magic is so thinly documented. perlguts seems to imply that only 'U' and '~' are available for extensions.
But if I am free to roll my own set of magics and attach it to an SV, it would be much cooler, since then a scalar could really be used like an ordinary variable without the limitations of tie().