Holy cow!
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail headers.msg just started : 1360 28328 after require File::Slurp : 2228 28704 after slurping : 12192 38656 after require Email::Simple : 12248 38656 after construction : 129368 159816
If you try to build an Email::Simple from a message with 10,000 unique headers, be prepared to give up some RAM. It's not quite as bad if you just have 10,000 values for one header:
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail oneheader.msg just started : 1360 28328 after require File::Slurp : 2228 28704 after slurping : 11040 37504 after require Email::Simple : 11096 37504 after construction : 87380 114852
Still, this is nuts...
knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ ls -l *head*msg -rw-r--r-- 1 rjbs rjbs 5092896 Nov 28 08:29 headers.msg -rw-r--r-- 1 rjbs rjbs 4504001 Nov 28 08:56 oneheader.msg
It should not take 100 MB to store five meg message in an object!
The latest trunk of Email::Simple makes this just a bit better:
knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg just started : 1364 28328 after require File::Slurp : 2232 28704 after slurping : 12196 38656 after require Email::Simple : 12260 38656 after construction : 114444 144892 knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg just started : 1364 28328 after require File::Slurp : 2232 28704 after slurping : 11044 37504 after require Email::Simple : 11112 37504 after construction : 74184 101656
That's hardly a big improvement, though. Reading the message in as a reference, a new feature in the trunk, doesn't help much either: the memory usage isn't in the body storage or copying, it's in the data structures used to store the header. Despite the massive data structure, I decided that my first "simple" fix would be to replace this code:
for (split /$mycrlf/, $$head) { # ... parse the field ... }
With this "better" code:
my $head_txt = $$head; while ($head_txt =~ m/\G(.+?)$mycrlf/g) { my $line = $1; # ... parse the field ... }
What made it better? Well, position-marking regex are less commonly used, so
surely they're an efficiency trick, right? Well, no. Actually, I thought this
would help me approximate a lazy split, moving from one record separator to the
next as I process each line. The problem is that I apparently can't use a
pattern with /g
on a dereferenced scalar. I need to make a copy in memory,
which takes up a lot of space. In fact, if I'm going to copy it, I might was
well just split! So, back to looking for useful improvements.
Email::Simple headers are stored as three data structures, two hashes and an array, which are called the head, the header names, and the order. The head has one entry for each header-name; the value is an arrayref of every value for that header, in the order they should appear in the header. The order a list of header names, including duplicates, in the order in which they appear. The header names, bizarrely, relate the lowercased version of a header to the latest casing of it to appear in the input... this is almost certainly a bug introduced when fixing other problems with the header. Still, it means that this message:
Foo: Alfa foo: Bravo Bar: Charlie FOO: Delta Baz: Foxtrot Baz: Gulf bar: Hotel
is stored like this:
$head = { Bar => [ qw(Charlie) ], bar => [ qw(Hotel) ], Baz => [ qw(Foxtrot Gulf) ], Foo => [ qw(Alfa) ], FOO => [ qw(Delta) ], foo => [ qw(Bravo) ], }; $order = [ qw(Foo foo Bar FOO Baz Bar) ]; $header_names = { bar => 'Bar', baz => 'Baz', foo => 'FOO', }
That's a lot of repeated data! Given all those unique headers in our first
test message, we're going to have to store every header name four times: once
in the head, once in the order, and twice in the header names.
Given all those repeated headers, we're still going to store the header name
10,003 times -- once for the entry in the head, twice in the header names (once
lowercased and once verbatim), and ten thousand times in the order.
I guess it's nice to have a super-fast lookup, but how many headers are there really going to be, in a normal message, that you want to have such heavy duplication for optimization? Is hash lookup really going to be faster in real time than a linear search by name acros a hundred headers? If you have ten thousand headers (for... some reason), would you rather that searching take a few milliseconds longer, if it costs you fifty megs of RAM? My wager, here, is that you'd rather optimize for the common case and save plenty of memory for no noticeable change.
So, I replaced the head, the order, and the header names with "headers." It's just a reference to an array of pairs. The above message is now:
my $headers = [ Foo => 'Alfa', foo => 'Bravo', Bar => 'Charlie', FOO => 'Delta', Baz => 'Foxtrot', Baz => 'Gulf', bar => 'Hotel', ];
Nice and simple. How does it compare memory-wise?
~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg just started : 1364 28328 after require File::Slurp : 2232 28704 after slurping : 12196 38656 after require Email::Simple : 12260 38656 after construction : 57804 82732 ~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg just started : 1364 28328 after require File::Slurp : 2232 28704 after slurping : 11044 37504 after require Email::Simple : 11108 37504 after construction : 52756 78232
Not great; an eight meg message probably shouldn't take forty megs. Still, savings twenty to sixty megs isn't a bad start. I think I'll cut a development release (which will secretly require the newest Email::MIME, for forward compat reasons) and see what bugs are shaken out.
~/code/pep/Email-Simple/trunk$ perl -I lib readmail big.msg 227@225624:1086
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 17876 44336
after require Email::Simple: 17940 44336
after construction: 25776 52152
~/code/pep/Email-Simple/trunk$ perl -I lib readmailref big.msg 227@225652:1087
just started: 1364 28328
after require File::Slurp : 2232 28704
after slurping: 17876 44336
after require Email::Simple: 17940 44336
after construction: 17960 44336
while ($head_txt =~ m/\G(.+?)$mycrlf/g) {
to
while ($head_txt =~ m/\G(.+?)$mycrlf/go) {
I'm assuming the $mycrlf never changes, right?
Re:/o
rjbs on 2006-11-29T13:47:51
I'm afraid that mycrlf does. It's the line ending that's used in the given email. There are two regex involved: $crlf and $mycrlf. The former matches any valid line ending, the latter matches those which are expected in this header.
If I use $crlf, I should be safe, and that is constant. Switching to use that and enabling/o didn't really help, though.
Here's the better thing, though. It seems that I was wrong in my belief that I couldn't use a/g pattern on a dereferenced string. I don't know why I couldn't get it working yesterday, but now I can. I'm back to using the /g pattern, with no copy of the string, and memory usage is further down: ~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg
just started: 1360 28328
after require File::Slurp : 2228 28704
after slurping: 12192 38656
after require Email::Simple: 12256 38656
after construction: 48268 66844 Will that really help?
Limbic Region on 2006-11-29T17:39:22
See Questions concerning/o regex modifier for details. Re:Will that really help?
rjbs on 2006-11-29T18:04:25
In this instance, anyway, it did not. Perhaps it would if I tested under an older perl.
MJD once gave me and/or the internet an explanation of when it helped, and it was a much smaller case than I had previously thought, so now I usually don't even think about it.:-/
Re:continued regexps on scalar references
rjbs on 2006-11-29T15:51:03
Well, I re-added my/g regexp (as noted in another comment) this morning, and it worked, even when matching against $$string. I did not add /c, though. I don't know what I did wrong yesterday!
I didn't know about/c, though. It doesn't appear to be documented in perlre, although it's hard to search for one character things, sometimes. It does appear in perlreref.
I'm glad to know about it, but it should not be an issue here: at no point should the regex fail and then need to match again.
Thanks!
Re:uniting the array and hash in the symbol table
rjbs on 2006-11-29T17:05:10
What do you suggest "the same variables" are that they should point to?Re:uniting the array and hash in the symbol table
groditi on 2006-11-29T17:38:31
Sorry. I completely slipped, morning coffee not quite kicked in. my apologies.. i thought you wre doing something completely differentRe: self referencing data structure
markjugg on 2006-11-29T18:12:40
Is it possible to use one big structure, where each of the three structures you want are sub-parts of it?As you construct the second and third parts, you can reference the first copy:
Then, dumping it out, you see:my %a;
$a{b} = { c => 'd' };
$a{e}[0] = $a{b};Alternatively, perhaps you could keep just one copy of the structure, and have methods to look up value different ways as-needed, without having whole other structures pre-built, if they if they might not be used.# Notice the self reference...
$VAR1 = {
'e' => [
{
'c' => 'd'
}
],
'b' => $VAR1->{'e'}[0]
};Mark
Re: self referencing data structure
rjbs on 2006-11-29T18:44:25
I looked at doing this, but managing all the references seemed like a colossal pain, and would still use a fair amount of memory over storing one simple structure with less cached lookups.Alternatively, perhaps you could keep just one copy of the structure, and have methods to look up value different ways as-needed, without having whole other structures pre-built, if they if they might not be used.That's what I did, effectively. There is on structure, an array of pairs, and methods that let you do the normal things. You can say "give me the values with name Foo" and it does. It just uses a linear search.
Since Email::Simple 2 will have a Header object with a known interface, a more memory-hungry but faster implementation can be trivially substituted.