optimizing for the weird case

rjbs on 2006-11-29T03:53:53

Holy cow!

knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail headers.msg 
just started                :   1360    28328
after require File::Slurp   :   2228    28704
after slurping              :  12192    38656
after require Email::Simple :  12248    38656
after construction          : 129368   159816

If you try to build an Email::Simple from a message with 10,000 unique headers, be prepared to give up some RAM. It's not quite as bad if you just have 10,000 values for one header:

knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ perl -I lib readmail oneheader.msg 
just started                :   1360    28328
after require File::Slurp   :   2228    28704
after slurping              :  11040    37504
after require Email::Simple :  11096    37504
after construction          :  87380   114852

Still, this is nuts...

knight!rjbs:~/code/pep/Email-Simple/tags/1.996$ ls -l *head*msg
-rw-r--r-- 1 rjbs rjbs 5092896 Nov 28 08:29 headers.msg
-rw-r--r-- 1 rjbs rjbs 4504001 Nov 28 08:56 oneheader.msg

It should not take 100 MB to store five meg message in an object!

The latest trunk of Email::Simple makes this just a bit better:

knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  12196    38656
after require Email::Simple :  12260    38656
after construction          : 114444   144892

knight!rjbs:~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  11044    37504
after require Email::Simple :  11112    37504
after construction          :  74184   101656

That's hardly a big improvement, though. Reading the message in as a reference, a new feature in the trunk, doesn't help much either: the memory usage isn't in the body storage or copying, it's in the data structures used to store the header. Despite the massive data structure, I decided that my first "simple" fix would be to replace this code:

for (split /$mycrlf/, $$head) {
  #  ... parse the field ...
}

With this "better" code:

my $head_txt = $$head;
while ($head_txt =~ m/\G(.+?)$mycrlf/g) {
   my $line = $1;
   # ... parse the field ...
}

What made it better? Well, position-marking regex are less commonly used, so surely they're an efficiency trick, right? Well, no. Actually, I thought this would help me approximate a lazy split, moving from one record separator to the next as I process each line. The problem is that I apparently can't use a pattern with /g on a dereferenced scalar. I need to make a copy in memory, which takes up a lot of space. In fact, if I'm going to copy it, I might was well just split! So, back to looking for useful improvements.

Email::Simple headers are stored as three data structures, two hashes and an array, which are called the head, the header names, and the order. The head has one entry for each header-name; the value is an arrayref of every value for that header, in the order they should appear in the header. The order a list of header names, including duplicates, in the order in which they appear. The header names, bizarrely, relate the lowercased version of a header to the latest casing of it to appear in the input... this is almost certainly a bug introduced when fixing other problems with the header. Still, it means that this message:

Foo: Alfa
foo: Bravo
Bar: Charlie
FOO: Delta
Baz: Foxtrot
Baz: Gulf
bar: Hotel

is stored like this:

$head = {
  Bar => [ qw(Charlie) ],
  bar => [ qw(Hotel)   ],
  Baz => [ qw(Foxtrot Gulf) ],
  Foo => [ qw(Alfa) ],
  FOO => [ qw(Delta) ],
  foo => [ qw(Bravo) ],
};

$order = [ qw(Foo foo Bar FOO Baz Bar) ];

$header_names = {
  bar => 'Bar',
  baz => 'Baz',
  foo => 'FOO',
}

That's a lot of repeated data! Given all those unique headers in our first test message, we're going to have to store every header name four times: once in the head, once in the order, and twice in the header names.
Given all those repeated headers, we're still going to store the header name 10,003 times -- once for the entry in the head, twice in the header names (once lowercased and once verbatim), and ten thousand times in the order.

I guess it's nice to have a super-fast lookup, but how many headers are there really going to be, in a normal message, that you want to have such heavy duplication for optimization? Is hash lookup really going to be faster in real time than a linear search by name acros a hundred headers? If you have ten thousand headers (for... some reason), would you rather that searching take a few milliseconds longer, if it costs you fifty megs of RAM? My wager, here, is that you'd rather optimize for the common case and save plenty of memory for no noticeable change.

So, I replaced the head, the order, and the header names with "headers." It's just a reference to an array of pairs. The above message is now:

my $headers = [
  Foo => 'Alfa',
  foo => 'Bravo',
  Bar => 'Charlie',
  FOO => 'Delta',
  Baz => 'Foxtrot',
  Baz => 'Gulf',
  bar => 'Hotel',
];

Nice and simple. How does it compare memory-wise?

~/code/pep/Email-Simple/trunk$ perl -I lib readmail headers.msg
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  12196    38656
after require Email::Simple :  12260    38656
after construction          :  57804    82732

~/code/pep/Email-Simple/trunk$ perl -I lib readmail oneheader.msg 
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  11044    37504
after require Email::Simple :  11108    37504
after construction          :  52756    78232

Not great; an eight meg message probably shouldn't take forty megs. Still, savings twenty to sixty megs isn't a bad start. I think I'll cut a development release (which will secretly require the newest Email::MIME, for forward compat reasons) and see what bugs are shaken out.


how this affects normal messages

rjbs on 2006-11-29T03:59:52

I thought I'd mention that this has, predictably, very little effect on normal messages, like those I used in previous profiling. These messages had 50 headers and 10,000 body lines:
~/code/pep/Email-Simple/trunk$ perl -I lib readmail big.msg     227@225624:1086
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  17876    44336
after require Email::Simple :  17940    44336
after construction          :  25776    52152
~/code/pep/Email-Simple/trunk$ perl -I lib readmailref big.msg  227@225652:1087
just started                :   1364    28328
after require File::Slurp   :   2232    28704
after slurping              :  17876    44336
after require Email::Simple :  17940    44336
after construction          :  17960    44336

/o

petdance on 2006-11-29T06:12:54

Change this while ($head_txt =~ m/\G(.+?)$mycrlf/g) { to while ($head_txt =~ m/\G(.+?)$mycrlf/go) { I'm assuming the $mycrlf never changes, right?

Re:/o

rjbs on 2006-11-29T13:47:51

I'm afraid that mycrlf does. It's the line ending that's used in the given email. There are two regex involved: $crlf and $mycrlf. The former matches any valid line ending, the latter matches those which are expected in this header.

If I use $crlf, I should be safe, and that is constant. Switching to use that and enabling /o didn't really help, though.

Here's the better thing, though. It seems that I was wrong in my belief that I couldn't use a /g pattern on a dereferenced string. I don't know why I couldn't get it working yesterday, but now I can. I'm back to using the /g pattern, with no copy of the string, and memory usage is further down:
~/code/pep/Email-Simple/trunk$ perl -I lib  readmail headers.msg
  just started                :   1360    28328
  after require File::Slurp   :   2228    28704
  after slurping              :  12192    38656
  after require Email::Simple :  12256    38656
  after construction          :  48268    66844

Will that really help?

Limbic Region on 2006-11-29T17:39:22

See Questions concerning /o regex modifier for details.

Re:Will that really help?

rjbs on 2006-11-29T18:04:25

In this instance, anyway, it did not. Perhaps it would if I tested under an older perl.

MJD once gave me and/or the internet an explanation of when it helped, and it was a much smaller case than I had previously thought, so now I usually don't even think about it. :-/

continued regexps on scalar references

ChrisDolan on 2006-11-29T15:17:53

The problem is that I apparently can't use a pattern with /g on a dereferenced scalar.

You certainly can. I use this technique extensively in CAM::PDF to incrementally parse a PDF document into a DOM. I pass a scalar reference to the content from one sub to the next.

Without delving too deep into your code, I believe the important bit that you are missing is the "c" flag on the regexp.

Re:continued regexps on scalar references

rjbs on 2006-11-29T15:51:03

Well, I re-added my /g regexp (as noted in another comment) this morning, and it worked, even when matching against $$string. I did not add /c, though. I don't know what I did wrong yesterday!

I didn't know about /c, though. It doesn't appear to be documented in perlre, although it's hard to search for one character things, sometimes. It does appear in perlreref.

I'm glad to know about it, but it should not be an issue here: at no point should the regex fail and then need to match again.

Thanks!

uniting the array and hash in the symbol table

groditi on 2006-11-29T16:38:15

is there any reason why before, when you have the hash and array you didnt just mess a little with the symbol table to effectively make them point to the same variables, almost cutting memory use in half (well there is still the sort order array...) without having to give up your initial strategy for easy lookups?

Re:uniting the array and hash in the symbol table

rjbs on 2006-11-29T17:05:10

What do you suggest "the same variables" are that they should point to?

Re:uniting the array and hash in the symbol table

groditi on 2006-11-29T17:38:31

Sorry. I completely slipped, morning coffee not quite kicked in. my apologies.. i thought you wre doing something completely different

Re: self referencing data structure

markjugg on 2006-11-29T18:12:40

Is it possible to use one big structure, where each of the three structures you want are sub-parts of it?

As you construct the second and third parts, you can reference the first copy:

my %a;
$a{b} = { c => 'd' };
$a{e}[0] = $a{b};
Then, dumping it out, you see:
# Notice the self reference...
$VAR1 = {
          'e' => [
                   {
                     'c' => 'd'
                   }
                 ],
          'b' => $VAR1->{'e'}[0]
        };
Alternatively, perhaps you could keep just one copy of the structure, and have methods to look up value different ways as-needed, without having whole other structures pre-built, if they if they might not be used.

Mark

Re: self referencing data structure

rjbs on 2006-11-29T18:44:25

I looked at doing this, but managing all the references seemed like a colossal pain, and would still use a fair amount of memory over storing one simple structure with less cached lookups.
Alternatively, perhaps you could keep just one copy of the structure, and have methods to look up value different ways as-needed, without having whole other structures pre-built, if they if they might not be used.
That's what I did, effectively. There is on structure, an array of pairs, and methods that let you do the normal things. You can say "give me the values with name Foo" and it does. It just uses a linear search.

Since Email::Simple 2 will have a Header object with a known interface, a more memory-hungry but faster implementation can be trivially substituted.