Parsing Fast

cwest on 2004-05-28T03:12:30

While writing Email::Address I learned something. Originally I implemented it using Parse::RecDescent. I built up a complete grammar with all sorts of actions and eventually produced a wonderful parse tree. It was correct, down to infinitely nested comments. It was also very slow. I couldn't get it on par with Mail::Address, so I had to ditch the grammar.

My next thought was a home-grown tokenizer, like Mail::Internet has, but that sounded dirty. I decided to try regexes, and that worked out. There are some limitations, such as a lack of recursion (without doing some very ugly black magic). I was worried about having to resort to some lame arbitrary level of nesting support, with no way to make it configurable. I knew I had to pick a limit, I just didn't want it to be a hard limit. Compiled regular expressions to the rescue. Because comments are nested, the comment content depends on the comment expression, and vice versa. So I compile each one from the other in a loop, and every pass through the loop adds one more level of nesting. This got me my nested-structure regular expression cheaply.

my ($ccontent, $comment) = ('')x2;
for (1 .. $COMMENT_NEST_LEVEL) {
   $ccontent       = qr/$ctext|$quoted_pair|$comment/;
   $comment        = qr/\s*\((?:\s*$ccontent)*\s*\)\s*/;
}
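
To see the effect of the loop limit in isolation, here is a minimal sketch. The definitions of $ctext and $quoted_pair below are simplified stand-ins I'm assuming for illustration, and comment_regex is a hypothetical helper, not part of the module. The comment pattern is written so that whitespace before the closing paren is optional.

```perl
use strict;
use warnings;

# Assumed stand-ins for the building blocks; the real ones may differ.
my $ctext       = qr/(?>[^()\\]+)/;  # run of chars other than parens and backslash
my $quoted_pair = qr/\\./;           # backslash followed by any one character

# Hypothetical helper: build a comment regex that accepts up to $depth
# levels of nesting by compiling each pattern from the previous one.
sub comment_regex {
    my ($depth) = @_;
    my ($ccontent, $comment) = ('') x 2;
    for (1 .. $depth) {
        $ccontent = qr/$ctext|$quoted_pair|$comment/;
        $comment  = qr/\s*\((?:\s*$ccontent)*\s*\)\s*/;
    }
    return $comment;
}

my $shallow = comment_regex(1);
my $deep    = comment_regex(3);

# Report which inputs each regex accepts as a whole comment.
for my $input ('(a)', '(a (b))', '(a (b (c)))') {
    printf "%-12s depth 1: %-3s depth 3: %s\n", $input,
        ($input =~ /\A$shallow\z/ ? 'yes' : 'no'),
        ($input =~ /\A$deep\z/    ? 'yes' : 'no');
}
```

With a limit of 1 only the flat comment is accepted as a whole; with a limit of 3 all three inputs are.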

So, rock on. Set $COMMENT_NEST_LEVEL as high as you like for deeper nesting. Now, an excerpt from the docs about the speed I was able to get.

On my 877MHz 12" Apple PowerBook I can run the distributed benchmarks and get results like this:

$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 5 
               s/iter  Mail::Address Email::Address
Mail::Address    2.44             --           -64%
Email::Address  0.884           176%             --
$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 25
               s/iter  Mail::Address Email::Address
Mail::Address    2.45             --           -73%
Email::Address  0.652           276%             --
$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 50
               s/iter  Mail::Address Email::Address
Mail::Address    2.43             --           -76%
Email::Address  0.585           316%             --
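
If the percentage columns are hard to read: the tables look like what the core Benchmark module's cmpthese prints, where each cell compares the row's rate to the column's rate. Assuming that is how they were produced, this small sketch reproduces the percentages of the first table from its s/iter column. The percent_vs helper is hypothetical.

```perl
use strict;
use warnings;

# Seconds per iteration, copied from the first table.
my %s_per_iter = (
    'Mail::Address'  => 2.44,
    'Email::Address' => 0.884,
);

# Hypothetical helper: how much faster (+) or slower (-) $row is than $col,
# as a whole-number percentage of rates (iterations per second).
sub percent_vs {
    my ($row, $col) = @_;
    my $rate_row = 1 / $s_per_iter{$row};
    my $rate_col = 1 / $s_per_iter{$col};
    return sprintf '%.0f', 100 * ($rate_row / $rate_col - 1);
}

printf "Email::Address vs Mail::Address: %s%%\n",
    percent_vs('Email::Address', 'Mail::Address');
printf "Mail::Address vs Email::Address: %s%%\n",
    percent_vs('Mail::Address', 'Email::Address');
```

So 176% in the first table means Email::Address got through the corpus about 2.76 times as fast as Mail::Address on that run.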

Posted from caseywest.com.