People may be wondering why I've not been doing much in the world of XML lately; it's because I've been busy with my anti-spam stuff. At work I'm developing their next-generation anti-spam solution (at the moment it's just based on enabling one or more realtime block lists). Most of it is based on SpamAssassin, but I'm working on some new stuff using Bayesian probability, which is pretty interesting. At the moment the Bayes part alone is catching about 85-90% of spam, and I'll be combining it with the SpamAssassin rules to get an even higher catch rate; I think I can push it a bit higher with some tuning, too. Plus someone I know from #axkit is trying to talk me into using Bayesian neural nets, but I'll have to see about that - it's already at the point where my brain is cracking under the strain!
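For the curious, the heart of the Bayes part is just combining per-token spam probabilities into an overall probability for the message. Roughly like this (a minimal sketch - the token table and the 0.4 default for unseen tokens are invented for illustration, not the real training data or code):

    use strict;
    use warnings;

    # Sketch of the naive Bayesian combination. %prob maps a token to the
    # probability that a message containing it is spam; both the table and
    # the 0.4 default for unseen tokens are made up for illustration.
    my %prob = (
        viagra  => 0.99,
        free    => 0.85,
        invoice => 0.60,
        perl    => 0.01,
    );

    sub spam_probability {
        my ($text) = @_;
        my ($spam, $ham) = (1, 1);
        for my $token (map { lc } $text =~ /(\w+)/g) {
            my $p = exists $prob{$token} ? $prob{$token} : 0.4;
            $spam *= $p;
            $ham  *= 1 - $p;
        }
        # probability the whole message is spam, 0..1
        return $spam / ($spam + $ham);
    }

    printf "%.3f\n", spam_probability("Get your free viagra today");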
Perhaps the largest part of this work has been in doing improved email parsing. We have a fantastic email parser at work, but I wanted to do it in Perl. I already had some old code lying around, so I basically improved on that.
So why didn't I use some other CPAN module? Well several reasons:
1. They all seem to parse emails entirely in RAM. We receive multi-megabyte attachments, and the email parsing modules start to suck up gobs of RAM when they encounter these (has this changed since I last checked?). The one I wrote uses temp files for everything, including the email body (see the sketch after this list).
2. They don't make any effort to decode the content from the given encoding to UTF-8. Mine decodes everything to UTF-8. Maybe this has also changed since last time I checked.
3. Mine attempts to act like an email client in the way it decodes things.
4. I wanted to do it myself as an exercise. Your own code is always easier to hack on than someone else's.
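To sketch what I mean by points 1 and 2 (illustration only, not the real module - it assumes a base64-encoded text part and leans on File::Temp, MIME::Base64 and Encode just to show the shape of the idea):

    use strict;
    use warnings;
    use File::Temp   qw(tempfile);
    use MIME::Base64 qw(decode_base64);
    use Encode       qw(decode encode);

    # Hypothetical helper: stream one base64-encoded text part from the
    # message filehandle into a temp file, converting from its declared
    # charset to UTF-8 on the way through, so the whole part never has to
    # sit in RAM. Real code has to cope with other transfer encodings,
    # characters split across chunks, broken charset labels, and so on.
    sub spool_part_to_tempfile {
        my ($in_fh, $charset, $boundary) = @_;
        my ($out_fh, $filename) = tempfile();
        while (my $line = <$in_fh>) {
            last if $line =~ /^--\Q$boundary\E/;    # end of this MIME part
            my $text = decode($charset, decode_base64($line));
            print $out_fh encode('utf-8', $text);
        }
        close $out_fh or die "close: $!";
        return $filename;                           # caller reopens as needed
    }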
Anyway, if anyone wants the code, I'd be willing to consider releasing it under a private namespace. Let me know if there's any interest whatsoever. I still need to do some more testing on it - I've got 20,000 emails from the last couple of days' traffic on one of our servers to run it through. If it can parse all of those, I think I'll have pretty good coverage.
I've written a mailbox iterator a couple of times. Every time I finish, I ask myself if it's something worth releasing, and more often than not, the answer I come up with is "no".
The technique I use is to get an open filehandle for a mailbox (good for "zcat mbox.gz |"), and then load up one message at a time, stopping at the next message or the end of file. All that really boils down to is treating a line that matches /^From / as the start of the next message.
Once that's done, it's a simple matter of shoving that scalar at Graham's mail parser and calling it a day. But the kind of stuff I do with email doesn't get into attachments or charsets (yet).
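Roughly, the loop looks something like this (a from-memory sketch rather than the actual code, with MailTools' Mail::Internet standing in for Graham's parser):

    use strict;
    use warnings;
    use Mail::Internet;

    # Read one message at a time from an already-open filehandle (so
    # "zcat mbox.gz |" works), treating a /^From / line as the start of
    # the next message. $$pending carries that separator line over to
    # the next call.
    sub next_message {
        my ($fh, $pending) = @_;
        my $text = defined $$pending ? $$pending : '';
        $$pending = undef;
        while (my $line = <$fh>) {
            if ($line =~ /^From / && length $text) {
                $$pending = $line;
                last;
            }
            $text .= $line;
        }
        return length $text ? $text : undef;
    }

    open my $fh, "zcat mbox.gz |" or die "zcat: $!";
    my $pending;
    while (defined(my $text = next_message($fh, \$pending))) {
        my $msg = Mail::Internet->new([ split /^/m, $text ]);
        print $msg->head->get('Subject') || "(no subject)\n";
    }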
What I'd like to see is a mail parsing library in C. The few times I've started one of these projects, I've been amazed at how fast Mutt plows through a mailbox, and how long it takes Mail::* to do the same thing.
Re:Parsing Email
Matts on 2002-04-30T19:13:09
Actually I did see one on freshmeat just the other day... Ah yes, there it is. It looks quite a bit like our parser at work (it probably doesn't support as many freaky fringe conditions as ours does, but most people don't need that).
I should also do some timing on mine to see how fast it is. I imagine mutt is fast simply because it punts scanning the email until "later", so it would be really tricky to compare its speed to something aimed at parsing a single email.
I know what Bayesian networks are, and I know what neural networks are, but I don't know what Bayesian neural networks are.
That said, I just took off all day Monday so I could spend all night Sunday writing a pure-Perl implementation of a multilayer feedforward neural network with a backpropagation training algorithm, using PDL. This is yours for the asking, if it's useful and you want it. (I'm speculating Bayesian neural networks are going to be so different that nothing here would be useful.)
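If it helps to picture it, the technique itself boils down to something like this toy (a plain-Perl sketch of a tiny network learning XOR, not the PDL code I'm offering):

    use strict;
    use warnings;

    # Toy 2-3-1 feedforward network trained by backpropagation to learn
    # XOR - just to show the shape of the idea.
    my @train    = ([0,0,0], [0,1,1], [1,0,1], [1,1,0]);   # [x1, x2, target]
    my $lr       = 0.5;
    my $n_hidden = 3;

    # weights: each hidden unit has [w_x1, w_x2, bias]; the output unit
    # has one weight per hidden unit plus a bias in the last slot.
    my @w_hidden = map { [ map { rand() - 0.5 } 1 .. 3 ] } 1 .. $n_hidden;
    my @w_out    = map { rand() - 0.5 } 0 .. $n_hidden;

    sub sigmoid { 1 / (1 + exp(-$_[0])) }

    sub forward {
        my ($x1, $x2) = @_;
        my @h = map { sigmoid($_->[0] * $x1 + $_->[1] * $x2 + $_->[2]) } @w_hidden;
        my $sum = $w_out[$n_hidden];                        # output bias
        $sum += $w_out[$_] * $h[$_] for 0 .. $n_hidden - 1;
        return (sigmoid($sum), @h);
    }

    for my $epoch (1 .. 20_000) {
        for my $case (@train) {
            my ($x1, $x2, $target) = @$case;
            my ($out, @h) = forward($x1, $x2);

            # output delta, then hidden deltas (using the old output weights)
            my $d_out = ($target - $out) * $out * (1 - $out);
            my @d_h = map { $w_out[$_] * $d_out * $h[$_] * (1 - $h[$_]) } 0 .. $n_hidden - 1;

            # gradient-descent weight updates
            $w_out[$_] += $lr * $d_out * $h[$_] for 0 .. $n_hidden - 1;
            $w_out[$n_hidden] += $lr * $d_out;
            for my $i (0 .. $n_hidden - 1) {
                $w_hidden[$i][0] += $lr * $d_h[$i] * $x1;
                $w_hidden[$i][1] += $lr * $d_h[$i] * $x2;
                $w_hidden[$i][2] += $lr * $d_h[$i];
            }
        }
    }

    for my $case (@train) {
        my ($x1, $x2) = @$case;
        my ($out) = forward($x1, $x2);
        printf "%d xor %d => %.2f\n", $x1, $x2, $out;
    }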
Re:Bayesian neural nets
Matts on 2002-04-30T22:20:53
Could potentially be very useful, if it's easy to use. matt@sergeant.org if you want to send it along. I would of course be using it in a non-free software project.
I think Bayesian neural networks basically use Bayes' theory of probability to determine the network, rather than simpler training algorithms. But I'm guessing - I haven't read up on it yet.
If anyone's interested, I'm now getting about 95% accuracy on spam detection and about 90% accuracy on non-spam detection (the system tells me whether it thinks a message is spam, not spam, or that it doesn't know). I get about a 1% error rate - the remaining percentage is emails the system couldn't classify either way, which is fine for my purposes. So not too bad. 1% is still a bit too much (we get about 7 false positives per million with our anti-virus stuff, so I've got a lot to live up to!), so hopefully by combining it with the other stuff I do we can reduce that somewhat. Also, this is based on an actual live feed off one of our servers; if you use your own email as a training set, especially geek-type stuff, it's much more accurate.
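For completeness: the spam / not spam / don't know decision is just a matter of thresholding the combined probability. The cut-offs below are invented for illustration, not my actual tuning:

    # Hypothetical cut-offs, purely for illustration - not the real tuning.
    sub classify {
        my ($p) = @_;           # combined spam probability, 0..1
        return 'spam'     if $p >= 0.90;
        return 'not spam' if $p <= 0.10;
        return 'unknown';       # the "doesn't know" bucket
    }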