Bayesian filtering code to play with

dws on 2002-08-19T20:14:42

Several people have asked, so here (link is forever 404, sorry) is a partial implementation of Paul Graham's Bayesian spam filtering algorithm, suitable for experimenting with. I'm currently reworking the final filter so that it doesn't slurp the entire token weighting table into memory (which is great when you want to run it against an mbox, but is overkill for testing a single email).
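For anyone who hasn't read "A Plan for Spam", here is a minimal sketch of the scoring step being implemented. The function name, the `token_prob` table, and the 0.4 default for unknown tokens are illustrative, taken from Graham's write-up rather than from the linked code: the fifteen tokens whose probabilities sit farthest from neutral are combined naive-Bayes style.

```python
def spam_score(tokens, token_prob, unknown=0.4, top_n=15):
    """Combine per-token spam probabilities into one message score.

    A sketch of Graham's combining rule; token_prob maps a token to
    its spam probability in (0, 1), and tokens missing from the table
    get a mildly hammy default.
    """
    probs = [token_prob.get(t, unknown) for t in set(tokens)]
    # Keep only the most "interesting" tokens: farthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    prod = 1.0   # product of p_i
    inv = 1.0    # product of (1 - p_i)
    for p in probs:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)
```

With one strongly spammy token the score is dominated by it: a lone token at 0.99 yields a message score of 0.99.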

See the README for some ideas on how to expand this into something usable.


Similar code...

Matts on 2002-08-20T07:01:30

I also posted very similar code to do this to the SpamAssassin-Talk mailing list yesterday, in case anyone is interested in a slightly different encoding of the algorithm.

I use my own mail parser class that doesn't hold the message in memory (it spools to temp files instead), and decodes all the MIME structure for you. Might be worth checking out too, in case anyone is interested.

We'll probably plug this into SA 2.41+ or SA3 (whichever comes first).

Re:Similar code...

dws on 2002-08-20T07:38:21

If you've got SpamAssassin covered, I'll keep going on a Mail::Audit plugin (which also handles MIME). I've reworked the algorithm to scan the weighting file after tokenizing the message body. No more sucking everything into memory.

By the way, a simple tokenizer tweak cut my false negatives in half. I only force a token to lowercase if at least one character is already lowercase. This has the effect of keeping separate (high) weights for "MILLION" and "EMAILS", distinct from the low weights for "million" and "emails". Thus "25 MILLION EMAILS" scores a lot higher. The downside is that the weighting file got 20% larger.

Re:Similar code...

Matts on 2002-08-20T15:20:19

I've reworked mine to use SQLite. Seems to work well. Database is 17MiB though, so I think I need to investigate other options too.

The upper/lower case thing didn't make one squat of a difference for me.

Re:Similar code...

dws on 2002-08-20T20:58:48

Are you sure you're tokenizing consistently? I got bit by that once.

When querying, are you going after tokens one at a time, batching up requests using IN (), or trying to get them all at once using a JOIN?

Re:Similar code...

Matts on 2002-08-21T06:36:39

Inserting into a temporary table, and using a LEFT OUTER JOIN.
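For anyone following along, that approach looks roughly like the sketch below, shown here with Python's sqlite3 for brevity (the `weights` table schema and function name are assumptions). The message's tokens go into a temporary table, and one LEFT OUTER JOIN pulls every weight back in a single query; tokens absent from the weight table come back with a NULL probability.

```python
import sqlite3

def fetch_weights(db, tokens):
    """Fetch weights for all tokens in one query via a temp table.

    Assumes a "weights" table of (token TEXT, prob REAL). Tokens not
    present in the weights table map to None in the result.
    """
    db.execute("CREATE TEMP TABLE msg_tokens (token TEXT PRIMARY KEY)")
    db.executemany("INSERT OR IGNORE INTO msg_tokens VALUES (?)",
                   [(t,) for t in tokens])
    rows = db.execute(
        "SELECT m.token, w.prob FROM msg_tokens m "
        "LEFT OUTER JOIN weights w ON w.token = m.token").fetchall()
    db.execute("DROP TABLE msg_tokens")
    return dict(rows)
```

Compared with querying token by token, this costs one round trip instead of hundreds, and unlike a big IN () list it doesn't hit any limit on the number of bound parameters.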