Bayesian spam filtering

inkdroid on 2002-10-24T16:52:21

I've started using a spam filtering tool written in Perl by Gary Arnold. It is an application of Paul Graham's idea of using Bayesian classification to filter spam.

All you need is a directory of spam emails and a directory of good emails to serve as corpora. These are used once, at install time, to build a BerkeleyDB of statistics on the words found in good and bad email.
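To make the "statistics on words" part concrete, here is a rough Python sketch of that counting step. It is not Gary's Perl code: the function names, the tokenizing regex, and the one-message-per-file layout are all my own assumptions, and it keeps the counts in memory rather than writing them to a BerkeleyDB.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: letters, digits, dashes, apostrophes and dollar signs make up
# a token; everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split a message into lowercase word-like tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def count_tokens(messages):
    """Count token occurrences across an iterable of message strings."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def read_corpus(directory):
    """Yield the text of every regular file in a directory.

    Assumes one message per file (hypothetical layout).
    """
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            yield path.read_text(errors="replace")


def build_stats(spam_dir, good_dir):
    """Return per-token counts and message totals for both corpora."""
    spam = list(read_corpus(spam_dir))
    good = list(read_corpus(good_dir))
    return {
        "spam_counts": count_tokens(spam),
        "good_counts": count_tokens(good),
        "n_spam": len(spam),
        "n_good": len(good),
    }
```

Calling something like `build_stats("corpus/spam", "corpus/good")` (directory names made up) would give you the raw numbers a filter needs. The real tool persists them in a BerkeleyDB so they don't have to be recomputed for every incoming message.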

Once you've got your BerkeleyDB, you add a line to your .qmail file so that incoming messages are filtered through Gary's program, which redirects spam to a particular maildir.
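I haven't shown how the filter actually decides that a message is spam, so here is a minimal Python sketch of the scoring as I understand it from Paul Graham's essay. Treat the details as assumptions rather than a description of Gary's program: the function names are mine, the counts are plain dicts instead of a BerkeleyDB, and the constants (good counts doubled, 0.4 for rarely seen words, the 15 most telling tokens, a 0.9 cutoff) are the ones I recall from the essay.

```python
import math


def token_probability(token, spam_counts, good_counts, n_spam, n_good):
    """Estimate the probability that a message containing `token` is spam."""
    b = spam_counts.get(token, 0)
    g = 2 * good_counts.get(token, 0)  # good counts doubled to bias against false positives
    if g + b < 5:
        return 0.4  # too little evidence: treat as mildly innocent
    spam_ratio = min(1.0, b / max(1, n_spam))
    good_ratio = min(1.0, g / max(1, n_good))
    p = spam_ratio / (good_ratio + spam_ratio)
    return max(0.01, min(0.99, p))  # never allow complete certainty


def spam_score(tokens, spam_counts, good_counts, n_spam, n_good, top_n=15):
    """Combine the most telling token probabilities into one score in [0, 1]."""
    probs = [
        token_probability(t, spam_counts, good_counts, n_spam, n_good)
        for t in set(tokens)
    ]
    # The most "interesting" tokens are those farthest from a neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:top_n]
    prod_spam = math.prod(interesting)
    prod_good = math.prod(1.0 - p for p in interesting)
    return prod_spam / (prod_spam + prod_good)


def is_spam(tokens, spam_counts, good_counts, n_spam, n_good, threshold=0.9):
    """Return True if the combined score exceeds the threshold."""
    return spam_score(tokens, spam_counts, good_counts, n_spam, n_good) > threshold
```

A real filter would run this on each incoming message and deliver to the spam maildir when `is_spam` returns true. The clamping to 0.01 and 0.99 matters: without it a single word seen only in spam would drive the whole product to certainty.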

The cool thing is that you can check your spam mailbox periodically, remove any false positives (if there are any), and then rebuild your BerkeleyDB from the corrected mail. So if you cron the DB rebuild, qmail will magically learn what is spam from how you classify your email.