Viagra, meet Bayes

dws on 2002-08-17T04:28:34

Paul Graham has written a wonderful article, A Plan for Spam, about using Bayesian techniques for detecting spam. The idea is to cull word frequencies from collections of known spam and non-spam, calculate the likelihood that a given word comes from spam, and then use a Bayesian algorithm to compute a spam-likelihood score for an incoming email. Graham claims that the algorithm is resilient to the type of arms race we're seeing now as spammers try to outwit SpamAssassin.
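As a rough illustration of the idea (in Python rather than Perl, and not the code described in this post), the two steps might look something like the sketch below. The tokenizer, the function names, and every constant (the minimum count of 5, the 0.01/0.99 clamps, the 0.4 default for unseen words, the 15-word cutoff) are assumptions chosen for the example and may not match Graham's article exactly. The combining step is the usual naive Bayes product formula.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def word_probabilities(spam_msgs, good_msgs, min_count=5):
    """Estimate, per word, the probability that a message containing it is spam."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    good_counts = Counter(t for m in good_msgs for t in tokenize(m))
    nspam = max(len(spam_msgs), 1)
    ngood = max(len(good_msgs), 1)
    probs = {}
    for word in set(spam_counts) | set(good_counts):
        b = spam_counts[word]
        g = good_counts[word]
        if b + g < min_count:
            continue  # too rare to say anything useful (assumed cutoff)
        bad_freq = min(1.0, b / nspam)
        good_freq = min(1.0, g / ngood)
        p = bad_freq / (bad_freq + good_freq)
        # Clamp so that no single word is treated as absolute proof.
        probs[word] = max(0.01, min(0.99, p))
    return probs


def spam_score(text, probs, unknown=0.4, top_n=15):
    """Combine the most telling word probabilities into one score in (0, 1)."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    # Keep the words whose probabilities are furthest from a neutral 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    ps = ps[:top_n]
    prod_spam = 1.0
    prod_good = 1.0
    for p in ps:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)
```

A message would then be flagged when `spam_score` exceeds some threshold, for example 0.9. That threshold is also an assumption here, and picking it is a trade-off between missed spam and lost good mail.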

Graham's few example snippets are in Lisp. The transliteration to Perl is so far straightforward. Based on some Q&A today at PerlMonks, others have the same idea. After a few hours of work, I've run word frequency counts on a large pile of spam (I knew there was a reason to save it), and on a large pile of non-spam, and have cranked out a table of probabilities that a word is spam. Now I'm running the algorithm against the saved spam, looking for false negatives. Next up is to run the algorithm against good mail, looking for false positives.
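The checking step described above can be sketched as follows, again in Python and purely as an illustration. The `error_rates` helper, its `score` argument (any function that maps a message to a number between 0 and 1), and the 0.9 threshold are all hypothetical names and values invented for this example.

```python
def error_rates(spam_msgs, good_msgs, score, threshold=0.9):
    """Return (false_negatives, false_positives) for a scoring function.

    A false negative is known spam that scores at or below the threshold;
    a false positive is known good mail that scores above it.
    """
    false_negatives = [m for m in spam_msgs if score(m) <= threshold]
    false_positives = [m for m in good_msgs if score(m) > threshold]
    return false_negatives, false_positives
```

Returning the messages themselves, rather than just counts, makes it easy to read through the misses and see which words fooled the filter. One caveat: scoring the same mail that the probability table was built from will tend to look better than scoring mail the filter has never seen.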

Then comes the decision on how to hook things up. It should be easy to shoehorn into my existing procmail script, or perhaps Mail::Audit is a better platform. Decisions, fun decisions...