Dealing with bounces

ethan on 2004-05-29T07:07:40

Judging by the annoyance factor of unwanted mail, bounce messages (caused by some insane worms randomly sending mail with arbitrary From addresses) seem to have overtaken ordinary spam. The problem is that they pass my spam filters, so now I have to take steps.

I figure that it should be possible to get almost flawless detection of those bounces with a specially tailored Bayesian filter. Note that I don't want to use an existing Bayesian filter (the one in SpamAssassin, for example). I would first have to train it on bounces, and I also suspect that real spam and bounces don't have much in common in the words they use.

So what I have started doing now is writing a Bayesian filter for bounces. The first thing I wrote was a flex scanner that detects valid RFC 822 mail addresses. The scanner is fed one message. It opens a pipe to another process (the one that does the actual filtering) and writes the mail to it. The only thing the scanner does is replace every email address it finds in the body with T_MAILADDR or some such. If I read RFC 822 correctly, the rules below should describe a valid email address:

    /* atom: one or more characters that are not specials, SPACE or CTLs */
    atom            [!#$%&'*+/0-9=?A-Za-z^_`{|}~-]+
    dtext           [\x00-\x0C\x0E-\x5A\x5E-\x7F]*
    qtext           [\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]*
    quoted_pair     "\\"[\x00-\x7F]
    quoted_string   "\""({qtext}|{quoted_pair})*"\""
    word            {atom}|{quoted_string}

    domain_literal  "["({dtext}|{quoted_pair})*"]"
    domain_ref      {atom}
    sub_domain      {domain_ref}|{domain_literal}
    domain          {sub_domain}("."{sub_domain})*
    local_part      {word}("."{word})*

    addr_spec       {local_part}"@"{domain}

This should be a huge advantage for a Bayesian filter: instead of every single email address being a word of its own, they all get mapped onto one word.
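To illustrate the substitution step, here is a rough sketch in Python rather than flex. It assumes a simplified pattern that only handles dot-separated atoms on both sides of the "@" (no quoted strings, no domain literals), and the name `normalize_addresses` is made up for this example.

```python
import re

# Simplified addr_spec: dot-separated atoms on both sides of "@".
# Quoted strings and domain literals from the grammar are left out on purpose.
ATOM = r"[!#$%&'*+/0-9=?A-Za-z^_`{|}~-]+"
ADDR_SPEC = re.compile(rf"{ATOM}(?:\.{ATOM})*@{ATOM}(?:\.{ATOM})*")


def normalize_addresses(body, token="T_MAILADDR"):
    """Replace every address-like string in body with a single token."""
    return ADDR_SPEC.sub(token, body)


sample = "Delivery to bob@example.org failed; original sender was alice@example.com."
print(normalize_addresses(sample))
# prints: Delivery to T_MAILADDR failed; original sender was T_MAILADDR.
```

After this step the filter sees one frequent token instead of thousands of distinct addresses that each occur only once.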

The idea behind that is, of course, that bounce messages tend to have a lot of email addresses in their body. Some of them even include whole header fields, so I could extend the scanner to detect those and generate another token for them.

For now I'll prototype the program that the scanner pipes to in Perl and see whether the approach makes any sense at all. If it does, I can rewrite it in C and have a fairly fast Bayesian filter that I can plug into my .procmailrc before SpamAssassin is even triggered.
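As a sketch of what the filtering process could look like, here is a minimal naive Bayes classifier with add-one smoothing, written in Python for illustration instead of the Perl prototype mentioned above. The class name `BounceFilter`, the two labels and the four training messages are all invented for this example. It assumes the input has already had its addresses replaced by T_MAILADDR and that at least one message of each kind has been trained.

```python
import math
import re
from collections import Counter


class BounceFilter:
    """Tiny naive Bayes classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.words = {"bounce": Counter(), "other": Counter()}
        self.docs = Counter()

    @staticmethod
    def tokenize(text):
        # Lowercased runs of letters and underscores; T_MAILADDR stays one token.
        return re.findall(r"[a-z_]+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokenize(text))
        self.docs[label] += 1

    def log_score(self, text, label):
        # Assumes both labels have been trained at least once.
        counts = self.words[label]
        total = sum(counts.values())
        vocab = len(set(self.words["bounce"]) | set(self.words["other"]))
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in self.tokenize(text):
            score += math.log((counts[tok] + 1) / (total + vocab))
        return score

    def is_bounce(self, text):
        return self.log_score(text, "bounce") > self.log_score(text, "other")


# Toy training data, made up for the example.
f = BounceFilter()
f.train("Delivery to T_MAILADDR failed: user unknown", "bounce")
f.train("Undeliverable mail returned to sender T_MAILADDR", "bounce")
f.train("Are we still on for lunch tomorrow", "other")
f.train("Patch attached please review before the release", "other")

print(f.is_bounce("Delivery failed for T_MAILADDR"))
print(f.is_bounce("lunch tomorrow"))
```

A real filter would need far more training mail and probably a threshold instead of a plain comparison, but the structure stays the same.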

Seems Like a Lot of Work

chromatic on 2004-05-29T20:14:57

I check the Return-Path header for <>. That catches most of the bounces.
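A minimal sketch of that check in Python, using the standard `email` module. The function name `has_null_return_path` is invented here, and the sketch assumes the delivering MTA has written the envelope sender into a Return-Path header.

```python
from email import message_from_string


def has_null_return_path(raw_message):
    """Return True if the message carries the null sender "<>" in Return-Path."""
    msg = message_from_string(raw_message)
    return_path = msg.get("Return-Path", "")
    return return_path.strip() == "<>"


bounce = "Return-Path: <>\nSubject: Undelivered Mail\n\nYour message could not be delivered.\n"
print(has_null_return_path(bounce))
```

Messages without any Return-Path header are treated as non-bounces here, so this only helps where the header is actually present.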

Re:Seems Like a Lot of Work

ethan on 2004-05-30T04:16:57
Actually most of my bounces don't have a Return-Path header at all. The attached original message usually has one, but it always contains one of my addresses.

from: ne from_

Juerd on 2004-05-29T20:59:12

To know which bounces I want to keep and which I want to throw away, I started an experiment a while ago to see whether spammers ever use the envelope from that I use. My original plan was that, if I did get spam on that address, I would encode some information in the envelope from to find out how they got hold of it.

But I have still not received a single message on that address that wasn't a bounce that I wanted to read.

The mail address that I use in the headers is juerd@example.com (but with another domain), the envelope from is juerd@c4.example.com. All bounces that I receive that aren't for juerd@c4.example.com (most are for juerd@example.com, obviously) are stored in a separate folder. I'm now confident enough to send them directly to /dev/null, but this is an easy way to get statistics :)
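The routing rule described above can be sketched like this in Python. The function `route_bounce` and the folder names are hypothetical, and the sketch assumes the delivery agent hands over the envelope sender and envelope recipient as plain strings.

```python
def route_bounce(envelope_sender, envelope_recipient,
                 bounce_address="juerd@c4.example.com"):
    """Pick a folder based on the envelope sender and recipient."""
    # A real bounce has the null sender, written "" or "<>".
    if envelope_sender not in ("", "<>"):
        return "inbox"
    # Bounces for the dedicated envelope address answer mail that was really sent.
    if envelope_recipient.lower() == bounce_address:
        return "inbox"
    # Everything else bounced off a forged From address.
    return "forged-bounces"


print(route_bounce("<>", "juerd@c4.example.com"))
print(route_bounce("<>", "juerd@example.com"))
```

Auto-replies sent to the From: address with a non-empty envelope sender fall through to the inbox, which matches the limitation mentioned below.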

There are, unfortunately, auto-replies that aren't bounces. They are sent to the address in the From: header. For that I don't really have a solution.

Re:from: ne from_

vsergu on 2004-05-29T22:33:58
If you send out all your mail with juerd@c4.example.com as the MAIL FROM, then that address will end up in some Received lines in some messages. Some of those messages will end up on computers that get infected with worms. Some of them will end up somewhere on the web, where the address will be harvested by spammers. So soon you'll be getting bounces to that address in response to spam and worms. I guess that means you'll have to change the address periodically.

Of course there are stupid mail systems out there that send bounces to the "From:" address, but they probably don't send them as <> anyway.

Re:from: ne from_

Juerd on 2004-05-31T20:32:22
That's what I thought, but I've been using this for over a year now and I haven't received a single unwanted message at juerd@c4.example.com yet.