I've been getting an increasing amount of foreign-language spam lately - foreign as in German, not the typical Asian variants. ordinarily I wouldn't notice that my spam volume is increasing, but these have been getting past my filter and ending up in my inbox, I suspect because I haven't taught my filters to pick up on the non-English tokens.
do people with .de addresses get more native language spam?
this got me thinking - why do I have to train my filters at all? I mean, certainly there is enough spam (English and non-English) floating around the world that a decent corpus could be assembled, regularly added to, and made available. this would have a number of advantages, like a larger corpus for more accurate results. it would also enable SA users to react to new spam forms more quickly than they could on their own - users contribute spam consistently and download a new database regularly and *poof* you have one killer spam-fighting machine.
Re:razor
geoff on 2004-06-10T13:34:52
yeah, I knew it was a good idea:) but really what I want is not another plugin I need to administer.
what I originally thought about was it would be cool to take a German friend and add the results from his sa-learn sessions to my own, but I'm not sure if there is a routine for merging databases like that or not. making it a globally shared database was just the logical extension.
Re:razor
rjbs on 2004-06-10T13:44:22
--dump and --import
sa-learn can --dump its database and --import other databases. --import suggests that it is for old formats, but I'd guess (with no evidence) that it works with current formats. It also says it clobbers the current DB_FILE, but I'd guess (with no evidence) that it could be rewritten to allow merging.
Then you just need CBAN.
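For what it's worth, the merging step itself could be sketched roughly like this. This is not how sa-learn stores or imports anything; it assumes each database has already been reduced to a plain mapping of token to (spam count, ham count), and the function name and sample tokens are made up for illustration.

```python
def merge_token_counts(mine, theirs):
    """Combine two token -> (spam_count, ham_count) mappings by adding counts.

    Hypothetical helper: the dict layout is an assumption, not sa-learn's format.
    """
    merged = dict(mine)
    for token, (spam, ham) in theirs.items():
        old_spam, old_ham = merged.get(token, (0, 0))
        merged[token] = (old_spam + spam, old_ham + ham)
    return merged


# Made-up sample data: my own counts plus a German-speaking friend's counts.
mine = {"viagra": (40, 0), "perl": (1, 120)}
friend = {"viagra": (25, 0), "kaufen": (30, 2), "perl": (0, 15)}

merged = merge_token_counts(mine, friend)
print(merged)
```

A real merge would presumably also have to add up the total number of spam and ham messages each database was trained on, since the per-token counts only mean something relative to those totals.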
The problem with a universal corpus is that one person's spam is another person's ham.
Just to pick a random example, almost any message received by you or me that is written in German is going to be spam. My command of the language doesn't go much deeper than "wo ist der Bierhaus, fraulein?", so anyone sending me an email in German is probably trying to sell me something. So for me, German text is at least a 98% confident indicator of spam.
On the other hand, there are hundreds of millions of German speakers out there who probably have a different opinion on my conclusion. I can hardly blame them.
The whole point of statistical filters like Bayes is that they let individuals come up with statistical descriptions of their own spam/ham corpus. While you could apply some of these techniques to a universal corpus, it wouldn't be anywhere near as accurate or useful as one built around your own mail patterns.
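To put rough numbers on that, here is a minimal sketch of a per-token spam estimate computed from two different people's mail. The formula is a simplified textbook-style ratio, not SpamAssassin's actual Bayes implementation, and all the counts are invented for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, total_spam, total_ham):
    """Simplified estimate of how strongly a token indicates spam.

    spam_hits / ham_hits: messages of each kind containing the token.
    total_spam / total_ham: messages of each kind the user has trained on.
    """
    spam_rate = spam_hits / total_spam
    ham_rate = ham_hits / total_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Invented counts for a common German word in two users' mail.
english_reader = token_spam_probability(49, 1, 1000, 1000)
german_reader = token_spam_probability(50, 400, 1000, 1000)

print(round(english_reader, 2), round(german_reader, 2))
```

With these made-up counts the same word is a strong spam indicator for the English-only reader and a mild ham indicator for the German reader, which is the problem with pooling everyone's counts into one database.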
Moreover, part of the reason that individual statistical filters are effective is that they are heterogeneous: because everyone has a different profile, it's difficult or impossible for spammers to come up with messages that will be misinterpreted as ham for everyone. If you have any kind of standard, widely used spam characteristics -- like the heuristics encoded into SpamAssassin's rules -- then spammers can exploit those properties to "cloak" their messages as ham. Hence spam headers that name Mutt, Pine, Outlook, Mozilla, Eudora, and AOL as the mail agent, because each of those gives a few "ham" points in SpamAssassin, and hence revisions in later versions of SpamAssassin to penalize messages that cite multiple mail agents.
A universal spam corpus is an appealing idea, and there are research contexts where having one makes sense, but as examples like the mail agent headers show, filtering techniques built around a universal profile would often be easy to defeat. It would be nice if there were easier ways to package them, but generally speaking individual profiles seem to be the way to go.
Re:corpuses (corpi?) don't work that way
jmm on 2004-06-10T14:44:03
I think that you can set things up so that your own private ruleset is used to actively delete spam, while the public ruleset would be used to flag spam while still passing it on. Then, you can scan the flagged items fairly quickly and use them to update your private ruleset.
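As a rough sketch of that two-tier setup: the function names and the idea of boolean verdicts are assumptions made for illustration, not anything SpamAssassin provides out of the box.

```python
def route(private_says_spam, public_says_spam):
    """Decide what to do with a message given two independent verdicts."""
    if private_says_spam:
        return "delete"   # trusted personal ruleset: act on it
    if public_says_spam:
        return "flag"     # shared ruleset: mark it but still deliver
    return "deliver"


def review_flagged(flagged, confirmed_spam_ids):
    """Split flagged message ids into new spam and ham training examples."""
    spam = [m for m in flagged if m in confirmed_spam_ids]
    ham = [m for m in flagged if m not in confirmed_spam_ids]
    return spam, ham


# Hypothetical run: three flagged messages, two confirmed as spam by hand.
decisions = [route(True, False), route(False, True), route(False, False)]
new_spam, new_ham = review_flagged(["m1", "m2", "m3"], {"m1", "m3"})
print(decisions, new_spam, new_ham)
```

The point of the split is that a wrong verdict from the shared ruleset only costs a glance at a flagged message, while the private ruleset keeps learning from whatever gets confirmed.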
Re:locale stuff
Matts on 2004-06-10T16:28:02
Actually what you want is the ok_languages setting. Just set this to be "en" (or whatever languages you can read) and be done with it.
I (in Germany) have been getting a lot of Italian spam lately that was getting past SA. Usually almost all of the spam I get is English, though, with maybe one or two German messages per day.
When we first got SA installed at my company, the sysadmin decided all English mail must be spam and gave it a 4-point spam bonus. I got him to reverse that decision, though.