Does anyone know where I can get a good source of Asian ham (non-spam)? I need Japanese, Chinese, Taiwanese, and Korean emails to feed to a classifier.
Re:Asian emails
Matts on 2002-09-30T16:09:36
That's my biggest problem in this: I have absolutely no idea whether something in Korean is spam or not... The only differentiators I have are the "other" clues, such as HTML forms, JavaScript, etc.
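For what it's worth, here is a rough Python sketch of what checking for those language-independent clues might look like. The clue names, the regexes, and the `structural_clues` function are my own illustration, not anyone's actual filter, and it covers only the two clues named above (HTML forms and JavaScript).

```python
import re
from email import message_from_string

# Hypothetical clue patterns; only HTML forms and JavaScript are covered.
CLUES = {
    "html_form": re.compile(r"<form\b", re.IGNORECASE),
    "javascript": re.compile(r"<script\b|javascript:", re.IGNORECASE),
}


def structural_clues(raw_message):
    """Return the sorted names of the clues found in any text part."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        # latin-1 decoding never fails, and the markup we look for is ASCII,
        # so the body language (Korean, Japanese, ...) does not matter here.
        text = payload.decode("latin-1")
        for name, pattern in CLUES.items():
            if pattern.search(text):
                found.add(name)
    return sorted(found)
```

A message with no clues returns an empty list, which says nothing about whether it is ham; it only means these particular checks did not fire.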
Re:asian non-spam email
wickline on 2002-09-30T18:26:32
another source...
I remember DejaNews used to do a respectable job of eliminating spam from their web archive of Usenet. If Google is doing at least as good a job, you could see how comfortable you are with their spam-culling capabilities in English. If the ratio looks good, then you could try harvesting from their non-English groups.
Unfortunately, this assumes that they do as good a job in each language. That may be a horribly flawed assumption, and should certainly be checked. It may be that (like you) they only have structural (rather than content) clues to highlight non-English spam.
Thinking about structural clues gives another strategy...
I'm not sufficiently familiar with Usenet spam to know this, but do English Usenet spams ever have valid References headers? If they never do (or close enough to never), then you could hope that this is also true of non-English Usenet spam (or do a limited test based on your current ability to identify spam from structural clues), and you could restrict your collection to messages with valid References headers.
This could spoil your data by making it more likely for your software to flag non-followup messages as spam, but that could possibly be worked around by also including the referenced message in the collection. This step could introduce spam into the collection, but if you use your existing ability to find spam from structural clues, then hopefully you can keep that to a minimum.
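A minimal Python sketch of that collection step, under some loud assumptions: the articles are already fetched as raw text, and "valid" is taken to mean only that at least one message ID in the References header matches an article in the same batch. The `collect_followups` name and the `MSG_ID` pattern are hypothetical.

```python
import re
from email import message_from_string

# Loose pattern for message IDs such as <abc123@example.com> (an assumption,
# not a full RFC grammar).
MSG_ID = re.compile(r"<[^<>\s]+@[^<>\s]+>")


def collect_followups(raw_articles):
    """Keep follow-ups whose References resolve within the batch,
    plus the articles they reference."""
    by_id = {}
    for raw in raw_articles:
        msg = message_from_string(raw)
        mid = (msg.get("Message-ID") or "").strip()
        if mid:
            by_id[mid] = msg

    keep = {}
    for mid, msg in by_id.items():
        refs = MSG_ID.findall(msg.get("References", ""))
        known = [r for r in refs if r in by_id]
        if known:
            keep[mid] = msg
            # Also keep the referenced articles, so the collection is not
            # made up of follow-ups alone.
            for r in known:
                keep[r] = by_id[r]
    return keep
```

The referenced articles pulled in this way are the ones that would still need a pass through the structural-clue check, since they are where spam could sneak back in.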
All of the above is less relevant if the differences between Usenet and email traffic are likely to spoil your data.
-matt
Re:asian non-spam email
ask on 2002-10-06T11:43:15
> 1) find lists with web archives. Post to them.
That is such a bad idea.
If the list allows everyone to subscribe, it's probably because they haven't had a problem because of it. You'll be a problem.
If the list doesn't allow everyone to post, the moderator will probably get your mail. He gets plenty of spam without you adding to it.
- ask