Need asian emails

Matts on 2002-09-30T12:02:22

Anyone know where I can get a good source of Asian hams (non-spams)? I need Japanese, Chinese, Taiwanese, and Korean emails to feed to a classifier.


Asian emails

Odud on 2002-09-30T15:17:09

Asian Usenet groups would, I guess, be a reasonable approximation, presumably if you picked a cross section (arts, comp, etc.). The only danger is that they contain spam - I don't know what the ratio of spam to non-spam would be.

Re:Asian emails

Matts on 2002-09-30T16:09:36

That's my biggest problem in this - I have absolutely no idea if something in Korean is spam or not... The only differentiator I have is the "other" clues, such as HTML forms, JavaScript, etc.
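
A minimal sketch of what checking for those "other" clues might look like, in Python. The function name structural_clues and the two patterns are illustrative assumptions, not a tested rule set; the idea is only that the markup can be inspected without understanding the language of the message.

```python
import re
from email import message_from_bytes

# Hypothetical, language-independent clues. The list is an assumption;
# a real filter would want many more.
CLUE_PATTERNS = {
    "html_form": re.compile(r"<form\b", re.IGNORECASE),
    "javascript": re.compile(r"<script\b|javascript:", re.IGNORECASE),
}

def structural_clues(raw_message):
    """Return the sorted names of the clues found in any text part."""
    msg = message_from_bytes(raw_message)
    found = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        # Decode leniently: we only look for ASCII markup, so the
        # actual character set of the message does not matter here.
        text = payload.decode("latin-1")
        for name, pattern in CLUE_PATTERNS.items():
            if pattern.search(text):
                found.add(name)
    return sorted(found)
```

A message with no clues gives an empty list, which says nothing about whether it is ham; it only means these particular checks did not fire.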

asian non-spam email

wickline on 2002-09-30T18:13:18

Well, I don't have any specific sources, but I can suggest a strategy for obtaining them.

Consider that if you wanted a reliable source of non-spam English email, one method might be to find a selection of email lists which are configured to only allow subscribers to post to the list. Subscribe some address to each list, and start listening in. Most spammers do not bother subscribing to a gazillion email lists in order to post a single spam message (before their posting address gets unsubscribed for their poor behavior).

As some mailing lists are poorly configured and allow access to subscriber lists (which address harvesters take advantage of), you'll probably want to filter the traffic to that address so that you only keep stuff which actually originated from the mailing list server.
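
One rough way to do that filtering, sketched in Python. This assumes the list software adds a List-Id header (RFC 2919), which not every list does; the function name came_from_list and the example list identifier are made up for illustration. Headers can be forged, so checking the Received headers against the list server's host would be a stricter test.

```python
from email import message_from_bytes

def came_from_list(raw_message, expected_list_id):
    """True if the List-Id header mentions the expected list identifier.

    expected_list_id is whatever the list puts in its List-Id header,
    for example "users.lists.example.org" (a hypothetical value).
    """
    msg = message_from_bytes(raw_message)
    list_id = str(msg.get("List-Id", ""))
    # Case-insensitive substring match, since the header usually also
    # carries a free-text description before the <identifier>.
    return expected_list_id.lower() in list_id.lower()
```

Anything that arrives at the subscribed address without the expected header would simply be dropped from the collection.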

Now the problem becomes one of finding Asian mailing lists which are known to allow posts by subscribers only. I can think of a couple of ways you could do that, but neither of them is very fun. Hopefully this is something you're being paid to do...

1) find lists with web archives. Post to them. See if your posts appear in the web archive. If the list has a web archive and allows posts from non-subscribers, then odds are high that they already get plenty of spam. Perhaps your post could be a message about how this combination of options is a problematic choice, and how they could benefit from only allowing posts by subscribers. If your post never shows up, then they probably only allow subscribers to post.

2) find collections of lists which describe the list options in a standard format. Understand one example of that format well enough to automate searches through the others for lists which only allow subscribers to post. Some ways to understand the first example in a given collection might include web translation tools, asking someone who speaks the language, or looking at an English version (maybe lists from yahoo are described similarly to lists from yahoo.jp).

Good luck :)

-matt

Re:asian non-spam email

wickline on 2002-09-30T18:26:32

another source...

I remember dejanews used to do a respectable job of eliminating spam from their web archive of usenet. If google is doing at least as good a job, you could see how comfortable you are with their spam-culling capabilities in English. If the ratio looks good, then you could try harvesting from their non-English groups.

Unfortunately, this assumes that they do as good a job in each language. That may be a horribly flawed assumption, and should certainly be checked. It may be that (like you) they only have structural (rather than content) clues to highlight non-English spam.

Thinking about structural clues gives another strategy...

I'm not sufficiently familiar with usenet spam to know this, but do English usenet spams ever have valid References headers? If they never do (or close enough to never), then you could hope that this is also true of non-English usenet spam (or do a limited test based on your current ability to identify spam from structural clues), and you could restrict your collection to messages with valid References headers.

This could spoil your data by making it more likely for your software to flag non-followup messages as spam, but that could possibly be worked around by also including the referenced message in the collection. This step could introduce spam into the collection, but if you use your existing ability to find spam from structural clues, then hopefully you can keep that to a minimum.
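
A sketch of that selection step in Python. Here "valid" is taken to mean that the References header names a message which is also in the harvested set; that definition, the function name select_followups, and the simple Message-ID pattern are all assumptions for illustration.

```python
import re
from email import message_from_bytes

# Loose pattern for a Message-ID such as <abc123@news.example.org>.
MSG_ID = re.compile(r"<[^<>\s]+@[^<>\s]+>")

def select_followups(raw_messages):
    """Return the Message-IDs worth keeping: follow-ups whose References
    header points at a message we hold, plus the referenced messages."""
    by_id = {}
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        mid = MSG_ID.search(str(msg.get("Message-ID", "")))
        if mid:
            by_id[mid.group(0)] = msg
    keep = set()
    for mid, msg in by_id.items():
        refs = MSG_ID.findall(str(msg.get("References", "")))
        known = [r for r in refs if r in by_id]
        if known:
            keep.add(mid)       # the follow-up itself
            keep.update(known)  # the messages it refers to
    return keep
```

The referenced messages pulled in this way have not themselves passed the References test, so they are the ones to run through the structural-clue checks before trusting them.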

All of the above is less relevant if the differences between usenet and email traffic are likely to spoil your data.

-matt

Re:asian non-spam email

ask on 2002-10-06T11:43:15

> 1) find lists with web archives. Post to them.

That is such a bad idea.

If the list allows everyone to subscribe, it's probably because that doesn't cause them a problem. You'll be a problem.

If the list doesn't allow everyone to post, the moderator will probably get your mail. He gets plenty of spam without you adding to it.

  - ask