Got Corpus?

cog on 2004-10-15T10:08:02

I'm looking for reasonable quantities of text in as many languages as I can get my hands on (note: I mean "text in English", "text in French", etc. I do not mean "text with as many languages as possible inside it").

Basically, I'm looking for better training text for my Lingua::Identify project.

If anyone has a couple of pointers (or even a corpus itself, even if it covers just one language), I'd really appreciate it :-)

Oh, one other thing: by "reasonable", I think I'm aiming for something like 10M... but right now I'd just like to get my hands on any corpus (hey, 1M today, 1M tomorrow...)


Google?

Mr. Muskrat on 2004-10-15T14:07:40

Google's advanced search allows you to limit the results to just one particular language; from Arabic to Turkish, you have 35 choices in all.

For example...
Google for Perl in French
Google for Perl in German

Re: Got corpus?

domm on 2004-10-15T15:37:52

I remember there was a talk at some German Perl workshop by a guy (Richard Jelinek) who does language recognition professionally.

Here's his website: http://www.petamem.com/

Here's his talk: http://www2.perl-workshop.de/2003/contriblist.epl#106

Maybe he can help you (at least if you don't plan to take over his business :-)

11 for you to work with

ambs on 2004-10-15T19:29:55

Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish.

What's more, they are parallel! (you know I like them this way)

http://people.csail.mit.edu/people/koehn/publications/europarl/

Association des Bibliophiles Universels

BooK on 2004-10-18T09:08:35

ABU make French literary texts (no copyright strings attached) available for free. These are classic works, so maybe the French is a little too classic for your needs. There's a lot of poetry in there as well.