Unicode transliteration

jplindstrom on 2002-05-11T16:30:02

I'm doing some processing of the Open Directory (dmoz) RDF dump, but couldn't find a decent way to make sense of the Unicode chars.

After a week of having no luck on Google and CPAN, I finally found a way to transliterate UTF-8 text to Latin-1:

http://groups.google.com/groups?selm=note-18266%40php.net

My previous home-grown version is slightly more complete when it comes to e.g. Romanian chars. It's 100% manual though... (I log missing chars and add them by looking at the dmoz.org web site :)


Unicode transliteration

TorgoX on 2002-05-11T20:32:16

Text::Unidecode?

Re:Unicode transliteration

jplindstrom on 2002-05-12T08:07:32

I looked at this, but according to the docs:

"The output of unidecode(...) always consists entirely of US-ASCII characters -- i.e., characters 0x0000 - 0x007F."

That isn't what I need. The output has to match the URLs used at dmoz.org, which include 8-bit chars.

Haven't tried the module though.
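
To make the requirement concrete, here is a minimal sketch in Python (not the code from the linked post, and not my Perl version) of a transliteration that keeps every character Latin-1 already has and only rewrites the rest. The names to_latin1, fits_latin1 and MANUAL_MAP are made up for illustration, the table entries are just examples, and stripping combining marks after NFKD normalization is an assumption about what a reasonable fallback looks like, not a claim about how dmoz.org builds its URLs.

```python
import unicodedata

# Hypothetical hand-maintained table for characters that have no Unicode
# decomposition. Extend it as missing characters get logged.
MANUAL_MAP = {
    "\u0142": "l",   # l with stroke
    "\u0141": "L",   # L with stroke
    "\u0153": "oe",  # oe ligature
    "\u0152": "OE",  # OE ligature
    "\u0111": "d",   # d with stroke
    "\u0110": "D",   # D with stroke
}


def fits_latin1(s):
    """Return True if every character of s exists in ISO-8859-1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def to_latin1(text, placeholder="?"):
    """Transliterate text into the Latin-1 repertoire.

    Returns (converted_text, set_of_characters_that_could_not_be_mapped).
    """
    out = []
    missing = set()
    for ch in text:
        if fits_latin1(ch):
            # Keep 8-bit characters such as e-acute or o-umlaut untouched.
            out.append(ch)
        elif ch in MANUAL_MAP:
            out.append(MANUAL_MAP[ch])
        else:
            # Decompose and drop combining marks, e.g. s-comma -> s.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            if base and fits_latin1(base):
                out.append(base)
            else:
                # Record the character so the table can be extended later.
                missing.add(ch)
                out.append(placeholder)
    return "".join(out), missing
```

Calling to_latin1("Timi\u0219oara caf\u00e9") gives back "Timisoara caf\u00e9" and an empty set: the Romanian s with comma below is reduced to a plain s, while the e-acute survives because Latin-1 has it. Anything that ends up in the returned set is a candidate for a new MANUAL_MAP entry.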

Text::Iconv

Matts on 2002-05-12T17:36:26

I tend to use Text::Iconv, because it's extremely fast.