I'm doing some processing of the OpenDirectory dmoz RDF dump, but couldn't find a decent way to make sense of the Unicode chars.
After a week of no luck on Google and CPAN, I finally found a way to transcribe UTF-8 text to Latin-1:
http://groups.google.com/groups?selm=note-18266%40php.net
My previous home-grown version is slightly more complete when it comes to, e.g., Romanian chars. It's 100% manual, though... (I log missing chars and add them by looking at the dmoz.org web site :)
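For illustration, the manual approach might look roughly like the sketch below. It is written in Python purely as an example and is not the actual code: the `MANUAL_MAP` entries, the `to_latin1` name, and the `?` placeholder are all assumptions. Characters that already fit in Latin-1 pass through unchanged, a hand-maintained table covers a few Romanian letters, and anything else is collected so it can be added to the table later.

```python
# Hypothetical hand-maintained table for characters that have no
# Latin-1 code point. The entries are illustrative, not a complete list.
MANUAL_MAP = {
    "\u0103": "a",  # a with breve
    "\u0219": "s",  # s with comma below
    "\u021b": "t",  # t with comma below
    "\u015f": "s",  # s with cedilla
    "\u0163": "t",  # t with cedilla
}


def to_latin1(text, missing=None):
    """Return `text` as Latin-1 bytes.

    Characters below U+0100 are kept as-is (they map 1:1 to Latin-1),
    characters in MANUAL_MAP are replaced, and anything else becomes "?"
    and is recorded in the optional `missing` set for later review.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x100:
            out.append(ch)
        elif ch in MANUAL_MAP:
            out.append(MANUAL_MAP[ch])
        else:
            if missing is not None:
                missing.add(ch)
            out.append("?")
    return "".join(out).encode("latin-1")


# Input arriving as UTF-8 bytes would be decoded first.
raw = "Timi\u0219oara caf\u00e9".encode("utf-8")
unknown = set()
print(to_latin1(raw.decode("utf-8"), unknown))  # b'Timisoara caf\xe9'
```

Unlike an ASCII-only transliteration, this keeps 8-bit characters such as é intact, and the `unknown` set plays the role of the log of missing chars.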
Re:Unicode transliteration
jplindstrom on 2002-05-12T08:07:32
I looked at this, but according to the docs:
The output of unidecode(...) always consists entirely of US-ASCII characters -- i.e., characters 0x0000 - 0x007F.
That isn't what I need: I need it to map to the URLs used at dmoz.org, which include 8-bit chars.
Haven't tried the module though.