Internationalization

TorgoX on 2002-03-11T02:53:15

« The crap that Japanese people put up with in their software just because it has a smidgen of Japanese support is intolerable; it shouldn't be that 99% of American programmers have no idea what an umlaut is, or what the differences between Japanese and Chinese are. »
-- Ben's journal


Nice

pudge on 2002-03-11T03:59:25

It'd be nice to do, if it weren't so difficult. I've already somewhat documented the annoyances of trying to get UTF8 support into MP3::Info, and of getting that to play nicely with Apache::MP3 (what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1?). It is not an easy thing to do, and you need to weigh the cost against the benefit.

Consider that charsets are difficult to understand for those who don't already understand them, which is a truism, but relevant since most American computer programmers don't need to understand them. Consider that, similarly, most American computer programmers don't have a use for them, so adding support for them not only has no direct benefit, but also doesn't scratch those developers' itches. Blah blah blah. Is magical handling of I18N the next killer app?

Re:Nice

TorgoX on 2002-03-11T05:06:14

...what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1...

Send them both with all characters over 0x80 encoded as &#number; entities. Does that solve the problem?
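For the Latin-1 side that's nearly a one-liner, since every Latin-1 byte value is already the right Unicode character number. Something like this, say (a rough sketch; the sub name is made up, and the UTF-8 side would need decoding to characters first, e.g. with the Encode module):

  use strict;
  use warnings;

  # Turn every byte above 0x7F into a numeric character reference.
  # In Latin-1 the byte value *is* the Unicode code point, so ord() is enough.
  sub latin1_to_entities {
      my ($text) = @_;
      $text =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
      return $text;
  }

  print latin1_to_entities("na\xEFve"), "\n";   # prints "na&#239;ve"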

Re:Nice

pudge on 2002-03-11T05:28:45

Apache::MP3 still needs to know how to encode the specific characters. Don't some characters over 0x80 differ between Latin-1 and UTF-8?

Re:Nice

TorgoX on 2002-03-11T05:37:19

Latin-1 is a subset of Unicode.

What do you mean by "Apache::MP3 still needs to know how to encode the specific characters"? What encoding to declare the HTML as being in? It doesn't matter, as long as everything outside of 00-7F is turned into &#number; (or %xx in a URL -- which you do to the bytes, not the characters, incidentally).
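To make the bytes-not-characters point concrete, here's a rough sketch of %xx escaping (hand-rolled for illustration; real code would more likely use something like URI::Escape):

  use strict;
  use warnings;

  # %xx-escape every byte that isn't an unreserved URL character.
  # This operates on bytes: the UTF-8 form of i-umlaut is the two bytes
  # 0xC3 0xAF, so it escapes to two %xx sequences.
  sub url_escape_bytes {
      my ($bytes) = @_;
      $bytes =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
      return $bytes;
  }

  print url_escape_bytes("\xC3\xAF"), "\n";   # %C3%AF  (UTF-8 bytes)
  print url_escape_bytes("\xEF"), "\n";       # %EF     (raw Latin-1 byte)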

Re:Nice

pudge on 2002-03-11T05:57:16

If Latin-1 is a subset of Unicode, then why do Latin-1 characters get munged when read as part of a UTF-8 document? I changed one letter of a directory to be ï (i with an umlaut) in Latin-1, and when read as UTF-8, it was messed up. When read as Latin-1, it was fine. In Latin-1, it has a value of decimal 239. Does it have the same value in UTF-8? If so, then what good would it be to print &#239;, since it's already known to be byte 239 ... wouldn't it still need to be specially encoded somehow so it is known that it is a standalone character instead of part of a multibyte sequence?

Re:Nice

TorgoX on 2002-03-11T07:20:52

OK, I think you're confusing the encoding and the content. Code point 239 is i-uml in both Latin-1 and Unicode. That's the content.

However, you need to pick one of three encodings: as UTF8, as raw, or as an entity reference.

  • If you express 239 as UTF8, it's the bytes 0xC3 0xAF, and you should declare that this document is encoded as UTF8.
  • If you express 239 as raw (an encoding which works only for characters up to 0xFF), it's the single byte 0xEF.
  • If you encode it as an entity reference, it's "&#239;", regardless of what this document's declared encoding is (even if it's neither Latin-1 nor Unicode! Spooky, huh?). &#num; always means character number num in Unicode, and the character numbers are the same as in Latin-1, for 0-255.
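In Perl terms, a small sketch of those three forms (this uses the Encode module just as one convenient way to do the byte-level conversions; it's not anything Apache::MP3 does itself):

  use strict;
  use warnings;
  use Encode qw(encode);

  my $char = chr(239);                       # one character: i-umlaut, code point 0xEF

  my $utf8 = encode('UTF-8', $char);         # two bytes: 0xC3 0xAF
  my $raw  = encode('ISO-8859-1', $char);    # one byte:  0xEF ("raw" Latin-1)
  my $ent  = '&#' . ord($char) . ';';        # "&#239;", independent of the document's declared encoding

  printf "UTF-8: %v02X\n", $utf8;            # UTF-8: C3.AF
  printf "raw:   %v02X\n", $raw;             # raw:   EF
  print  "ent:   $ent\n";                    # ent:   &#239;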

Re:Nice

pudge on 2002-03-11T13:56:41

I didn't confuse encoding and content, per se; I merely thought &#239; would, in UTF-8, stand for the byte 239, not character 239. Hum! OK, I'll play around a bit, thanks.

Libraries

koschei on 2002-03-12T06:03:52

I do a lot of my journal reading in a library at university. Today I was doing it in the 2nd floor lab - just by the P-PS range of books (as in, I go out the door of the lab and am faced with P200-220).

So I looked that book up, found it was P211, spotted it from the lab =) I'll borrow it when I leave.