«
The
crap that Japanese people put up with in their software
just because it has a smidgen of Japanese support is
intolerable; it shouldn't be that 99% of American
programmers have no idea what an umlaut is, or what the
differences between Japanese
and Chinese are.
»
-- Ben's journal
Re:Nice
TorgoX on 2002-03-11T05:06:14
...what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1...
Send them both with all characters at 0x80 and above encoded as &#number; entities. Does that solve the problem?
Re:Nice
pudge on 2002-03-11T05:28:45
Apache::MP3 still needs to know how to encode the specific characters. Don't some characters over 0x80 differ between Latin-1 and UTF-8?
Re:Nice
TorgoX on 2002-03-11T05:37:19
Latin-1 is a subset of Unicode.
What do you mean by "Apache::MP3 still needs to know how to encode the specific characters."? What encoding to declare the HTML as being in? It doesn't matter, if everything outside of 00-7F is turned into &#number; (or %xx in a URL -- which you do to the bytes, not the characters, incidentally).
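A rough sketch of that idea, in Python rather than Perl and not anything Apache::MP3 actually does: the helper `to_entities` and the sample name are made up for illustration. Note that the bytes still have to be decoded with the right source encoding first, which is the part pudge is asking about.

```python
from urllib.parse import quote

def to_entities(text: str) -> str:
    """Replace every character outside 00-7F with a decimal &#number; reference."""
    return "".join(c if ord(c) < 0x80 else f"&#{ord(c)};" for c in text)

# Hypothetical name containing character 239 (i with an umlaut).
name = "na\u00efve"

# The same name arriving as UTF-8 bytes and as Latin-1 bytes.
# Each must be decoded with its own encoding to recover the characters.
from_utf8 = name.encode("utf-8").decode("utf-8")
from_latin1 = name.encode("latin-1").decode("latin-1")

# Once decoded, both produce identical pure-ASCII markup.
print(to_entities(from_utf8))    # na&#239;ve
print(to_entities(from_latin1))  # na&#239;ve

# %xx escaping in a URL is applied to bytes, so it differs per encoding.
print(quote(name.encode("utf-8")))    # na%C3%AFve
print(quote(name.encode("latin-1")))  # na%EFve
```

The entity output is the same whichever encoding the name started in, while the %xx output is not, because it escapes the bytes rather than the characters.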
Re:Nice
pudge on 2002-03-11T05:57:16
If Latin-1 is a subset of Unicode, then why do Latin-1 characters get munged when read as part of a UTF-8 document? I changed one letter of a directory to be ï (i with an umlaut) in Latin-1, and when read as UTF-8, it was messed up. When read as Latin-1, it was fine. In Latin-1, it has a value of decimal 239. Does it have the same value in UTF-8? If so, then what good would it be to print &#239;, since it's already known to be byte 239... wouldn't it still need to be specially encoded somehow so it is known that it is a standalone character instead of part of a multibyte sequence?
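The munging described here can be reproduced in a few lines. This is only a sketch in Python, with a made-up directory name, showing what happens when a lone Latin-1 byte 239 is read as UTF-8:

```python
# Hypothetical directory name stored as Latin-1 bytes: byte 0xEF (239) is i-umlaut.
raw = b"na\xefve"

# Read as Latin-1, byte 239 maps straight to character 239.
as_latin1 = raw.decode("latin-1")
print(as_latin1 == "na\u00efve")  # True

# Read as UTF-8, a lone 0xEF is the start of a multibyte sequence that never
# completes, so a strict decoder rejects it.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# A lenient decoder substitutes U+FFFD, which is the "messed up" character.
munged = raw.decode("utf-8", errors="replace")
print(munged == "na\ufffdve")  # True
```

So the character number is the same in both, but the byte 0xEF on its own is not how UTF-8 writes that character.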
Re:Nice
TorgoX on 2002-03-11T07:20:52
OK, I think you're confusing the encoding and the content. Character point 239 is i-uml in both Latin-1 and Unicode. That's the content.
However, you need to pick one of three encodings: as UTF8, as raw, or as an entity reference.
- If you express 239 as UTF8, it's bytes 0xC3 0xAF, and you should express that this document is encoded as UTF8.
- If you express 239 as raw (an encoding which works only for characters up to 0xFF) it's the single byte 0xEF.
- If you encode it as an entity reference, it's "&#239;", regardless of what this document's declared encoding is (even if it's neither Latin-1 nor Unicode! Spooky, huh?). &#num; always means character number num in Unicode, and the character numbers are the same as Latin-1, for 0-255.
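The three options above can be checked directly. A minimal sketch in Python (the variable names are just for illustration), using only the standard library:

```python
import html

ch = chr(239)  # character number 239, i with an umlaut

# 1. As UTF-8: two bytes.
as_utf8 = ch.encode("utf-8")
print(as_utf8)  # b'\xc3\xaf'

# 2. As raw Latin-1: one byte, only possible for characters up to 0xFF.
as_raw = ch.encode("latin-1")
print(as_raw)  # b'\xef'

# 3. As an entity reference: plain ASCII, so the declared encoding is irrelevant.
as_entity = f"&#{ord(ch)};"
print(as_entity)  # &#239;

# The entity is pure ASCII and names the character number, not a byte.
print(as_entity.encode("ascii"))       # b'&#239;'
print(html.unescape(as_entity) == ch)  # True
```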
Re:Nice
pudge on 2002-03-11T13:56:41
I didn't confuse encoding and content, per se; I merely thought &#239; would, in UTF-8, stand for the byte 239, not character 239. Hum! OK, I'll play around a bit, thanks.