MP3::Info and Unicode

pudge on 2002-02-15T16:58:07

So Che_Fox wants MP3::Info to handle Unicode strings. Well, he and others had recently helped me fix some problems with MP3::Info on ID3v2 tags and encoding bytes, so sure, let's look at it.

We figured we could just identify which strings are UTF-16 (the default for ID3v2; UTF-8 is not even supported until ID3v2.4.0, which most software doesn't even support yet) and convert them to UTF-8.

if ($uniconvert && ($encoding eq "\001" || $encoding eq "\002")) {  # UTF-16, UTF-16BE
    my $u = Unicode::String::utf16($data);
    $data = $u->utf8;
}

That worked fine, until we realized that Unicode::String was leaving in the byte-order mark (BOM), and we don't want that. So we strip it out after the fact:

$data =~ s/^\xEF\xBB\xBF//; # strip BOM

Hopefully, that's the right thing. And it seems to work.
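For comparison, here is a minimal sketch of the same conversion using the Encode module instead of Unicode::String. This is an assumption on my part, not what MP3::Info does: Encode ships with perl 5.8 and later, and the helper name utf16_frame_to_utf8 is made up for illustration. The point is that decoding as "UTF-16" reads the BOM to choose the byte order and then drops it, so no separate stripping step is needed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not MP3::Info's code: convert UTF-16 frame data
# (ID3v2 encoding byte \001 or \002) to UTF-8 bytes.
sub utf16_frame_to_utf8 {
    my ($encoding, $data) = @_;

    # "UTF-16" makes Encode read the BOM to choose the byte order and then
    # discard it; "UTF-16BE" assumes big-endian data with no BOM.
    my $name  = $encoding eq "\001" ? 'UTF-16' : 'UTF-16BE';
    my $chars = decode($name, $data);

    return encode('UTF-8', $chars);
}

# "é" (U+00E9) as little-endian UTF-16 with a leading BOM (FF FE)
my $utf8 = utf16_frame_to_utf8("\001", "\xFF\xFE\xE9\x00");
print unpack('H*', $utf8), "\n";   # prints "c3a9", with no BOM bytes in front
```

Note that decoding as "UTF-16" dies if the data has no BOM at all, so real code would probably want an eval around it.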

But then we realize that some tags might be Latin-1 and others might be UTF-8; so what to do? Well, we can convert everything to UTF-8, which will be fine, except that it will break things that want everything to be in Latin-1.

Bah.

I think we're going to make a switch of some kind to tell MP3::Info to convert everything to UTF-8. Bah, again, I say!
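A rough sketch of what such a switch might look like, assuming the Encode module (perl 5.8 and later). The function name normalize_tag and its $want_utf8 argument are hypothetical and are not MP3::Info's interface. With the switch off, the bytes come back exactly as stored, so Latin-1 consumers keep working; with it on, everything comes back as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sketch of the switch; names are made up for illustration.
#   $data            raw bytes from the tag
#   $source_encoding encoding the tag was stored in, e.g. 'ISO-8859-1'
#   $want_utf8       the switch: true means "give me UTF-8 for everything"
sub normalize_tag {
    my ($data, $source_encoding, $want_utf8) = @_;

    return $data unless $want_utf8;                 # switch off: bytes as stored
    return $data if $source_encoding eq 'UTF-8';    # already what was asked for

    # Decode to characters first, then encode those characters as UTF-8.
    return encode('UTF-8', decode($source_encoding, $data));
}

my $latin1 = "\xFF";                                     # ÿ stored as Latin-1
my $as_is  = normalize_tag($latin1, 'ISO-8859-1', 0);    # still the single byte FF
my $utf8   = normalize_tag($latin1, 'ISO-8859-1', 1);    # now the two bytes C3 BF
```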


The Truth about Unicode

mirod on 2002-02-15T17:27:58

We all want Unicode to work, and there is no question that it is the Right Thing (tm) to do, being open, allowing other cultures to join us and use their own writing scheme and all.

The sad truth is that it is actually a huge pain in the ass to implement for most coders, at least in the US and especially in Europe, and I would be really interested to know whether it really makes things easier for Asian coders.

Plus Unicode is usually being forced upon us by XML, which is never a nice thing when you are already fighting with a new technology and have deadlines to meet ;--(

Re:The Truth about Unicode

pudge on 2002-02-15T17:57:42

Is it worse in Europe specifically because UTF-8 and 8-bit Latin-1 are incompatible?

Re:The Truth about Unicode

mirod on 2002-02-15T18:47:14

Yes, XML parsers not only tend to die a swift but painful death when they encounter a Latin-1 (or 2 or more) character, even in a CDATA section, but also, at least XML::Parser converts everything to UTF-8, even if the rest of the environment is entirely Latin-n. This is extremely annoying as it adds an extra level of complexity to all applications, and forces people to care about encodings when really they don't want to.

Re:The Truth about Unicode

Elian on 2002-02-15T18:19:54

The problem with the Asian languages is most of them already have a perfectly serviceable local standard. Big5 for traditional Chinese, GB2312 for simplified Chinese, and Shift-JIS (amongst others) for Japanese. Korean and Vietnamese also have standards that work just fine.

Unicode's in some ways more of a change for them than for us--while ASCII maps to Unicode (especially the utf8 encoding) with no change, the same cannot be said for the Asian languages. For them Unicode's more than just an annoyance, it's something that requires wholesale (and potentially lossy) conversions.

UTF8 versus Latin-1

TorgoX on 2002-02-15T18:53:32

Why not just write things as Latin-1 if they consist only of characters [\x00-\xFF], and UTF8 otherwise?
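One way this suggestion could be sketched, assuming the Encode module (perl 5.8 and later) and a string that has already been decoded to characters. The helper name encode_smallest is hypothetical. It returns the encoding it picked along with the bytes, since the caller has to record somewhere which one was used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: given a string of characters, write it as Latin-1
# when every character fits in one byte, and as UTF-8 otherwise.
# Returns (encoding name, encoded bytes).
sub encode_smallest {
    my ($chars) = @_;

    if ($chars =~ /\A[\x00-\xFF]*\z/) {
        return ('ISO-8859-1', encode('ISO-8859-1', $chars));
    }
    return ('UTF-8', encode('UTF-8', $chars));
}

my ($enc1, $bytes1) = encode_smallest("\x{FF}");     # ÿ fits in Latin-1
my ($enc2, $bytes2) = encode_smallest("\x{263A}");   # U+263A does not
print "$enc1 $enc2\n";   # prints "ISO-8859-1 UTF-8"
```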

Re:UTF8 versus Latin-1

pudge on 2002-02-15T19:09:11

Won't those characters show up wrongly when you expect to see UTF-8 characters, then? I don't really understand. Let's say I have ÿ, \xFF. I assume that character has some other byte representation in UTF-8. But how is that byte represented in UTF-8? Do you understand what it is that I don't understand?
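To make the question concrete, here is a small sketch (assuming the Encode module from perl 5.8 and later) showing the two byte representations of ÿ. In Latin-1 it is the single byte FF; in UTF-8 it is the two bytes C3 BF. The byte FF on its own never appears in valid UTF-8, and reading the UTF-8 bytes as Latin-1 gives two characters instead of one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char   = "\x{FF}";                       # the character ÿ, U+00FF
my $latin1 = encode('ISO-8859-1', $char);    # one byte
my $utf8   = encode('UTF-8', $char);         # two bytes

print "Latin-1 bytes: ", unpack('H*', $latin1), "\n";   # prints "Latin-1 bytes: ff"
print "UTF-8 bytes: ",   unpack('H*', $utf8),   "\n";   # prints "UTF-8 bytes: c3bf"

# Reading the UTF-8 bytes as if they were Latin-1 yields two characters,
# U+00C3 and U+00BF, which is the familiar "Ã¿" garbage.
my $misread = decode('ISO-8859-1', $utf8);
print length($misread), "\n";                # prints "2"
```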

Re:UTF8 versus Latin-1

TorgoX on 2002-02-15T19:18:03

I'm assuming all mp3-readers auto-detect encoding, so there's no "expecting to see UTF8" -- if you see UTF8, you see it and decode it as such, otherwise you assume it's something else. Remember, pretty much only UTF8 looks like UTF8.
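The kind of auto-detection described here might be sketched like this, assuming the Encode module (perl 5.8 and later); looks_like_utf8 is a hypothetical helper name. It simply attempts a strict UTF-8 decode and reports whether that succeeded. It is a heuristic only: plain ASCII passes, and so does any Latin-1 text whose bytes happen to form valid UTF-8 sequences.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the bytes are well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps it
    # from modifying the buffer it was handed.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_utf8("\xC3\xBF"), "\n";   # prints "1": valid UTF-8 for ÿ
print looks_like_utf8("\xFF"),     "\n";   # prints "0": a lone FF is Latin-1 ÿ, not UTF-8
```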

Or: if mp3s have an explicit setting that says what encoding something is, then presumably there's no guesswork involved at all.

Re:UTF8 versus Latin-1

pudge on 2002-02-15T19:46:53

MP3 tags aren't just for MP3 readers, they are for web browsers, databases, text files of various kinds, etc.

Re:UTF8 versus Latin-1

TorgoX on 2002-02-15T21:07:58

By "mp3 reader", I mean anything that accesses the tag data in the files, including libraries that just pass it on to other applications.

But anyway. Ideally, calling applications (like a CGI that passes on the tag data) should make clear what kinds of data-encoding they can or can't cope with.