Unicode, unicode

darobin on 2002-01-24T18:09:21

It is often heard that Perl isn't yet very Unicode friendly, but that it'll be fixed in 5.8 (definitely a good thing). People point fingers at Perl for that but little do they realize how hard the problem is, and how much of the world is broken.

For instance, while trying to plug an XSS vulnerability in Apache::Util::escape_html, Geoff Young and I found out that that sub wasn't UTF-8 aware. If you have double-byte chars, one byte of which happens to match, say, <, it'll happily turn that part of the char into &lt; which will of course not be exactly what you want :-)
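
To make that concrete, here is a minimal sketch in Python. It is not the Apache::Util code; escape_html_bytes is a hypothetical stand-in for any escaper that looks at one byte at a time, and UTF-16BE is assumed as the double-byte encoding because it makes the collision easy to see.

```python
def escape_html_bytes(data: bytes) -> bytes:
    """Hypothetical byte-oriented escaper: treats every byte as one character."""
    table = {0x26: b"&amp;", 0x3C: b"&lt;", 0x3E: b"&gt;"}
    return b"".join(table.get(b, bytes([b])) for b in data)


sun = "\u263c"                    # WHITE SUN WITH RAYS, a single character
raw = sun.encode("utf-16-be")     # two bytes: 0x26 0x3C, i.e. ASCII "&" and "<"
escaped = escape_html_bytes(raw)  # both halves of the character get "escaped"

print(raw)      # b'&<'
print(escaped)  # b'&amp;&lt;'
```

One character goes in, and nine bytes of entity soup come out, none of which decode back to the original character.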

And of course, UTF-8 is just the tip of the problem. Anyone using UTF-16 will most likely see the strings they feed to C libs happily truncated at the first sign of a 0x00 hanging around in there, which is likely to be one of the very first bytes of the string if the content happens to be simple English characters.
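
A small Python sketch of the truncation, under the assumption that the C side treats its input as a NUL-terminated string. The helper as_c_string is hypothetical; it only simulates what such an API would see.

```python
def as_c_string(data: bytes) -> bytes:
    """Return what a NUL-terminated C API would see: everything before the first 0x00."""
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]


text = "hello"
be = text.encode("utf-16-be")  # b'\x00h\x00e\x00l\x00l\x00o'
le = text.encode("utf-16-le")  # b'h\x00e\x00l\x00l\x00o\x00'

print(len(be), as_c_string(be))  # 10 b''
print(len(le), as_c_string(le))  # 10 b'h'
```

Ten bytes of perfectly good data shrink to zero or one, depending on byte order.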

In fact, that's what happens when you feed UTF-16 to Apache::Util, as well as probably to any C lib not equipped to deal with multibyte chars. Other problems include conversions between \n and \r\n and co. which'll just break UTF-16 (this is a problem in XML::SAX::Writer which I hope to partially fix).
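
The newline problem can be sketched the same way. This is an illustration in Python, not the XML::SAX::Writer code; to_crlf_bytes is a hypothetical byte-level translator that knows nothing about the encoding of the data it is given.

```python
def to_crlf_bytes(data: bytes) -> bytes:
    """Hypothetical byte-level newline translation, unaware of the encoding."""
    return data.replace(b"\n", b"\r\n")


text = "a\nb"
raw = text.encode("utf-16-le")  # b'a\x00\n\x00b\x00', 6 bytes
broken = to_crlf_bytes(raw)     # b'a\x00\r\n\x00b\x00', 7 bytes: no longer whole 16-bit units

try:
    broken.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Translating the text first and encoding afterwards keeps the data valid.
fixed = text.replace("\n", "\r\n").encode("utf-16-le")

print(len(raw), len(broken), decoded_ok)  # 6 7 False
print(fixed.decode("utf-16-le") == "a\r\nb")  # True
```

The extra 0x0D byte shifts everything after it by one, so the rest of the stream is read as the wrong 16-bit units. Doing the conversion on characters rather than bytes avoids that.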

So Perl is far from being the only unfriendly part here; in fact, it's probably one of the most friendly components. I have no end of admiration for the people who are working on fixing such complex issues and even making them DWIM.

Note to self: don't use Apache::Util unless there is certainty that the strings are encoding safe.