Perl Unicode

Matts on 2002-01-29T22:35:44

<rant> Ugh. What a mess. I've spent this afternoon and evening hacking on UTF8 support for XML::SAX::PurePerl. Prior to today, if you were running a Perl older than 5.7.2, it would punt on Unicode issues and tell you it couldn't handle them.

That seemed wholly unfair, so I decided to try to fix things.

Doing this stuff on lower perls (lower than 5.7.2 that is) is hard. Really hard. The docs are conflicting, the implementations are broken, and it's just general mayhem and dismay everywhere. Very annoying.

Take regexps, for example. You'd think we could get those right first time for Unicode, huh? Nope. Try doing something like:

    eval '$re1 = qr/[\x{0100}-\x{01FF}]/;';
    "foo" =~ /^$re1$/;

This is a perfectly legal construct, yet on 5.6.1 it borks out complaining about "Illegal hex character {" or something. Grr. Took me absolutely ages to figure out I wasn't doing something stupid. It turns out the solution is as simple as building the anchored regexp inside the same eval:

    eval '$re1 = qr/[\x{0100}-\x{01FF}]/; $re2 = qr/^$re1$/;';
    "foo" =~ $re2;

That should work on 5.6.1 and 5.7.2, but I can't test the latter until tomorrow (though I'm not too worried, since there aren't a whole lot of people using it).
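For anyone who wants to poke at this, here is a minimal sketch of the workaround above as a standalone script. It assumes nothing beyond core Perl; the helper name is_single_ext_char is made up for illustration and is not part of XML::SAX::PurePerl, and I have only reasoned about it against a modern perl, not 5.6.1.

```perl
use strict;
use warnings;

# Package variables, so the string eval below can assign to them.
our ($re1, $re2);

# Compile both patterns inside a string eval, so a perl that cannot
# parse \x{...} fails here at run time instead of refusing the whole file.
my $compiled = eval q{
    $re1 = qr/[\x{0100}-\x{01FF}]/;   # the character class on its own
    $re2 = qr/^$re1$/;                # anchored version, built inside the eval
    1;
};
die "could not compile Unicode patterns: $@" unless $compiled;

# Hypothetical helper: true if $text is exactly one character
# in the range U+0100..U+01FF.
sub is_single_ext_char {
    my ($text) = @_;
    return $text =~ $re2 ? 1 : 0;
}

# Report by code point rather than printing the wide characters themselves.
for my $codepoint (0x66, 0x0100, 0x01FF, 0x0200) {
    printf "U+%04X: %s\n", $codepoint,
        is_single_ext_char(chr $codepoint) ? "match" : "no match";
}
```

The point of keeping both qr// calls in one eval string is that the main program text never interpolates $re1 into a new pattern; it only ever uses the precompiled $re2.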

Then there's upgrading/downgrading scalars to UTF8. In 5.7.2 it's a simple utf8::upgrade/downgrade. But in 5.6.0 it's tr///CU or something, and in 5.6.1 it's pack("U0A*"). I mean, how cryptic can you get? And is that documented clearly anywhere? Not that I can find. At least not in the 5.6.1 docs. There are some docs that allude to those solutions, but it's like trying to find a needle in a haystack when all perlunicode says is "Look into pack("U0...")". I'll look into the bleedperl docs when I get in tomorrow.
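As a rough illustration of the easy case only, here is a sketch of the utf8::upgrade/downgrade calls mentioned above. It assumes a perl recent enough to also have utf8::is_utf8 (which the post does not mention), and it deliberately leaves out the 5.6.0 and 5.6.1 idioms, since I can't vouch for their exact behaviour.

```perl
use strict;
use warnings;

# Four characters, all below 0x100, so the string starts out as plain bytes.
my $text = "caf\xE9";

# Upgrade: same characters, but stored internally as UTF-8.
utf8::upgrade($text);
my $flag_after_upgrade   = utf8::is_utf8($text) ? 1 : 0;
my $length_after_upgrade = length $text;   # character count is unchanged

# Downgrade: back to one byte per character. The true second argument
# asks for a false return value instead of a die on failure.
my $downgraded           = utf8::downgrade($text, 1) ? 1 : 0;
my $flag_after_downgrade = utf8::is_utf8($text) ? 1 : 0;

# A character above 0xFF cannot be downgraded at all.
my $wide            = chr(0x0100);
my $wide_downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag after upgrade:   $flag_after_upgrade\n";
print "length after upgrade: $length_after_upgrade\n";
print "downgrade succeeded:  $downgraded\n";
print "flag after downgrade: $flag_after_downgrade\n";
print "wide char downgraded: $wide_downgraded\n";
```

Neither call changes what the string means; only the internal storage changes.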

It's a shame, because this stuff basically makes it very hard to do "cross platform" Unicode work in Perl, where by "cross platform" I mean reliably on anyone's version of perl. Ironically, the easiest platform of them all was 5.005, where utf8 can just be treated as bytes and worked on that way. Actually that's a bit of a lie - 5.7.2 was very easy once I figured out the binmode stuff, and worked around a couple of (now fixed) bugs.

As a final rant - why doesn't SourceForge's compile farm have 5.6 installed anywhere? Sucks. I had to borrow an account from Barrie Slaymaker. Thanks Barrie.

Anyway, XML::SAX 0.07 is now on CPAN, and PurePerl now handles UTF8 docs properly on all the Perls that I have to test on. The next release adds a nice Intro to SAX doc that should help lots of people (and since it's linked from the other docs, it's way overdue). </rant>


Unicode, pre-5.7

TorgoX on 2002-01-30T01:59:21

I've simply decided that so many things to do with Unicode before 5.7.2 are so broken that there's just no point trying to support it; once you get one thing fixed, it reveals six other things, in-core, that are broken.