UTF-8 fun in Perl

IlyaM on 2003-06-12T21:41:38

List of minor and not so minor annoyances I've met in the last ~3 days while converting the project I'm working on to UTF-8:

  1. Perl 5.6.x is just broken when it comes to Unicode support, which means if you need Unicode support in Perl you must upgrade. And I thought I'd wait for 5.8.1 before upgrading. Naive me :( - had to upgrade my and other developers' computers and a production server.
  2. It seems XS modules in 5.8.0 don't play well with UTF-8 strings. Examples: if you give a UTF-8 string to Text::CSV_XS it returns a non-UTF-8 string, and if Template Toolkit is configured to use Template::Stash::XS then UTF-8 strings may not work in templates.
  3. None of the Perl modules/templating systems/etc I know do the right thing when URL-escaping UTF-8 strings.
  4. If you expect UTF-8 strings in query parameters and URLs, you have to wrap Apache::Request to transparently convert query parameters and URLs into Perl UTF-8 strings. It would be nice if such a feature were built into this module (I wonder if CGI, CGI::Simple, etc have the same problem).
  5. $hash{bareword} doesn't work if the bareword is a non-ASCII bareword (which, as I understand it, should work when you have use utf8 in your Perl code). What is also interesting is that it doesn't produce any warnings or errors - it just silently returns undef.
  6. GraphViz doesn't work correctly with UTF-8 strings - it seems to generate a correct string in .dot format, but when the module calls IPC::Run the string gets corrupted while being passed to the 'dot' program.
.. I expect to meet even more problems, annoyances and just plain glitches.


That's a good list but it's just the start...

Dom2 on 2003-06-13T07:30:54

Unfortunately, Unicode support in perl still isn't as good as it should be. I've had lots of problems too.

My current favourite is POSTing XML to a server using LWP. You send the XML, it looks fine from the client, but when the server reads it in, it's got the final few characters chopped off. Why? Because when LWP calculates the Content-Length header, it gets the length in characters, not bytes. So you have to make sure that you convert to bytes before you use LWP to send information across a network. Bah!

What's even more annoying about talking to LWP is that the fix is different between 5.6 and 5.8. In 5.6.1, you use pack/unpack. In 5.8, I use encode_utf8.
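For the 5.8 case, a minimal sketch of the character-vs-byte length problem (the string and the surrounding XML are just an illustration, not from the original post):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $xml   = "<note>caf\x{e9}</note>";  # a decoded Perl character string
my $chars = length($xml);              # 17 - length() counts characters here
my $body  = encode_utf8($xml);         # convert to UTF-8 bytes *before* LWP sees it
my $bytes = length($body);             # 18 - this is what Content-Length must be
```

Passing $body (not $xml) as the request content keeps Content-Length honest; on 5.6.1 you would get the same effect with a pack/unpack round-trip, as mentioned above.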

The other real nuisance we've had is DBD::Pg. When it returns strings from the database, they don't have the UTF-8 bit turned on. So you end up with doubly encoded errors when you try to output them. I've put a patch into DBD::Pg that lets you fix this for now, but it's not a particularly pretty solution (it should detect the database encoding and use that).

I have no idea what state the other DBD:: modules are in. When I was trying to get DBD::Pg working, I took a look at a couple and didn't see any calls to SvUTF8_on() or similar, so I suspect that it's not handled.
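Until the drivers do it themselves, one application-level workaround is to decode each fetched string by hand. A sketch, assuming the database encoding is UTF-8 (the literal byte string below stands in for what the driver returns):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Pretend this came out of a DBD::Pg fetch: raw UTF-8 bytes, no UTF-8 flag.
my $raw = "caf\xc3\xa9";

# Reinterpret the bytes as a native Perl character string:
my $str = decode('UTF-8', $raw);
```

This is the same fix the SvUTF8_on() patch does in C, just done per-column in Perl, so it's ugly but portable across drivers.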

More generally, perl 5.6.1 tended to hide unicode problems from you because it didn't have the IO layers that 5.8 does. Because you can tell 5.8 you're going to be sending out UTF-8, you end up with all sorts of double-encoding bugs if you're not careful, most of which would not have happened under 5.6.1. This leads people to needlessly think that 5.8 is broken, when in fact it's the 3rd party software that is bust.
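The double-encoding trap described above is easy to reproduce. A sketch, writing to an in-memory filehandle rather than a real socket or file:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $str = "caf\x{e9}";    # a decoded character string

# Correct: let the :encoding layer do the one and only encode.
open my $ok_fh, '>:encoding(UTF-8)', \my $ok or die $!;
print $ok_fh $str;
close $ok_fh;             # $ok is now "caf\xc3\xa9" - 4 characters became 5 bytes

# Bug: encoding by hand *and* through the layer encodes twice.
open my $bad_fh, '>:encoding(UTF-8)', \my $bad or die $!;
print $bad_fh encode_utf8($str);
close $bad_fh;            # $bad is "caf\xc3\x83\xc2\xa9" - classic mojibake
```

Under 5.6.1 the second version was the only way to do it, which is exactly why code that was "working" there starts emitting garbage once an IO layer is added.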

You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so. Trying to get that working is a lost cause, I feel, until people agree on how it's going to work. At present, there's no way to indicate what character set is in use in a URI.

The next area I want to look at is getting UTF-8 correctly from a POST request (GETs are unlikely to work, given the above paragraph). I have no idea how to force a client to give us stuff in the correct character encoding. And then I have no idea how to make Apache::Request or CGI do the right thing. Sigh. More hard work, which thankfully I've been able to avoid until now.

-Dom

Re:That's a good list but it's just the start...

IlyaM on 2003-06-13T12:27:45

You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so.

HTML 4.01 spec says:

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

If you enter non-ASCII chars, the latest versions of both IE and Opera encode them correctly (i.e. by converting them to UTF-8 first and then converting each byte to %HH). Mozilla 1.0 doesn't do it (I have not tried the latest releases yet).
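That two-step convention is easy to express in Perl. A sketch, hand-rolled with sprintf so it needs nothing beyond Encode (in practice you might run URI::Escape over the encoded bytes instead; the escape_non_ascii name is mine):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

sub escape_non_ascii {
    my ($str) = @_;
    # Step 1: represent each character as one or more UTF-8 bytes.
    my $bytes = encode_utf8($str);
    # Step 2: escape everything outside the RFC 2396 unreserved set as %HH.
    $bytes =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# U+00E9 (e acute) is the byte sequence C3 A9 in UTF-8:
print escape_non_ascii("caf\x{e9}"), "\n";   # caf%C3%A9
```

The key point is the order of operations: encode to bytes first, then %HH-escape the bytes, never the characters.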

And then I have no idea how to make Apache::Request or CGI do the right thing.

I use my own wrapper of Apache::Request:

package Datamodel::Request;

use strict;
use warnings;

use base qw(Apache::Request);

use Datamodel::Tools qw(utf8_upgrade);

sub new {
    my $class = shift;
    my $self = bless $class->SUPER::new(@_), $class;
    return $self;
}

# Return the request URI as a native Perl UTF-8 string.
sub uri {
    my $self = shift;
    return utf8_upgrade($self->SUPER::uri(@_));
}

# Return query parameters upgraded to native Perl UTF-8 strings.
# In scalar context a multi-valued parameter comes back as an array ref.
sub param {
    my $self = shift;

    my @ret = utf8_upgrade($self->SUPER::param(@_));
    if (wantarray) {
        return @ret;
    } else {
        return @ret > 1 ? [ @ret ] : $ret[0];
    }
}

Datamodel::Tools::utf8_upgrade is a sub that converts a byte string containing UTF-8 text into a native Perl UTF-8 string. I think it could be replaced with one of the subroutines from the utf8 module, but I have not tried it (part of this code was written before I decided it was a waste of time trying to work around Unicode problems in 5.6.1, and the utf8 subroutines are only available in 5.8.0).
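utf8_upgrade itself isn't shown in the post; on 5.8 it amounts to Encode::decode (or the in-place utf8::decode from the utf8 module). A hypothetical reconstruction - the package layout and the list handling are my guesses:

```perl
package Datamodel::Tools;   # hypothetical reconstruction, not the original code
use strict;
use warnings;
use Encode qw(decode);
use base qw(Exporter);
our @EXPORT_OK = qw(utf8_upgrade);

# Turn byte strings containing UTF-8 text into native Perl character
# strings; works on a list so it can wrap param() as well as uri().
sub utf8_upgrade {
    my @out = map { defined $_ ? decode('UTF-8', $_) : $_ } @_;
    return wantarray ? @out : $out[0];
}

1;
```

Mapping over the whole argument list is what lets the param() wrapper above pass multi-valued parameters straight through.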

Re:That's a good list but it's just the start...

Dom2 on 2003-06-13T14:46:53

Mmmm, thank you! That's most helpful. Damned annoying about mozilla not doing the right thing though...

-Dom

Re:That's a good list but it's just the start...

IlyaM on 2003-06-13T16:01:13

Mozilla is only half broken. If the (X)HTML document has UTF-8 encoding (as indicated by the Content-Type header), then Mozilla encodes URLs in the (X)HTML page correctly. What it doesn't do (and what is most annoying IMHO) is correctly encode URLs entered manually in the location bar. At least it is so in 1.0, which I still use.