A list of minor and not-so-minor annoyances I've met in the last ~3 days while converting the project I'm working on to UTF-8:
My current favourite is POSTing XML to a server using LWP. You send the XML, it looks fine from the client, but when the server reads it in, the final few characters have been chopped off. Why? Because when LWP calculates the Content-Length header, it takes the length in characters, not bytes. So you have to make sure that you convert to bytes before you use LWP to send information across a network. Bah!
What's even more annoying about talking with LWP is that the fix differs between 5.6 and 5.8. In 5.6.1, you use pack/unpack. In 5.8, I use encode_utf8.
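To make the character/byte mismatch concrete, here is a minimal sketch for 5.8 using only the core Encode module. The helper name post_body_bytes and the sample XML are made up for illustration, and the LWP call itself is left out so the snippet runs without a network; the point is only that the value handed to LWP as the request body should be the encoded octets.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: turn a Perl character string into the octets
# that should actually go on the wire as the POST body.
sub post_body_bytes {
    my ($xml) = @_;
    return encode_utf8($xml);
}

# Sample document containing two non-ASCII characters
# (U+00E9 and U+263A); the content is invented for this sketch.
my $xml   = "<name>caf\x{e9} \x{263a}</name>";
my $bytes = post_body_bytes($xml);

# length() counts characters on the first string and octets on the
# second, so the two numbers differ whenever non-ASCII text is present.
printf "characters: %d\n", length $xml;
printf "octets:     %d\n", length $bytes;
```

If the character count ends up in Content-Length, the server stops reading that many octets in, which matches the "final few characters chopped off" symptom.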
The other real nuisance we've had is DBD::Pg. When it returns strings from the database, they don't have the UTF-8 bit turned on. So you end up with doubly encoded errors when you try to output them. I've put a patch into DBD::Pg that lets you fix this for now, but it's not a particularly pretty solution (it should detect the database encoding and use that).
I have no idea what state the other DBD:: modules are in. When I was trying to get DBD::Pg working, I took a look at a couple and didn't see any calls to SvUTF8_on() or similar, so I suspect that it's not handled.
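The missing flag can be illustrated without a database. This is only a sketch: the value in $from_db is simulated with encode_utf8 rather than fetched through DBI, and decoding by hand in application code is an assumption about how one might work around a driver that does not set the flag, not a description of the patch mentioned above.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Simulate what such a driver hands back: the UTF-8 octets for the
# text, but as a plain byte string with the UTF-8 flag off.
my $from_db = encode_utf8("na\x{ef}ve");

# Perl sees one "character" per octet at this point.
my $flag_before   = utf8::is_utf8($from_db) ? 1 : 0;
my $length_before = length $from_db;

# Decoding turns the octets into a proper character string,
# which is what the driver would ideally have returned.
my $chars        = decode_utf8($from_db);
my $flag_after   = utf8::is_utf8($chars) ? 1 : 0;
my $length_after = length $chars;

print "before: flag=$flag_before length=$length_before\n";
print "after:  flag=$flag_after length=$length_after\n";
```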
More generally, perl 5.6.1 tended to hide unicode problems from you because it didn't have the IO layers that 5.8 does. Because you can tell 5.8 you're going to be sending out UTF-8, you end up with all sorts of double-encoding bugs if you're not careful, most of which would not have happened under 5.6.1. This leads people to needlessly think that 5.8 is broken, when in fact it's the 3rd party software that is bust.
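A small sketch of how the 5.8 IO layers expose this kind of bug, assuming an in-memory filehandle as a stand-in for STDOUT or a socket. The helper through_utf8_layer is hypothetical; it just prints a string through a :utf8 layer and returns the octets that came out.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: write a string through a :utf8 output layer
# into an in-memory buffer and return the octets that were written.
sub through_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Already-encoded data, as third-party code might hand it over:
# two octets for U+00E9, flag off.
my $octets = encode_utf8("\x{e9}");
# The same text as a real character string.
my $chars  = decode_utf8($octets);

# The layer encodes whatever it is given, so the undecoded octets
# get encoded a second time while the character string comes out right.
my $double = through_utf8_layer($octets);
my $single = through_utf8_layer($chars);

printf "undecoded input -> %d octets\n", length $double;
printf "decoded input   -> %d octets\n", length $single;
```

Under 5.6.1 there was no such layer, so the already-encoded octets would have gone out untouched and the problem stayed hidden.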
You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so. Trying to get that working is a lost cause, I feel, until people agree on how it's going to work. At present, there's no way to indicate what character set is in use in a URI.
The next area I want to look at is getting UTF-8 correctly from a POST request (GETs are unlikely to work, given the above paragraph). I have no idea how to force a client to give us data in the correct character encoding. And then I have no idea how to make Apache::Request or CGI do the right thing. Sigh. More hard work, which thankfully I've been able to avoid until now.
-Dom
Re:That's a good list but it's just the start...
IlyaM on 2003-06-13T12:27:45
"You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so."
The HTML 4.01 spec says:
"We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases: 1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes. 2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value)."
If you enter non-ASCII chars, the latest versions of both IE and Opera encode them correctly (i.e. by converting them to UTF-8 first and then converting each byte to %HH). Mozilla 1.0 doesn't do it (I have not tried the latest releases yet).
"And then I have no idea how to make Apache::Request or CGI do the right thing."
I use my own wrapper of Apache::Request, shown below. Datamodel::Tools::utf8_upgrade is a sub that converts a byte string containing UTF-8 text into a native UTF-8 Perl string. I think it can be replaced with one of the subroutines from the utf8 module, but I have not tried it (part of this code was written before I decided it is a waste of time trying to work around Unicode problems in 5.6.1, and the utf8 subroutines are only available in 5.8.0).
package Datamodel::Request;
use strict;
use warnings;
use base qw(Apache::Request);
use Datamodel::Tools qw(utf8_upgrade);
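
# Construct the underlying Apache::Request and rebless it into this class.
sub new {
    my $class = shift;
    my $self = bless $class->SUPER::new(@_), $class;
    return $self;
}

# Return the request URI as a native UTF-8 Perl string.
sub uri {
    my $self = shift;
    return utf8_upgrade($self->SUPER::uri(@_));
}

# Return parameter values as native UTF-8 Perl strings.
# In scalar context, several values come back as an array ref.
sub param {
    my $self = shift;
    my @ret = utf8_upgrade($self->SUPER::param(@_));
    if (wantarray) {
        return @ret;
    } else {
        return @ret > 1 ? [ @ret ] : $ret[0];
    }
}

1;

Re:That's a good list but it's just the start...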
Dom2 on 2003-06-13T14:46:53
Mmmm, thank you! That's most helpful. Damned annoying about Mozilla not doing the right thing though...
-Dom
Re:That's a good list but it's just the start...
IlyaM on 2003-06-13T16:01:13
Mozilla is only half broken. If an (X)HTML document has UTF-8 encoding (as indicated by the Content-Type headers), then Mozilla encodes URLs in the (X)HTML page correctly. What it doesn't do (and what is most annoying IMHO) is correctly encode URLs entered manually in the location bar. At least that is so in 1.0, which I still use.
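For reference, the HTML 4.01 convention quoted earlier in the thread (represent each character in UTF-8, then escape each byte as %HH) can be sketched in a few lines of 5.8 Perl. The sub names are made up for this example, and the set of characters left unescaped is an assumption (ASCII letters, digits and "-", "_", ".", "~"); a real application would more likely use an existing URI-escaping module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: apply the two steps of the convention.
sub escape_uri_component {
    my ($text) = @_;
    # Step 1: represent each character in UTF-8.
    my $octets = encode_utf8($text);
    # Step 2: escape every byte outside the assumed safe set as %HH.
    $octets =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $octets;
}

# Hypothetical inverse, as a server-side wrapper might do it:
# undo the %HH escapes, then decode the UTF-8 octets to characters.
sub unescape_uri_component {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/ge;
    return decode_utf8($octets);
}

my $escaped   = escape_uri_component("caf\x{e9}");
my $roundtrip = unescape_uri_component($escaped);

print "$escaped\n";    # prints caf%C3%A9
```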