Perl UTF-8 and latin-1 woes

miyagawa on 2006-09-08T05:49:58

I had thought I understood Perl's UTF-8 flag and Unicode handling very well, having dealt with I18N and L10N issues in Perl professionally for more than 5 years.

But it turns out that I still have things to learn, or at least things I've only learned recently.

So here's the code.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);

use XML::RSS;
use XML::RSS::LibXML;
use XML::Atom::Feed;
use Test::More 'no_plan';

$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";

# The same title in three representations
my %data;
$data{latin1}  = "Diction" . chr(225) . "rios";     # latin-1 bytes
$data{utf8}    = "Diction" . "\xc3\xa1" . "rios";   # UTF-8 bytes
$data{unicode} = decode_utf8($data{utf8});          # UTF-8 flagged string

my %code = (
    'XML::RSS'         => \&test_xml_rss,
    'XML::RSS::LibXML' => \&test_xml_rss_libxml,
    'XML::Atom'        => \&test_xml_atom,
);

for my $module (qw(XML::RSS XML::RSS::LibXML XML::Atom)) {
    for my $label (qw(latin1 utf8 unicode)) {
        $code{$module}->($data{$label}, $label);
    }
}

# Compare two strings after normalizing both to Unicode strings
sub is_same {
    my($str1, $str2) = map _unicode($_), @_[0..1];
    is $str1, $str2, pop(@_);
}

# Heuristic: flagged strings pass through, bytes containing \xc3 are
# treated as UTF-8, and anything else is treated as latin-1
sub _unicode {
    my $str = shift;
    return $str if utf8::is_utf8($str);
    return Encode::decode_utf8($str) if $str =~ /\xc3/;
    return Encode::decode('latin-1', $str);
}

sub test_xml_rss {
    my($string, $label) = @_;

    my $rss = XML::RSS->new;
    $rss->channel(title => $string);

    my $xml = $rss->as_string;
    diag "XML::RSS + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;

    $rss = XML::RSS->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS $label";
    };
    fail "XML::RSS $label" if $@;
}

sub test_xml_rss_libxml {
    my($string, $label) = @_;

    my $rss = XML::RSS::LibXML->new;
    $rss->channel(title => $string);

    my $xml = $rss->as_string;
    diag "XML::RSS::LibXML + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;

    $rss = XML::RSS::LibXML->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS::LibXML $label";
    };
    fail "XML::RSS::LibXML $label" if $@;
}

sub test_xml_atom {
    my($string, $label) = @_;

    my $feed = XML::Atom::Feed->new;
    $feed->title($string);

    my $xml = $feed->as_xml;
    diag "XML::Atom + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;

    eval {
        my $tmp = write_file($xml);
        $feed = XML::Atom::Feed->new($tmp);
        is_same $feed->title, $string, "XML::Atom $label";
    };
    fail "XML::Atom $label" if $@;
}

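# Write $data to a temp file with no encoding layer and return the file name
sub write_file {
    my $data = shift;
    # UNLINK is tempfile()'s cleanup option; CLEANUP belongs to tempdir()
    my($fh, $name) = tempfile(UNLINK => 1);
    print $fh $data;
    close $fh;
    return $name;
}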


8 out of 9 tests will fail. It's because the output methods of XML::RSS::LibXML and XML::Atom (as_string() and as_xml() respectively) return a UTF-8 flagged string, regardless of what the input data was. So, if and only if the string consists entirely of characters with code points of 255 or below (= latin-1 range), perl will print them as latin-1 bytes, unless we supply the encoding explicitly.

It doesn't happen if the string contains any Unicode character above 255 (perl then writes the whole string as UTF-8 and emits a "Wide character in print" warning), which is quite annoying in terms of consistency.
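To see the inconsistency in isolation, here's a minimal sketch that uses only core modules. bytes_written() is a helper name I made up for illustration, and utf8::upgrade() merely stands in for a module handing back a UTF-8 flagged string.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print $str to a temp file without any encoding
# layer, then return the bytes that actually landed on disk as hex.
sub bytes_written {
    my $str = shift;
    my($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;                 # raw bytes, no encoding layer
    {
        no warnings 'utf8';      # silence "Wide character in print" here
        print $fh $str;
    }
    close $fh;

    open my $in, '<:raw', $name or die "open $name: $!";
    local $/;                    # slurp mode
    my $bytes = <$in>;
    close $in;
    return unpack 'H*', $bytes;
}

my $narrow = "\x{e1}";           # a-acute, code point 225
utf8::upgrade($narrow);          # set the UTF-8 flag, same characters
my $wide   = "\x{e1}\x{3042}";   # a-acute plus a character above 255

print bytes_written($narrow), "\n";   # e1         (one latin-1 byte)
print bytes_written($wide),   "\n";   # c3a1e38182 (UTF-8 for both)
```

The same a-acute comes out as one byte in the first case and two bytes in the second, depending only on what else is in the string.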

Obviously, the fix is to add:

binmode $fh, ":utf8";


before actually printing the XML to the file, or to use Encode::encode_utf8 or something equivalent. What makes things a bit worse is that neither XML::RSS::LibXML nor XML::Atom documents whether the output data is UTF-8 flagged or not. Even worse, XML::RSS output may or may not be UTF-8 flagged, depending on the input.
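As a rough sketch of the two fixes side by side (core modules only): slurp_hex() is a made-up helper, and $xml here is just a flagged string standing in for what as_xml() or as_string() would return, not real output from those modules.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

# Hypothetical helper: return a file's raw bytes as a hex string.
sub slurp_hex {
    my $name = shift;
    open my $in, '<:raw', $name or die "open $name: $!";
    local $/;                    # slurp mode
    my $bytes = <$in>;
    close $in;
    return unpack 'H*', $bytes;
}

my $xml = "Diction\x{e1}rios";
utf8::upgrade($xml);             # stand-in for a UTF-8 flagged return value

# Fix 1: put a :utf8 layer on the handle and print the string as-is
my($fh1, $file1) = tempfile(UNLINK => 1);
binmode $fh1, ':utf8';
print $fh1 $xml;
close $fh1;

# Fix 2: encode to UTF-8 bytes yourself and print to a raw handle
my($fh2, $file2) = tempfile(UNLINK => 1);
binmode $fh2;
print $fh2 encode_utf8($xml);
close $fh2;

my $via_layer  = slurp_hex($file1);
my $via_encode = slurp_hex($file2);
print "layer:  $via_layer\n";
print "encode: $via_encode\n";
print $via_layer eq $via_encode ? "same bytes\n" : "different bytes\n";
```

Either way the a-acute ends up as the two bytes c3 a1 in the file. Just don't apply both at once.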

As the author of XML::Atom, I was about to change the as_xml() implementation to force UTF-8 binary output, rather than a UTF-8 flagged string. But I hesitate to push the code now, since it *might* break backward compatibility. There could be some code that expects $feed->as_xml to return a Unicode string and opens the filehandle in utf8 mode. In that case, the users would get a double UTF-8 encoded string.

I chatted about this with Daisuke, the author of XML::RSS::LibXML, and we agreed it's all about documentation, or perhaps adding another option to force UTF-8 binary output.
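The double encoding is easy to reproduce without XML::Atom at all. In this sketch, $encoded plays the role of what a byte-returning as_xml() would hand back (that's an assumption for illustration, not the current behavior), and the filehandle is opened the way existing caller code might do it.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

my $title   = "\x{e1}";               # a-acute as a Unicode string
my $encoded = encode_utf8($title);    # UTF-8 bytes: c3 a1

# Caller code that still assumes a Unicode string and adds a :utf8 layer
my($fh, $name) = tempfile(UNLINK => 1);
binmode $fh, ':utf8';
print $fh $encoded;                   # each byte is encoded once more
close $fh;

open my $in, '<:raw', $name or die "open $name: $!";
my $bytes = do { local $/; <$in> };   # slurp the raw bytes
close $in;

my $double = unpack 'H*', $bytes;
print "$double\n";                    # c383c2a1 instead of c3a1
```

The layer treats the bytes c3 and a1 as the characters U+00C3 and U+00A1 and encodes each of them again, so the file ends up with four bytes instead of two.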