Perl UTF-8 and latin-1 woes

miyagawa on 2006-09-08T05:49:58

I had thought that I understood Perl's UTF-8 flag and Unicode handling very well, after more than 5 years of professional experience with I18N and L10N issues in Perl.

But it turns out that I still had something to learn, or at least something I only learned recently.

So here's the code.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);

use XML::RSS;
use XML::RSS::LibXML;
use XML::Atom::Feed;
use Test::More 'no_plan';

$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";

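my %data;
# latin-1 byte string: a single byte 0xE1 for "a with acute"
$data{latin1}  = "Diction" . chr(225) . "rios";
# UTF-8 encoded byte string (not flagged)
$data{utf8}    = "Diction" . "\xc3\xa1" . "rios";
# decoded character string (UTF-8 flagged)
$data{unicode} = decode_utf8($data{utf8});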

my %code = (
    'XML::RSS' => \&test_xml_rss,
    'XML::RSS::LibXML' => \&test_xml_rss_libxml,
    'XML::Atom' => \&test_xml_atom,
);

for my $module (qw(XML::RSS XML::RSS::LibXML XML::Atom)) {
    for my $label (qw(latin1 utf8 unicode)) {
        $code{$module}->($data{$label}, $label);
    }
}

sub is_same {
    my($str1, $str2) = map _unicode($_), @_[0..1];
    is $str1, $str2, pop(@_);
}

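# Normalize either kind of string to a character string for comparison.
# The /\xc3/ check is only a heuristic that is good enough for this test data.
sub _unicode {
    my $str = shift;
    return $str if utf8::is_utf8($str);
    return Encode::decode_utf8($str) if $str =~ /\xc3/;
    return Encode::decode('latin-1', $str);
}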

sub test_xml_rss {
    my($string, $label) = @_;

    my $rss = XML::RSS->new;
    $rss->channel(title => $string);

    my $xml = $rss->as_string;
    diag "XML::RSS + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;

    $rss = XML::RSS->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS $label";
    };
    fail "XML::RSS $label" if $@;
}

sub test_xml_rss_libxml {
    my($string, $label) = @_;

    my $rss = XML::RSS::LibXML->new;
    $rss->channel(title => $string);

    my $xml = $rss->as_string;
    diag "XML::RSS::LibXML + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;

    $rss = XML::RSS::LibXML->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS::LibXML $label";
    };
    fail "XML::RSS::LibXML $label" if $@;
}

sub test_xml_atom {
    my($string, $label) = @_;

    my $feed = XML::Atom::Feed->new;
    $feed->title($string);

    my $xml = $feed->as_xml;
    diag "XML::Atom + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;

    eval {
        my $tmp = write_file($xml);
        $feed = XML::Atom::Feed->new($tmp);
        is_same $feed->title, $string, "XML::Atom $label";
    };
    fail "XML::Atom $label" if $@;
}

sub write_file {
    my $data = shift;
    # UNLINK is the tempfile() option; CLEANUP only applies to tempdir()
    my($fh, $name) = tempfile(UNLINK => 1);
    print $fh $data;
    close $fh;
    return $name;
}

8 out of 9 tests will fail. That's because the output methods of XML::Atom and XML::RSS::LibXML (as_xml() and as_string() respectively) return a UTF-8 flagged string, regardless of what the input data was. So, if and only if the string consists entirely of characters in the latin-1 range (code points 255 and below), perl will print it in latin-1, unless we supply the encoding explicitly.

It doesn't happen if the string contains characters above 255 (perl then writes UTF-8 and warns about a wide character), which is quite annoying in terms of consistency.
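To see the inconsistency in isolation, here is a minimal sketch that has nothing to do with the XML modules. The helper name raw_bytes_written and the sample strings are made up for illustration. It prints two flagged strings to a filehandle with no encoding layer and dumps the bytes that actually landed in the file. The first string should come out with "á" as the single latin-1 byte E1, the second with "á" as the UTF-8 pair C3 A1.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print a string to a temp file that has no encoding layer,
# then read the file back as raw bytes.
sub raw_bytes_written {
    my ($string) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $string;
    }
    close $fh or die "close: $!";

    open my $in, '<:raw', $name or die "open: $!";
    local $/;                  # slurp mode
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Flagged, but every character is <= 255.
my $narrow = "Diction\x{e1}rios";
utf8::upgrade($narrow);

# Flagged, and contains one character above 255.
my $wide = "Diction\x{e1}rios \x{263a}";

my $narrow_bytes = raw_bytes_written($narrow);
my $wide_bytes   = raw_bytes_written($wide);

# Hex dump of what was really written in each case.
print "narrow: ", unpack("H*", $narrow_bytes), "\n";
print "wide:   ", unpack("H*", $wide_bytes), "\n";
```

Same flag, same print statement, two different encodings on disk, depending only on whether a wide character happens to be in the data.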

Obviously, the fix is to add:

binmode $fh, ":utf8";

before actually printing the XML to the file. Or use Encode::encode_utf8 or something equivalent. What makes things a bit worse is that the documentation (of XML::RSS::LibXML and XML::Atom) doesn't say whether the output data is UTF-8 flagged or not. Even worse, XML::RSS output may or may not be UTF-8 flagged, depending on the input.

As the author of XML::Atom, I was about to change the as_xml() implementation to force UTF-8 binary output, rather than a UTF-8 flagged string. But I hesitate to push the code now, since it *might* break backward compatibility. There could be some code out there that expects $feed->as_xml to return a Unicode string and opens the filehandle in utf8 mode. In that case, the users would get a double UTF-8 encoded string.

I chatted about this with Daisuke, the author of XML::RSS::LibXML, and we agreed it's all about documentation, or probably about adding another option to force UTF-8 binary output.
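As a rough sketch of the two fixes side by side (the $xml string is a stand-in for what as_xml() or as_string() would return, and slurp_raw is a hypothetical helper), both approaches should leave the same UTF-8 bytes in the file:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

# Read a file back as raw bytes.
sub slurp_raw {
    my ($name) = @_;
    open my $in, '<:raw', $name or die "open: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Stand-in for a flagged string whose characters are all <= 255.
my $xml = "<title>Diction\x{e1}rios</title>";
utf8::upgrade($xml);

# Fix 1: put a :utf8 layer on the handle and print the character string.
my ($fh1, $file1) = tempfile(UNLINK => 1);
binmode $fh1, ':utf8';
print {$fh1} $xml;
close $fh1 or die "close: $!";

# Fix 2: leave the handle raw and print explicitly encoded bytes.
my ($fh2, $file2) = tempfile(UNLINK => 1);
binmode $fh2;
print {$fh2} encode_utf8($xml);
close $fh2 or die "close: $!";

my $bytes1 = slurp_raw($file1);
my $bytes2 = slurp_raw($file2);

print "layer:  ", unpack("H*", $bytes1), "\n";
print "encode: ", unpack("H*", $bytes2), "\n";
```

Either one works; the point is to pick exactly one of them, which is only possible if you know what kind of string the module hands you.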
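The double encoding case can be sketched like this. It assumes a hypothetical as_xml() that already returns UTF-8 bytes (simulated here with encode_utf8) and caller code that still writes through a :utf8 layer, as it would have done for a Unicode string:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

my $title = "Diction\x{e1}rios";      # character string
my $bytes = encode_utf8($title);      # what a "binary" as_xml() would return

# Existing caller code that was written for a Unicode string.
my ($fh, $name) = tempfile(UNLINK => 1);
binmode $fh, ':utf8';
print {$fh} $bytes;                   # each byte gets encoded again
close $fh or die "close: $!";

open my $in, '<:raw', $name or die "open: $!";
my $on_disk = do { local $/; <$in> };
close $in;

print "encoded once: ", unpack("H*", $bytes), "\n";
print "on disk:      ", unpack("H*", $on_disk), "\n";
```

The layer treats the bytes C3 and A1 as two latin-1 characters and encodes each of them again, so the file ends up with C3 83 C2 A1 where a single "á" should be.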