Decoding multiply encoded UTF-8 in Perl or Ruby

jjore on 2010-01-07T03:57:25

I recently encountered some data that had been UTF-8 encoded five times over. Ugh. The key was to guess that if a byte in 0xC0-0xFF is followed by one in 0x80-0xBF, the pair is probably UTF-8. What follows is a function that uses that guess to deal with the data and turn it back into the right UTF-8.

use Test::More tests => 1;
use Encode ();

my $bad = "\xc3\x83\xc2\x83\xc3\x82\xc2\x83\xc3\x83\xc2\x82\xc3\x82\xc2\x83\xc3\x83\xc2\x83\xc3\x82\xc2\x82\xc3\x83\xc2\x82\xc3\x82\xc2\xa9";
my $good = "\xe9";
is( multiple_decode( $bad ), $good );

sub multiple_decode {
  my ( $str ) = @_;

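  # Treat the raw bytes as UTF-8 without validating them.
  Encode::_utf8_on( $str );
  while ( $str =~ /[\xc0-\xff][\x80-\xbf]/ ) {
    # Peel off one layer: turn the characters back into bytes,
    # then reinterpret those bytes as UTF-8 again.
    utf8::downgrade( $str );
    Encode::_utf8_on( $str );
  }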

  return $str;
}

I tried doing this in Ruby because I actually needed it for an EventMachine (http://rubyeventmachine.com) server, but I never quite got it working. Iconv seemed to want to be strict and rejected the input as invalid.

# Iconv::IllegalSequence: "\303\203\302\203\303\202\302\203\303\203\302\202\303\202\302\203"...

require 'test/unit'
require 'iconv'

class MDecode < Test::Unit::TestCase
  def test_multiple_decode
    conv = Iconv.new( 'UTF-8', 'ASCII' )

    bad = "\xc3\x83\xc2\x83\xc3\x82\xc2\x83\xc3\x83\xc2\x82\xc3\x82\xc2\x83\xc3\x83\xc2\x83\xc3\x82\xc2\x82\xc3\x83\xc2\x82\xc3\x82\xc2\xa9"
    good = "\xe9"

    assert_equal( good, conv.multiple_decode( bad ) )
  end
end

class Iconv
  def multiple_decode( str )
    while str =~ /[\xc0-\xff][\x80-\xbf]/
      str = iconv( str )
    end

    return str
  end
end

That code is painful to read

Aristotle on 2010-01-07T07:44:06

Why in Larry’s name are you fiddling with the UTF8 flag at all? And for what, turning the flag on and then downgrading the string? That’s pure obfuscation.

sub multiple_decode {
  my ( $str ) = @_;
  utf8::decode $str while $str =~ /[\xc0-\xff][\x80-\xbf]/;
  return $str;
}

Re:That code is painful to read

jjore on 2010-01-09T05:59:04

That doesn't work. I didn't know it was supposed to.

Re:That code is painful to read

jjore on 2010-01-09T06:04:41

Actually, that would apparently work if the value being decoded were fully valid, but it doesn't, because the input was an abuse of Unicode.

Re:That code is painful to read

Aristotle on 2010-01-09T10:34:17

It is indeed supposed to work.
So what does your data look like, then? Does it contain a mixture of encoding levels at once?

Re:That code is painful to read

jjore on 2010-01-10T22:36:05

Ah. OK, it works provided the value is valid UTF-X, Perl's more permissive variant of UTF-8. When I tried your snippet I copied the string from my original post, but the rendered blog post had a space inserted into the middle of the string, which made the value no longer valid UTF-X.
Also, utf8::decode returns a boolean indicating whether it did anything. Presumably this means your function should read as follows. This leaves the interpretation of the string up to Perl and also lets us stop once Perl says there's no more work to be done.
1 while utf8::decode( $str );

As a note, utf8::decode is implemented by sv_utf8_decode in sv.c, which gives up as soon as the string stops being valid UTF-X.
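
For the Ruby side of the question, here is a rough sketch of the same "let the decoder decide when to stop" idea. It is only an approximation of the Perl one-liner, not a port of it: the name decode_until_stable is made up for illustration, it uses String#force_encoding and String#encode rather than Iconv, and it assumes every layer was a Latin-1 to UTF-8 transcoding. Like the Perl loop, it will also happily undo one layer of genuine UTF-8, so it should only be pointed at data known to be over-encoded.

```ruby
# Hypothetical helper, assuming each layer was Latin-1 -> UTF-8.
def decode_until_stable(bytes)
  str = bytes.b # work on a binary copy of the input
  loop do
    candidate = str.dup.force_encoding(Encoding::UTF_8)
    # Stop when there is nothing left to decode (pure ASCII)
    # or when the bytes are no longer valid UTF-8.
    break if candidate.ascii_only? || !candidate.valid_encoding?
    begin
      # Undo one layer: UTF-8 characters back to Latin-1 bytes.
      str = candidate.encode(Encoding::ISO_8859_1).b
    rescue Encoding::UndefinedConversionError
      # A character above U+00FF cannot have come from Latin-1.
      break
    end
  end
  str
end

bad = "\xc3\x83\xc2\x83\xc3\x82\xc2\x83\xc3\x83\xc2\x82\xc3\x82\xc2\x83\xc3\x83\xc2\x83\xc3\x82\xc2\x82\xc3\x83\xc2\x82\xc3\x82\xc2\xa9"
p decode_until_stable(bad).bytes # => [233]
```

The result comes back as a binary string holding the single byte 0xE9, which matches the "\xe9" the tests above expect.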

Iconv

Hansen on 2010-01-07T12:25:19

Iconv can't treat your data as US-ASCII since it contains octets greater than 0x7F. Your double-encoded data has been transcoded from Latin-1 to UTF-8; in order to reverse it you need to transcode from UTF-8 to Latin-1.

Change:
Iconv.new( 'UTF-8', 'ASCII' )

To:
Iconv.new( 'LATIN1', 'UTF-8' )

and it should work just fine.

You can also narrow down your regexp to [\xc2-\xc3][\x80-\xbf], since UTF-8 encoded Latin-1 is within that range.

--
chansen
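
Iconv was dropped from Ruby's standard library in 2.0, so on a current Ruby the suggestion above might look something like the sketch below. It is an illustration rather than tested production code: the constant LATIN1_AS_UTF8 and the method body are invented here, String#encode stands in for Iconv, and the narrowed regexp from the comment above drives the loop.

```ruby
# Hypothetical names; String#encode stands in for Iconv here.
# /n makes the regexp binary so it matches raw octets.
LATIN1_AS_UTF8 = /[\xC2\xC3][\x80-\xBF]/n

def multiple_decode(bytes)
  str = bytes.b # binary copy, so the regexp sees octets
  while str.match?(LATIN1_AS_UTF8)
    begin
      # Transcode UTF-8 -> Latin-1 to strip one layer of encoding.
      str = str.dup.force_encoding(Encoding::UTF_8)
               .encode(Encoding::ISO_8859_1).b
    rescue EncodingError
      # Mixed or invalid input: stop and return what we have so far.
      break
    end
  end
  str
end

bad = "\xc3\x83\xc2\x83\xc3\x82\xc2\x83\xc3\x83\xc2\x82\xc3\x82\xc2\x83\xc3\x83\xc2\x83\xc3\x82\xc2\x82\xc3\x83\xc2\x82\xc3\x82\xc2\xa9"
p multiple_decode(bad).bytes # => [233]
```

The regexp only says "this looks like UTF-8 encoded Latin-1", so the rescue is there for strings where the pattern matches somewhere but the string as a whole does not transcode cleanly.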

Encode::Repair

bart on 2010-01-07T19:31:21

Earlier today, moritz uploaded a new module, Encode::Repair, to CPAN, which has fixing this kind of trouble in mind. But 5 times? That's a bit steep...