Repairing broken documents that mix UTF-8 and ISO-8859-1

Aristotle on 2006-04-06T19:54:02

A perpetual (if thankfully not too frequent) problem on the web is documents that claim to be encoded in either UTF-8 or ISO-8859-1 but contain characters encoded according to the other charset. Such documents will display incorrectly, regardless of which way you look at them. Worse, if the document in question is XML (such as, say, a newsfeed) and claims to be encoded in UTF-8, upset ensues: the XML parser will halt and catch fire as soon as it encounters the first invalid byte.

How does it know? It knows because UTF-8 has a very specific way of encoding non-ASCII characters: every non-ASCII codepoint becomes a multi-byte sequence with a distinctive bit pattern. Encoding non-ASCII characters according to ISO-8859-1 instead violates this scheme, so their presence is detectable with a very high degree of confidence.
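To make that concrete, here is a minimal sketch of such a check, using the croak-on-error fallback from Encode (FB_CROAK) inside an eval; the helper name and the sample strings (the same é once as proper UTF-8 and once as a bare Latin-1 0xE9) are just for illustration:

use strict;
use warnings;

use Encode qw( decode FB_CROAK );

# True if the bytes decode cleanly as strict UTF-8, false otherwise.
sub looks_like_utf8 {
	my ( $bytes ) = @_;
	return eval { decode( "utf-8", $bytes, FB_CROAK ); 1 } ? 1 : 0;
}

# "café" once as UTF-8 (C3 A9) and once as Latin-1 (a lone E9).
for my $sample ( "caf\xC3\xA9", "caf\xE9" ) {
	printf "%vX -> %s\n", $sample, looks_like_utf8( $sample ) ? "valid UTF-8" : "not UTF-8";
}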

Of course, this can just as well be used to good advantage. If you start with the working assumption that the primary encoding of a confusedly encoded document is UTF-8, and merely decode and re-encode the byte stream, you can salvage misencoded data by catching any character decoding errors and decoding the offending invalid bytes as ISO-8859-1.

Here’s a Perl script, cleverly called repair-utf8, which implements this approach:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

binmode STDIN, ':bytes';
binmode STDOUT, ':utf8';

my $out;

while( <> ) {
	$out = '';
	while( length ){
		$out .= decode( "utf-8", $_, FB_QUIET );
		$out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
	}
	print $out;
}

The only non-obvious bit to be aware of here is that in the FB_QUIET fallback mode, Encode removes any successfully processed data from the input buffer. The entire script revolves around this behaviour. After the first decode, $_ will be empty if it was decoded successfully. If not, the successfully decoded part at the start of $_ is returned, and $_ is truncated from the front up to the offending byte. The second decode is then free to process that byte; since the lvalue substr is passed to decode by alias, the consumed byte likewise disappears from $_. The inner loop thus keeps running as long as any undecoded input is left, decoding it, if need be, one byte at a time as ISO-8859-1.
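To see that behaviour in isolation, here is a tiny demonstration on a made-up byte string (UTF-8 text with one stray Latin-1 0xE9 dropped into the middle):

use strict;
use warnings;

use Encode qw( decode FB_QUIET );

my $buf = "na\xC3\xAFve caf\xE9 au lait";   # UTF-8 "naïve", then a Latin-1 "é"

my $decoded = decode( "utf-8", $buf, FB_QUIET );
print length( $decoded ), " characters decoded\n";   # 9 ("naïve caf")
printf "left in \$buf: %vX\n", $buf;                 # E9.20.61.75.20.6C.61.69.74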


or go for the jugular

jhi on 2006-04-07T16:16:44

s/(?<![\xC2\xC3])([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; # If chars > 0xFF extend appropriately.

Re:or go for the jugular

Aristotle on 2006-04-07T21:24:41

Yeah, that’s an option. Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.

It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bitfiddle this time (unlike that other time when I wrote codepoint-to-UTF-8 math in XPath within XSLT of all things). That code up there took 3 minutes to write once I found the right fallback in the Encode docs, and I know it’s correct.

But I might do it the hard way anyway at some other time.

Re:or go for the jugular

jhi on 2006-04-08T07:22:52

: Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good.

My regex replaces the high-bit-set bytes that cannot be the trailing byte of a validly UTF-8-encoded Latin-1 character with their own UTF-8 encoding. What is invalid and what is not depends on your definition: is this

0xC3 0xBF

meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF

LATIN SMALL LETTER Y WITH DIAERESIS

or as two characters

LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0xC3)

followed by

INVERTED QUESTION MARK (U+00BF == 0xBF)

In other words, which interpretation gets to go first.
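Encode will cheerfully hand you either reading, depending on which charset you name; a quick illustration:

use strict;
use warnings;

use Encode qw( decode );

my $bytes = "\xC3\xBF";
printf "as UTF-8:   %vX\n", decode( "utf-8", $bytes );        # FF      (one character, U+00FF)
printf "as Latin-1: %vX\n", decode( "iso-8859-1", $bytes );   # C3.BF   (two characters, U+00C3 U+00BF)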

Note that UTF-8 was purposefully engineered so that legal UTF-8 is highly unlikely (based on statistical sampling of existing legacy 8-bit texts) to be valid/sensible legacy 8-bit text.

Since your input data is essentially corrupt (it is full of invalid UTF-8 sequences), you are going to end up with a best-guess strategy whichever way you choose to go.

If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

: I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints.

Latin-1 codepoints 0x00..0xff do correspond to Unicode codepoints U+0000..U+00FF 1:1, 100%, completely, fully, without doubt.
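If you want to convince yourself, a quick check along these lines will do:

use strict;
use warnings;

use Encode qw( decode );

# Count the bytes whose Latin-1 decoding is not the codepoint of the same number.
my $mismatches = grep { ord( decode( "iso-8859-1", chr $_ ) ) != $_ } 0 .. 255;
print "$mismatches mismatches\n";   # 0 mismatches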

Re:or go for the jugular

Aristotle on 2006-04-08T12:48:26

If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all hard for users to type), so I must assume that my UTF-8 data will contain many more high-bit-set byte values than just 0xC2/0xC3 that are still part of valid multibyte sequences.
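Those all encode as three bytes with an 0xE2 lead byte, for instance (just to spell the byte patterns out):

use strict;
use warnings;

use Encode qw( encode );

printf "U+2019 -> %vX\n", encode( "utf-8", "\x{2019}" );   # E2.80.99 (right single quotation mark)
printf "U+2014 -> %vX\n", encode( "utf-8", "\x{2014}" );   # E2.80.94 (em dash)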

As for your other question:

is 0xC3 0xBF meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF or as two characters U+00C3 followed by U+00BF

It is to be interpreted as the UTF-8 encoding of U+00FF, as implied by my saying that the working assumption is that the primary encoding is UTF-8. In fact, this is the only way around that makes sense, because there are no invalid Latin-1-encoded sequences. You must assume UTF-8 so that you can actually have invalid sequences, which you can then conclude must be Latin-1 bytes.

That’s the entire point of my post, actually.

I’ve not tested the algorithm very widely yet, but so far it has been very accurate with whatever data I’ve thrown at it.

Note that UTF-8 was purposefully engineered so that legal UTF-8 is highly unlikely (based on statistical sampling of existing legacy 8-bit texts) to be valid/sensible legacy 8-bit text.

Oh, I know. It’s a marvel of design. Variable-width encodings suffer some inherent suckage, but UTF-8 pays this price in return for huge gains. It’s astonishingly clever and beautiful.