Encoding woes

gav on 2004-08-27T17:03:42

Somehow I ended up with a string containing &#147;Foo&#148; in a database (those are windows-1252 smart quotes). This then ended up in an XML file that had a declaration of <?xml version="1.0" encoding="UTF-8"?> but was being served with an HTTP Content-Type header of "text/xml; charset=iso-8859-1" due to a misunderstanding with CGI::Simple.
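
For what it's worth, the root of the confusion is that the same two bytes mean three different things under the three encodings in play. A quick sketch with Perl's Encode module (not code from the site, just an illustration):

    use Encode qw(decode);

    my $bytes = "\x93Foo\x94";    # the raw windows-1252 bytes sitting in the database

    # windows-1252 says they are smart quotes: U+201C U+0046 U+006F U+006F U+201D
    print join( ' ', map { sprintf 'U+%04X', ord } split //, decode( 'cp1252', $bytes ) ), "\n";

    # iso-8859-1 says they are the C1 control characters U+0093 and U+0094
    print join( ' ', map { sprintf 'U+%04X', ord } split //, decode( 'iso-8859-1', $bytes ) ), "\n";

    # and they are not valid UTF-8 at all: decode('UTF-8', $bytes) substitutes U+FFFD
    # by default, or dies if you pass Encode::FB_CROAK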

Strangely enough, it seemed to work in both Firefox and Internet Explorer. Firefox showed the smart quotes, while IE showed the empty squares that denote some kind of bad character. The problem appeared when I saved the XML to a file and re-opened it. IE now pointed out that the broken XML was actually broken, but Firefox still seemed happy. Firefox had saved the file without the declaration and turned the broken characters into &#8220; and &#8221;. IE had decoded the characters as windows-1252 and saved them, so the UTF-8 declaration caused an error.

Using some code along the lines of Jacques Distler's StripControlChars MT Plugin, I converted the characters to proper UTF-8, fixed the header, and everybody was happy.
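
The fix boiled down to something like the sketch below (not the plugin's actual code, and the header line assumes CGI::Simple's CGI.pm-compatible header() interface):

    use CGI::Simple;
    use Encode qw(decode);

    my $cgi = CGI::Simple->new;

    # Map stray C1 code points (U+0080 to U+009F) back to the windows-1252
    # characters they were almost certainly meant to be
    sub fix_cp1252_leakage {
        my ($text) = @_;
        $text =~ s/([\x80-\x9F])/decode( 'cp1252', pack( 'C', ord $1 ) )/ge;
        return $text;
    }

    # ...and make the declared charset match what is actually being sent
    print $cgi->header( -type => 'text/xml', -charset => 'UTF-8' );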

It seems that even though Firefox is trying to do the right thing, it's broken. The whole problem was caused by a bunch of separate broken things all trying their best to work.


Character Set

Dom2 on 2004-08-28T09:23:18

Mark Pilgrim wrote an essay about getting the character set correct for XML over HTTP. Unfortunately, even though XML makes dealing with character sets a bit more explicit, it still has enough areas of pain to be a bother. Particularly when you find out things like: every character in an XML document represents a Unicode code point regardless of the source input encoding, except that some code points are explicitly discouraged or restricted, including U+0080 to U+009F, which is exactly what you're looking at. Gah.
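
And that's the nasty bit about numeric character references: &#147; always means the code point U+0093, no matter what encoding the document declares or how the bytes got into it. A small illustration (using XML::LibXML, purely as an example):

    use XML::LibXML;

    # The reference &#147; resolves to U+0093, a C1 control character,
    # not to the windows-1252 left smart quote U+201C
    my $doc = XML::LibXML->new->parse_string(
        qq{<?xml version="1.0" encoding="UTF-8"?><p>&#147;Foo&#148;</p>}
    );
    my $text = $doc->documentElement->textContent;
    printf "first character is U+%04X\n", ord $text;    # prints U+0093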

-Dom