HTTP::Response::Charset

miyagawa on 2006-10-07T21:22:43

http://svn.bulknews.net/repos/public/HTTP-Response-Charset/trunk

So I created a module, HTTP::Response::Charset, which detects the charset of an HTTP response using various techniques (Content-Type, META tag, BOM, XML declaration and Encode::Detect). The motivation is to get a correctly decoded Unicode string from any HTTP response, especially text/html, text/plain, XHTML and RSS/Atom.

The POD documentation has most of what I'd like to say, so go ahead and take a look at it. Also see the unit test suite, which uses Test::Base, for the expected behavior.

After I wrote this code I searched Google Code Search for a while and found that HTTP::Message, the base class of HTTP::Response, already has a decoded_content() method, which is exactly what I wanted to implement using the charset() value. Ugh.

Fortunately or unfortunately, the current decoded_content is slightly different from what I wanted to do. decoded_content first decodes the content body based on the Content-Encoding header, such as gzip, deflate or quoted-printable. Then, if the Content-Type is text/*, it decodes the content using the charset value set in the header (or the META tag if the response is HTML).

This might not be good enough for some corner cases:

1) If there's no charset= in either the Content-Type header or a META tag, it decodes the text as latin-1 by default and gives corrupted Unicode data. (You can avoid that by saying $res->decoded_content(charset => 'none'), though.)

2) It does Unicode decoding only for text/* responses, which means it doesn't decode a response that is application/xhtml+xml or application/atom+xml.

(Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally.)

Update: per Matts this is a bug. Should I send a patch to Gisle to decode when Content-Type matches application/(*+)xml?
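To make corner case 1) concrete, here is a minimal sketch using only the core Encode module. The sample bytes are an assumption chosen for illustration; they stand for a UTF-8 body that arrives without any charset information.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "café", as they might arrive in a response body
# that carries no charset parameter (illustrative data).
my $bytes = "caf\xC3\xA9";

# The latin-1 default treats every octet as one character, so the
# two-octet UTF-8 sequence turns into two separate characters.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Decoding with the charset the body really uses gives the intended text.
my $as_utf8 = decode('UTF-8', $bytes);

printf "latin-1: %d characters\n", length $as_latin1;
printf "UTF-8:   %d characters\n", length $as_utf8;
```

The first result is the "corrupted Unicode data" mentioned above: a valid character string, just not the one the sender meant.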

So I hope this module fills the gap. For 1), you can pass

$res->decoded_content(charset => $res->charset)


to deal with an HTTP response that has no charset set. For 2), you can just say

Encode::decode($res->charset, $res->content);


for whatever MIME types you'd like to decode.

However, to correctly decode XML data that is gzip-encoded and has a BOM, for instance, you need to write it this way:

my $content = Encode::decode($res->charset, $res->decoded_content(charset => 'none'));


which looks a little kludgy.
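One way the two-step decode could be wrapped in a convenience function is sketched below. The name decode_body and the My::FakeResponse class are hypothetical and exist only so the example runs without LWP installed; a real response object is assumed to provide charset and decoded_content as described above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of any module: undo Content-Encoding
# without touching the charset, then decode the charset separately.
sub decode_body {
    my ($res) = @_;
    my $bytes = $res->decoded_content(charset => 'none');
    return unless defined $bytes;
    my $charset = $res->charset or return;
    return Encode::decode($charset, $bytes);
}

# Stand-in for an HTTP::Response (assumption, for illustration only).
{
    package My::FakeResponse;
    sub new             { my ($class, %args) = @_; return bless {%args}, $class }
    sub charset         { $_[0]{charset} }
    sub decoded_content { $_[0]{bytes} }
}

my $res  = My::FakeResponse->new(charset => 'UTF-8', bytes => "caf\xC3\xA9");
my $text = decode_body($res);
print length($text), " characters\n";
```

The helper returns nothing when no charset can be determined, so the caller can fall back to the raw bytes instead of guessing.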

I'm not sure whether it's better to hack (or extend) the decoded_content method, or to add another convenience method that does the right thing.

Any feedback is welcome.


No, it is a bug...

Matts on 2006-10-07T21:29:05

(Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally.)

This is a bug - for XML you're supposed to be able to declare the encoding in the protocol. It's part of the spec (but I'm too lazy to look it up right now).

Re:No, it is a bug...

miyagawa on 2006-10-07T21:40:39

Good to know and it'll be a good rationale for the module. Yeah, reading http://tools.ietf.org/html/rfc2376#page-10 assures me that the charset parameter is important for both text/xml and application/xml, and XML/MIME parsers should respect that.

Thanks!

Re:No, it is a bug...

Matts on 2006-10-07T22:53:24

Yeah, it's one of the corners of the XML spec where I'm never quite sure they did the right thing, and I doubt many XML processing systems respect it, but that's what we have to live with :-)

Augh META

Aristotle on 2006-10-07T22:29:36

The META tag is a nasty hack. Originally, servers were supposed to parse the outgoing document, insert the given headers into the HTTP header, and drop the tag on the floor (yes, really). Instead, clients now parse the body and then retroactively pretend the META tags had been part of the HTTP header. That leads to various kinds of nastiness. The whole thing is ugly and nasty and painful.

Please only respect it when found in text/html content – in application/xhtml+xml you should never ever look at it.

Re:Augh META

miyagawa on 2006-10-08T03:14:57

I agree. In application/xhtml+xml we should just look at the BOM or the XML encoding declaration, then. Changing that would be a one-line fix. Thoughts?

Re:Augh META

Aristotle on 2006-10-08T04:03:09

In short, the rules for XML MIME types (m{application/(.*\+)?xml}) are that if the HTTP header says anything, it is authoritative; otherwise, the XML parser gets to decide from the byte stream. If you are planning to pre-decode XML content as a courtesy for people who may want to do something other than pass it to an XML parser, you should read the XML spec; it has a clear outline of the algorithm an XML parser uses to detect the encoding.
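A rough sketch of that precedence, under stated assumptions: the function name xml_charset is made up, and the byte-stream step is a simplification of the detection outline in the XML spec (it only handles a BOM or an encoding declaration in an ASCII-compatible encoding, not BOM-less UTF-16, UTF-32 or EBCDIC).

```perl
use strict;
use warnings;

# Hypothetical helper: choose the charset for an XML response.
# Returns nothing for non-XML MIME types.
sub xml_charset {
    my ($content_type, $bytes) = @_;

    # Only XML MIME types, using the pattern quoted above.
    my ($mime) = $content_type =~ m{^\s*([^;\s]+)};
    return unless defined $mime && $mime =~ m{^application/(.*\+)?xml$}i;

    # 1. A charset parameter in the HTTP header is authoritative.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i) {
        return $1;
    }

    # 2. Otherwise look at the byte stream: BOM first, then the declaration.
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    if ($bytes =~ /^<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/) {
        return $1;
    }

    # 3. No BOM and no declaration: XML falls back to UTF-8.
    return 'UTF-8';
}

print xml_charset('application/atom+xml; charset=ISO-8859-1',
                  '<?xml version="1.0" encoding="UTF-8"?><feed/>'), "\n";
```

Note that the header wins in the printed example even though the declaration inside the document says something else.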

But if you do that, be aware that XML parsers will want to decode the stream on their own anyway, so to avoid double-decoding bugs you should remove the preamble on decoding, and you should probably also make the raw byte stream available.

Oh, and btw – defaulting to Latin1 is actually correct by the letter of the RFC for text/* MIME types. This is the major reason why text/xml is bad: if the HTTP header is silent, then an XML document served as text/xml must be decoded as Latin1 regardless of any present preamble. That’s obviously crazy, so XML should always be served with application/* instead.

Re:Augh META

miyagawa on 2006-10-08T08:39:40

To be clear, what this module does is nothing new. This is part of the CPANization of our Plagger code that deals with real-world feeds and HTML content. This code has been in my daily use against thousands of feeds and has been doing pretty well.

And I removed the META detection when the MIME type is application/xhtml+xml. Actually I could remove the entire META detection code, since it's already done in LWP::UserAgent (and LWP::Protocol) unless you call $ua->parse_head(0) explicitly.

If you are planning to pre-decode XML content as a courtesy for people who may want to do something other than pass it to an XML parser, you should read the XML spec

Yeah, the main motivation for this module is to properly decode (X)HTML content on the web to Unicode before passing it to HTML::Parser, HTML::TreeBuilder or some regular expression based code. They all require the content to be decoded before reading, so that we can use Unicode characters properly. (The HTML::Parser documentation clearly recommends that, and otherwise it gives you a warning: "Parsing of undecoded UTF-8 will give garbage when decoding entities")

But if you do that, be aware that XML parsers will want to decode the stream on their own anyway, so to avoid double-decoding bugs you should remove the preamble on decoding, and you should probably also make the raw byte stream available.

Sure, $res->content is always there to return the raw byte stream, and we don't change that behavior at all. As far as I have tested during Plagger development, the XML and YAML parsers we use (XML::LibXML, XML::Parser and YAML::Syck) are aware of the UTF-8 flag of the body and do the right thing.

Oh, and btw – defaulting to Latin1 is actually correct by the letter of the RFC for text/* MIME types.

http://tools.ietf.org/html/rfc2376#page-11 says it should default to "us-ascii", not "latin-1". Well, I'm not surprised, since latin-1 is a superset of us-ascii.

Re:Augh META

Aristotle on 2006-10-09T15:24:14

Oh. I was referring to RFC 2616 (HTTP/1.1) Section 3.7.1, which says:

The “charset” parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the “text” type are defined to have a default charset value of “ISO-8859-1” when received via HTTP. Data in character sets other than “ISO-8859-1” or its subsets MUST be labeled with an appropriate charset value.

So RFC 3023 (which obsoletes 2376, and will in turn one day be obsoleted by draft-murata-kohn-lilley-xml) specifies us-ascii explicitly for text/xml, in spite of RFC 2616.
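The two defaults discussed in this sub-thread can be written down as a small lookup. This is only a sketch of the rules as quoted here; the function name default_charset_for is made up, and it returns nothing for types where neither RFC passage supplies a default.

```perl
use strict;
use warnings;

# Hypothetical helper: default charset when the HTTP header has no
# charset parameter, following the RFC passages cited in this thread.
sub default_charset_for {
    my ($mime) = @_;
    $mime = lc $mime;
    return 'us-ascii'   if $mime eq 'text/xml';   # RFC 3023 (and RFC 2376 before it)
    return 'ISO-8859-1' if $mime =~ m{^text/};    # RFC 2616, section 3.7.1
    return;                                       # no default from these passages
}

for my $type (qw(text/xml text/html)) {
    print "$type => ", default_charset_for($type), "\n";
}
```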

Ack. There’s so much crud to remember about charsets on the web. No wonder there’s so much breakage.

XML is not text

Juerd on 2006-10-08T23:02:32

XML is not text, it's binary data that looks like text. It needs to live in a binary string, not a text string.

While this binary data does have text in it, not all of it actually is. The text in the binary data is encoded, and the character set for this has to be given. Indeed, with the charset attribute in the Content-Type, or in the <?xml?> declaration.

You need to know the right encoding, and when it's not in the document itself, you need to pass it to your XML parser, so it knows how to handle the document.

BUT...

If you decode() the data, then you're in trouble. The string returned by Perl's decode() is a Perl text string, which is a unicode string, but not a UTF-8 string, not an ISO-8859-1 string, not a Windows-1252 string, etcetera. It no longer uses "encoding" semantics. A parser that doesn't know Perl can no longer correctly parse the XML file: it requires an encoding per spec, but the document is not encoded. And any parser that does know about Perl's text strings wouldn't really be XML compliant, because XML compliancy involves handling character encodings.

Because XML is most often UTF-8 encoded, and Perl's internal encoding *happens to be* UTF-8 at the moment, such bugs can easily go unnoticed for quite a long time.
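The difference between octets and a decoded text string can be seen with a few lines of core Perl. This is just an illustration with a made-up sample value (the euro sign), not code from the module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xE2\x82\xAC";            # the euro sign, UTF-8 encoded: 3 octets
my $text  = decode('UTF-8', $bytes);   # a Perl text string: 1 character

print length($bytes), " octets, ", length($text), " character\n";

# Going back to octets is an explicit step, and the result depends on
# the encoding you ask for. The text string itself carries no encoding.
my $utf16 = encode('UTF-16BE', $text); # 2 octets: 0x20 0xAC
my $again = encode('UTF-8',    $text); # the original 3 octets
```

A parser that expects encoded octets has to be given one of the encode() results (or the original bytes), not $text.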

You could depend on Perl's internal behaviour. This is wrong, but reliable because it won't change any time soon anyway. If you do hack at it like this, you will create other problems that you will have to hack around. For example, you have to remove charset information from the <?xml?> declaration, or replace it with UTF-8. But your instinct should tell you that modifying the document before parsing it for real is incredibly bad.

So, ignore that XML looks like text, and do NOT decode it yourself. Let the XML parser do this. I suggest that for your module, you write a method that returns a list of charset and undecoded body (or use two separate methods) and document that this is the recommended way of treating XML.

See also perlunitut and http://juerd.nl/perluniadvice

Sorry for being so brief and perhaps harsh. I hope we can quickly establish *good* examples and consistent user education, so I don't have to explain the same thing three times a day :)

Re:XML is not text

miyagawa on 2006-10-09T04:22:24

If you decode() the data, then you're in trouble. The string returned by Perl's decode() is a Perl text string, which is a unicode string, but not a UTF-8 string, not an ISO-8859-1 string, not a Windows-1252 string, etcetera.

Sure, I know. I maintain XML::Atom, some Encode modules and do I18N stuff every day with Perl 5.8 server side :)

And any parser that does know about Perl's text strings wouldn't really be XML compliant, because XML compliancy involves handling character encodings.

Agreed. As I said in the other comment, I'm not planning to pass a Unicode-decoded string to an XML parser anyway, but I would like to pass it to HTML::Parser (XHTML) and some regular expression based code that requires a Unicode string.

I suggest that for your module, you write a method that returns a list of charset and undecoded body (or use two separate methods) and document that this is the recommended way of treating XML.

The undecoded body is always available as $res->content, which this module doesn't touch anyway, but I could add a note to the module's POD documentation. (Thanks)

Read http://use.perl.org/comments.pl?sid=33258&cid=50882 for more about the motivation and the goal of this module (if you haven't).

Re:XML is not text

Aristotle on 2006-10-09T15:38:18

But your instinct should tell you that modifying the document before parsing it for real is incredibly bad.

Oh? Why?

A major point of the preamble as specified is that intermediaries that need to mung the content are able to safely and reliably change the preamble to reflect their modifications. (F.ex., you might want to send XML over a transport that isn’t 8-bit-safe, in which case you can transcode the document to UTF-7 and not have to parse and re-emit it as US-ASCII with entities.)

Other than that, though, I completely agree with what you’re saying.

Re:XML is not text

Juerd on 2006-10-09T22:27:33

As I said, it's about instincts. So defining "why" will be hard ;)

But I do think it can be summarized, so that I can avoid the explicit why-question: to modify correctly, you need to parse. You can't parse before parsing, it's an infinite loop.

Re:XML is not text

Aristotle on 2006-10-09T22:55:50

An XML parser has to solve the same quandary: before it can parse the document, it has to decode it, but to decode it, it needs to know the encoding, which is specified in the preamble. Oops.

Well, no. The preamble is actually a highly restricted protocol that you can implement without having to parse more of the document. Read the XML spec – there is a very clear outline of the exact layout of the preamble in terms of the actual possible octet sequences. (It’s impressive how many niggles they covered in order to solve this problem completely.)

It follows that you can safely mangle it without having to parse the document. That is the whole point of the preamble.
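As a rough illustration of that point, here is a sketch that transcodes a document to UTF-8 and rewrites only the encoding pseudo-attribute of the declaration, without parsing anything else. The function name xml_to_utf8 is hypothetical, the source encoding is assumed to be known already, and the pattern is a simplification of the declaration grammar in the XML spec.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: transcode XML octets from a known encoding to
# UTF-8 and update the declaration to match.
sub xml_to_utf8 {
    my ($bytes, $from) = @_;
    my $text = decode($from, $bytes);

    # Drop a leading BOM character if the decoder left one in place.
    $text =~ s/\A\x{FEFF}//;

    # Touch only the encoding pseudo-attribute of a leading declaration.
    $text =~ s/\A(<\?xml\s[^>]*?\bencoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2/$1${2}UTF-8$2/;

    return encode('UTF-8', $text);
}

my $latin1 = qq{<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\xE9</a>};
my $utf8   = xml_to_utf8($latin1, 'ISO-8859-1');
print $utf8, "\n";
```

A document without a declaration passes through with only its octets transcoded, which is still correct because UTF-8 is what XML assumes in that case.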

Re:XML is not text

Juerd on 2006-10-09T23:23:49

Then again the XML specification and my instincts are in heavy disagreement. But this time it works out pretty well. Neat and interesting approach of things, but of course this is yet another XML thing that's nice in theory but hard to implement correctly.

I think I'll just keep passing XML around as I get it, without modifying it before it reaches the parser.

So far, I haven't used XML with non-XML parsers.

Re:XML is not text

Aristotle on 2006-10-10T00:10:04

Yeah, I wasn’t saying you should transcode the document (usually you shouldn’t) – just that if you find yourself needing to, you can do it.