Lately I've been working on a script to parse and classify mails that come in after a bulk mail has been sent, most of them in the form of bounces. Of the roughly 2000 mails, 2 had corrupt headers. Guess where they both originated from? Oh, yeah, I already told you in the post title: hotmail.com.
The problem with these two mails is that in the middle of the mail headers, there's a blank line, followed by a line that starts with 3 garbage characters and then "From: ". Apart from those 3 garbage characters, it's the real "From:" line.
You would expect that a huge company under the umbrella would be capable of getting their stuff right. I think it's quite typical that they don't. Can't. Won't.
Can anybody explain what the origin of this garbage could be? I have no idea. In Perl, you can match it with /\357\273\277/.
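For what it's worth, here is a minimal sketch of how a parser might work around such mails, assuming the raw message is available as a byte string. The function name repair_bom_split_headers is made up for illustration, and the pattern only covers the exact damage described above: one stray blank line followed by the three garbage bytes and something that looks like a header field name.

```perl
use strict;
use warnings;

# Hypothetical helper: rejoin a header block that was split by a stray
# blank line followed by the bytes \357\273\277 (written in hex below).
# Assumes $raw is a byte string with "\n" or "\r\n" line endings.
sub repair_bom_split_headers {
    my ($raw) = @_;

    # Keep the line ending of the preceding header line ($1), drop the
    # blank line and the three garbage bytes, but only when what follows
    # looks like a header field name (printable ASCII without ":")
    # followed by a colon. Only the first such spot is repaired.
    $raw =~ s/(\r?\n)\r?\n\xEF\xBB\xBF(?=[!-9;-~]+:)/$1/;

    return $raw;
}

# Small demonstration with a made-up message.
my $garbage = "\xEF\xBB\xBF";
my $mail    = "To: me\@example.org\n\n${garbage}From: you\@example.com\n"
            . "Subject: bounce\n\nbody text\n";
print repair_bom_split_headers($mail);
```

A message without this particular damage passes through unchanged, so it should be safe to run the function over every incoming mail before handing it to the real header parser.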
Rewriting in hex notation makes those bytes a little more familiar:
$ perl -we 'printf "%02x %02x %02x\n", 0357, 0273, 0277'
ef bb bf
If that’s still not familiar, Encode might help:
$ perl -MEncode -we 'printf "U+%04X\n", ord decode_utf8("\xef\xbb\xbf")'
U+FEFF
And what’s U+FEFF?
$ perl -MUnicode::UCD=charinfo -lwe 'print charinfo(0xFEFF)->{name}'
ZERO WIDTH NO-BREAK SPACE
It’s a zero-width non-breaking space, also known as a “byte-order mark”. At the start of a document, a zero-width non-breaking space has no visual effect, so it was originally intended to allow programmatic distinction of little-endian and big-endian 16-bit encodings of Unicode. (There’s guaranteed to be no Unicode character with the codepoint U+FFFE, so it’s safe to use it in that way.) Eventually the same technique got applied to UTF-8, despite the fact that it doesn’t typically provide any benefits under UTF-8, and is often actively harmful.
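To make the byte-order part concrete, here is a small sketch that encodes U+FEFF under the two 16-bit byte orders and under UTF-8. It uses only encode from the core Encode module; the helper name bom_hex is mine, not anything standard.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode U+FEFF in the given encoding and return its bytes as
# space-separated lowercase hex.
sub bom_hex {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "\x{FEFF}");
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Print one line per encoding: its name, then the bytes of U+FEFF.
printf "%-9s %s\n", $_, bom_hex($_) for qw(UTF-16BE UTF-16LE UTF-8);
```

The two 16-bit variants come out as fe ff and ff fe, which is what lets a reader tell the byte orders apart. UTF-8 always gives ef bb bf, because UTF-8 has no byte order to distinguish.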
So it seems that Hotmail’s server sometimes generates bounces that are inappropriately in a non-US-ASCII encoding and that also inappropriately begin with a byte-order mark. This is what would be technically described as a bug.
The Wikipedia article on byte-order marks may be helpful.
Re:Origin of those bytes
Aristotle on 2007-05-15T03:07:44
Pah, you beat me to the punch. :-)
Re:Origin of those bytes
bart on 2007-05-19T10:10:48
Sheesh, a UTF8 BOM marker, I'd never have thought in that direction.
"despite the fact that it doesn’t typically provide any benefits under UTF-8"
Well, it's a marker that the following text is in UTF8. So it may be somewhat useful, though of limited use.
Of course, it is completely out of place in mail headers.
But I'm still wondering about the extra newline in front of the mangled From header. Did Hotmail put it there, or did an intermediate SMTP server see the mangled header and separate it from the other, properly formed headers above it?