"You dropped a BOM on me, baybay!"

TorgoX on 2002-03-13T08:08:31

Dear Log,

When write-opening to a Unicode file, it is behoovy to then emit a byte-order mark before anything else:

print OUT "\x{feff}"; # Byte Order Mark

It even works happily with UTF8 files -- many applications correctly interpret the resulting byte sequence as meaning "YES, THIS IS UTF8!".
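A minimal sketch of the idea, assuming a Perl with PerlIO layers (5.8 or later) and an explicit :encoding(UTF-8) layer on the handle instead of relying on the default output behaviour. The file name and the helper bom_bytes_on_disk are made up for the illustration; the helper writes the BOM first, then reads the raw bytes back to show what landed on disk.

```perl
use strict;
use warnings;

# Write a BOM plus some text through an explicit UTF-8 layer, then
# return the first three bytes on disk as a hex string.
# (Hypothetical helper, for illustration only.)
sub bom_bytes_on_disk {
    my ($path) = @_;

    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$out} "\x{FEFF}";     # byte-order mark, before anything else
    print {$out} "Dear Log,\n";  # the actual content follows
    close($out) or die "Cannot close $path: $!";

    # Reopen without any decoding so we see the bytes, not characters.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    read($in, my $head, 3);
    close($in);

    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $head;
}

my $path = 'bom-demo.txt';              # hypothetical file name
print bom_bytes_on_disk($path), "\n";   # prints "ef bb bf"
unlink $path;
```

Naming the layer explicitly keeps the result from depending on how a given Perl treats wide characters on a plain handle.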


Something fishy

Matts on 2002-03-13T10:49:02

I don't see how that would work on UTF-8 files, since the UTF-8 BOM is 0xEF 0xBB 0xBF (very rarely used since it tends to muck things up somewhat). The ONLY thing 0xFE 0xFF can indicate is network byte order (big-endian) UTF-16.

Re:Something fishy

TorgoX on 2002-03-13T21:33:17

Note that I have \x{feff} and not \xfe\xff. \x{feff} means "character FEFF", which usually (i.e., until we have more and varied handle disciplines) gets expressed as UTF8.

> perl -e "print map sprintf(q{%02x },$_), unpack q{C*}, qq{\x{feff}}"
ef bb bf
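The same point can be sketched with the Encode module, which makes the encoding step explicit rather than leaning on how unpack treats a wide-character string (that may not behave identically on later Perl releases). The helper name bom_hex is made up for the illustration; it encodes the single character U+FEFF and prints the resulting bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of character U+FEFF in the given encoding,
# formatted as space-separated hex. (Hypothetical helper.)
sub bom_hex {
    my ($encoding) = @_;
    my $char  = "\x{FEFF}";                 # one character, not two bytes
    my $bytes = encode($encoding, $char);   # character -> byte string
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

print "$_: ", bom_hex($_), "\n" for qw(UTF-8 UTF-16BE UTF-16LE);
# UTF-8: ef bb bf
# UTF-16BE: fe ff
# UTF-16LE: ff fe
```

One character, three different byte sequences: which one reaches the file depends only on the encoding applied on the way out.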

Re:Something fishy

Matts on 2002-03-13T23:58:42

Ah! You caught me napping. Nice one ;-)

Re:Something fishy

TorgoX on 2002-03-14T04:56:28

BTW, when does a UTF8 DOM screw things up?

Re:Something fishy

Matts on 2002-03-14T10:00:35

Never. But a UTF-8 B.O.M. screws things up ;-). For example, on POSIX systems it screws up the shebang line, and it also screws up the file(1) command's magic-number type detection. This is according to the UTF-8/Unicode on Linux FAQ.
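To make the shebang problem concrete: the "#!" has to be the very first two bytes of the file, and a UTF-8 BOM puts three other bytes in front of it. A minimal sketch, with a made-up helper strip_utf8_bom and an in-memory byte string standing in for a script file:

```perl
use strict;
use warnings;

# Remove a leading UTF-8 BOM (EF BB BF) from a byte string, if present.
# (Hypothetical helper, for illustration only.)
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Stand-in for the raw contents of a script saved with a BOM.
my $script = "\xEF\xBB\xBF#!/usr/bin/perl\nprint qq{hi\\n};\n";
my $clean  = strip_utf8_bom($script);

for my $text ($script, $clean) {
    print substr($text, 0, 2) eq '#!' ? "shebang first\n" : "BOM first\n";
}
# BOM first
# shebang first
```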

"feff"

gnat on 2002-03-13T14:52:27

It sounds like it should be a euphemism for a swearword.

"You feffin' kids take your feffin' skateboards and your feffin' boomboxes somewhere else before I rip your feffin' heads open and take a feffin' coredump in the 0xdeadbeef inside!"

--Nat