Figuring out if text is UTF8

Lecar_red on 2005-02-25T20:46:38

Well, for the last couple of days I've struggled to figure out how to get Perl to tell me whether the string inside a scalar is actually UTF8 or something else.

The first thing I tried was the built-in 'utf8::valid' function. Well, according to it, everything (including values I knew were Shift JIS) was valid UTF8. Later, I found (in some very useful documentation) that it only tells you about how Perl is storing the string internally, not whether the value is actually UTF8. But thanks to a very nice entry in the perluniintro page, I learned that you can figure out if something is UTF8 by simply decoding it. If the decode fails, then the value is not UTF8. The Encode module is useful for that.
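A minimal sketch of the decode-and-see check, assuming the scalar holds raw bytes. The helper name looks_like_utf8 and the sample values are made up for illustration. One detail worth knowing: Encode's decode does not fail by default (it substitutes a replacement character for malformed input), so the sketch passes Encode::FB_CROAK to make a bad sequence die, and traps that with eval.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the raw bytes decode cleanly as UTF-8, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when a CHECK value is given
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8_bytes = "\xE6\x97\xA5\xE6\x9C\xAC";    # "Nihon" (two kanji) encoded as UTF-8
my $sjis_bytes = "\x93\xFA\x96\x7B";            # the same two kanji encoded as Shift JIS

print looks_like_utf8($utf8_bytes) ? "utf8\n" : "not utf8\n";    # prints "utf8"
print looks_like_utf8($sjis_bytes) ? "utf8\n" : "not utf8\n";    # prints "not utf8"
```

Keep in mind that plain ASCII always passes this check, and a byte string in another encoding can occasionally happen to be well-formed UTF8 too, so a pass means "could be UTF8" rather than proof.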

One other bit I've learned working with UTF8, Shift JIS and other character encodings: it pays to keep test values as URI-escaped (or HTML-escaped) strings. You can unescape them before your test script (or main application code) messes with the string, then escape them again on output to prevent problems with older (or basic) terminals (xterm, my Red Hat 7.2 machine, etc.). Much better than having to pipe the output to less or xod. It also makes it easy to grab an HTML-escaped value from a logfile and pass it as a command-line argument to your test script (which unescapes it).
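Here is a rough sketch of that escape/unescape round trip using only core Perl. The helper names unescape_bytes and escape_bytes are hypothetical; the URI::Escape module on CPAN offers uri_unescape and uri_escape for the same job. The sketch assumes the values are byte strings, not decoded character strings.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %XX sequences back into raw bytes.
sub unescape_bytes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Hypothetical helper: percent-escape every byte outside the URI "unreserved" set.
sub escape_bytes {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

# Take the escaped test value from the command line, or fall back to a UTF-8 sample.
my $escaped = @ARGV ? $ARGV[0] : '%E6%97%A5%E6%9C%AC';
my $raw     = unescape_bytes($escaped);    # raw bytes, ready for the code under test

# ... hand $raw to whatever is being tested ...

print escape_bytes($raw), "\n";    # ASCII-only output, readable on any terminal
```

Saved under a name of your choosing, the script takes an escaped value copied from a logfile as its first argument and prints the escaped form back out.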

Just a couple thoughts for the end of the week.

oops... I meant Perl ;)

It's Perl (or perl) not PERL, darnit

merlyn on 2005-02-25T22:04:38

No text needed here. See the subject line.