Perl/UTF Madness

jk2addict on 2005-03-20T22:53:57

OK, this one has me stumped. I have a solution, but I want to know why I have to use it.

I've got a simple little method that returns the output from Locale::Currency::Format::currency_symbol:

    sub symbol {
        my ($code, $options) = @_;

        $code    ||= 'USD';
        $options ||= 'SYM_UTF';

        # Turn the constant's name into its value
        eval '$options = ' . $options;

        return currency_symbol($code, $options);
    }

The output of this method ends up in AxKit. Let's assume I ask for the JPY symbol (yen). Under perl 5.6.1, I get the expected symbol, with no 'use utf8' in my module or in L::C::F. Everyone is happy.

Under 5.8.4, however, all I get is a stinking ?. After some tinkering, this fix makes the yen symbol show up under 5.8.4 too:

    use utf8;
    ...
    sub symbol {
        my ($code, $options) = @_;

        $code    ||= 'USD';
        $options ||= 'SYM_UTF';

        eval '$options = ' . $options;

        my $symbol = currency_symbol($code, $options);
        utf8::upgrade($symbol);

        return $symbol;
    }

Now, my question is for someone who is intimate with the Perl internals: WHY? :-)

Upon the advice of a fellow PerlMonk, I did a Devel::Peek dump of the scalar returned by currency_symbol. The first two are with no magic; the third is with the fix under 5.8.4:

-------------- 5.6.1 --------------
SV = PV(0x14045dc) at 0x1409e8c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x142d9fc "\302\245"\0
  CUR = 2
  LEN = 3

-------------- 5.8.4 --------------
SV = PV(0x44c3d64) at 0x10590f4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x450ab24 "\245"\0
  CUR = 1
  LEN = 2

----------------- 5.8.4 w/ upgrade -----------------
SV = PV(0x44f91dc) at 0x104d644
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x4518aa4 "\302\245"\0 [UTF8 "\x{a5}"]
  CUR = 2
  LEN = 3
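For anyone who wants to reproduce dumps like these without installing Locale::Currency::Format, here is a minimal sketch. It assumes the literal "\x{a5}" is a fair stand-in for what currency_symbol returns for JPY; that is an assumption, not something checked against the module's source.

```perl
use strict;
use warnings;
use Devel::Peek;    # core module; Dump() writes to STDERR

# Assumption: stand-in for the value returned by currency_symbol('JPY', ...)
my $symbol = "\x{a5}";

# On 5.8 and later this should show a one-byte PV with no UTF8 flag
Dump($symbol);

# Same character, but the internal storage is converted to UTF-8
utf8::upgrade($symbol);

# Now the PV holds two bytes and FLAGS includes UTF8
Dump($symbol);
```

The addresses and some of the flags will differ between perl versions; the parts to compare are the UTF8 flag, PV and CUR.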

So, now what? Is my fix appropriate? I imagine some level of tinkering with the L::C::Format source would yield a fix as well, but it's not really reasonable to expect everyone to go through that.

Here's my guess: http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

"However, as an interim compatibility measure, Perl aims to provide a safe migration path from byte semantics to character semantics for programs. For operations where Perl can unambiguously decide that the input data are characters, Perl switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics."

So it can't guess well in 5.8 with \x{}, so I have to give it the hint.
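One way to test that guess on a 5.8 or later perl is to compare a \x{} escape below 256 with one above 255. This is only a sketch; the variable names are made up for illustration, and utf8::is_utf8 is used purely as a way to inspect the internal flag.

```perl
use strict;
use warnings;

my $yen   = "\x{a5}";      # code point 165, fits in a single byte
my $smile = "\x{263A}";    # code point 9786, cannot fit in a single byte

# A \x{} escape below 256 is kept as a plain byte, so no UTF8 flag is set.
# An escape above 255 can only be stored as UTF-8, so the flag is set.
printf "yen flagged:   %s\n", utf8::is_utf8($yen)   ? 'yes' : 'no';
printf "smile flagged: %s\n", utf8::is_utf8($smile) ? 'yes' : 'no';
```

If the yen string comes back unflagged, that would match the 5.8.4 dump, where the scalar holds the single byte "\245".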


I have a guess

autarch on 2005-03-21T02:01:52

Sounds like the Locale::Foo::Bar module is returning raw bytes not marked as UTF8, something like you'd get from doing chr(254) . chr(76). In 5.8.x UTF8-land, you want to do something like chr(2687) or "\x{2A12}". The difference in the latter two is that when you do it that way (in 5.8.x+) Perl _knows_ that the string is UTF8, so things like regexes, length(), substr() and so on operate at a character level, not a byte level.
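A small sketch of the byte-level versus character-level difference described above. The two bytes used here are the UTF-8 encoding of the yen sign rather than the values in the comment, and Encode::decode is brought in only to show one way of telling Perl what the bytes mean; neither choice comes from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw bytes that happen to be the UTF-8 encoding of U+00A5 (yen)
my $bytes = chr(0xC2) . chr(0xA5);

# A code point above 255, so Perl stores it as UTF-8 and flags it
my $wide = "\x{2A12}";

# Decoding tells Perl the bytes are UTF-8, giving one character
my $decoded = decode('UTF-8', $bytes);

printf "bytes:   length %d, flagged %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'yes' : 'no';
printf "wide:    length %d, flagged %s\n",
    length($wide), utf8::is_utf8($wide) ? 'yes' : 'no';
printf "decoded: length %d\n", length($decoded);
```

length() sees two characters in the undecoded bytes and one in each of the other strings, which is the kind of difference that also affects regexes and substr().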

Re: I have a guess

jk2addict on 2005-03-21T14:49:29

That's what confuses me. Locale::Currency::Format appears to be returning the symbol from a private array of its own, written as \x{00a5}. It should just work, but it doesn't in 5.8.4 ... 5.6.1 is just happy.

In either case, utf8::upgrade fixes the problem for me, since I can't rely on installs of Locale::Currency::Format to do the right thing... whatever that is in 5.8.4.