I need some Unicode examples for Effective Perl Programming

brian_d_foy on 2009-09-25T07:25:05

Josh McAdams and I are updating Effective Perl Programming, and I'm working on a bunch of items dealing with Unicode.

I need some really nice non-English and especially non-Romance language examples for some of the features we want to discuss. I'd love to be able to include sample strings in Chinese, Japanese, Russian, Portuguese, Arabic, and all sorts of other languages I have no clue about. Most of what I need are the sample phrases. If you don't have something interesting, maybe you can translate "Perl mongers" for me in an example like:

use utf8;
use 5.010;             # for say
use charnames ':full'; # for \N{...}
my $phrase = '...'; # fill in your phrase

if( $phrase =~ m/\N{Some charname}/ ) { say 'I matched a ...'; }


I also want to add a couple of examples of other encodings, especially non-Western ones. I have no idea about those encodings, but I don't need anything fancy.

I'm sure that everything is going to get messed up and translated incorrectly, so I'll be sure to let you see the proofs of your example to ensure the typesetters get it right in the end. :)


My advice…

Dom2 on 2009-09-25T09:11:32

Seeing as you didn't ask for it. ;-)

Always use UTF-8 if you possibly can. It's (more-or-less) a superset of everything else, and it's properly detectable.

If you're looking for interesting encodings, I'd recommend checking out one of the Shift-JIS things. Just for weirdness. Personally, I've little experience of non-Western encodings.

For more concrete use cases to cover with encoding, you should look at:

  • Query parameters coming in from browsers.
  • POSTed form parameters coming in from a browser.
  • The encoding that command-line arguments and environment variables use. Yes, this caught me out earlier this week.
  • Getting the right encoding from the database.

There are a lot of places where you need to consider byte to character conversion (and vice versa).
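
As a minimal sketch of that byte-to-character conversion, assuming the incoming data is UTF-8 and using the core Encode module: decode once where the bytes enter the program, work with characters inside, and encode again where they leave. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a browser, database, or file:
# the UTF-8 encoding of "Ādam" (U+0100 is the two bytes C4 80).
my $bytes = "\xC4\x80dam";

# Decode at the boundary: bytes in, characters out.
my $chars = decode('UTF-8', $bytes);

# The byte string is one element longer than the character string.
printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# Work with characters inside the program...
my $upper = uc $chars;

# ...and encode again on the way out.
my $out = encode('UTF-8', $upper);
```

Each of the places in the list above is one such boundary; the part that varies is how you find out which encoding the bytes are really in.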

Sorry if I'm rambling over a bunch of places you've already covered! Character encoding always seems to affect me... (e.g. the fact I can't use a proper ellipsis character in this comment box!)

-Dom

Re:My advice…

brian_d_foy on 2009-09-25T09:20:50

Can you tell me more about the command line and environment variable problem? I think I'll have the other ones covered, but I'd like to know how you solved that one. I don't recall reading anything about how Perl will treat those.

Sounds like it would be very platform-specific

grantm on 2009-09-25T09:35:34

Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)

Re:Sounds like it would be very platform-specific

Hinrik on 2009-09-29T16:09:46

"Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)"

Interesting. I could have used that module a couple of years ago. Since then I've been using this trick to convert UTF-8-or-CP1252 byte strings to UTF-8 text strings:

    use Encode qw(decode);
    use Encode::Guess;

    my $line = <>;
    my $utf8 = guess_encoding($line, 'utf8');
    $line = ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);

http://search.cpan.org/perldoc?POE::Component::IRC::Common#irc_to_utf8
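
A different way to get a similar UTF-8-or-CP1252 result, sketched here as an assumption rather than as what the linked module does, is to skip the guessing step: attempt a strict UTF-8 decode and fall back to CP1252 only if that fails. The helper name decode_utf8_or_cp1252 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try a strict UTF-8 decode first and fall back
# to CP1252 when the bytes are not well-formed UTF-8.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps $bytes intact for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : decode('cp1252', $bytes);
}

my $from_utf8   = decode_utf8_or_cp1252("caf\xC3\xA9");  # UTF-8 bytes for "café"
my $from_cp1252 = decode_utf8_or_cp1252("caf\xE9");      # CP1252 bytes for "café"

print "both inputs decode to the same text\n" if $from_utf8 eq $from_cp1252;
```

This relies on the same observation as the trick above: text in a legacy single-byte encoding rarely happens to be well-formed UTF-8, so a successful strict decode is strong (though not conclusive) evidence that the input really was UTF-8.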

Re:My advice…

Dom2 on 2009-09-25T11:15:25

It's controlled through the -C flag (see perlrun). Here's an example of using U+0100 (Ā) on the command line. The file contains the word "Ādam".

$ mate ~/Desktop/adam.txt
$ adam=$(<~/Desktop/adam.txt)
$ xxd ~/Desktop/adam.txt
0000000: c480 6461 6d0a                           ..dam.
$ perl -MDevel::Peek -le 'Dump $ARGV[0]' $adam
SV = PV(0x801168) at 0x800954
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2044f0 "\304\200dam"\0
  CUR = 5
  LEN = 8
$ perl -MDevel::Peek -CA -le 'Dump $ARGV[0]' $adam
SV = PV(0x801168) at 0x800954
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2044f0 "\304\200dam"\0 [UTF8 "\x{100}dam"]
  CUR = 5
  LEN = 8

Actually, my problem was with Java, but the same principle applies. :) I've just noticed that this doesn't work for environment variables. That's a shame.

$ export adam
$ perl -MDevel::Peek -le 'Dump $ENV{adam}'
SV = PVMG(0x80a4c0) at 0x800b40
  REFCNT = 1
  FLAGS = (SMG,RMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x2057d0 "\304\200dam"\0
  CUR = 5
  LEN = 8
  MAGIC = 0x2057e0
    MG_VIRTUAL = &PL_vtbl_envelem
    MG_TYPE = PERL_MAGIC_envelem(e)
    MG_LEN = 4
    MG_PTR = 0x205800 "adam"
$ perl -MDevel::Peek -CA -le 'Dump $ENV{adam}'
SV = PVMG(0x80a4c0) at 0x800b40
  REFCNT = 1
  FLAGS = (SMG,RMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x2057d0 "\304\200dam"\0
  CUR = 5
  LEN = 8
  MAGIC = 0x2057e0
    MG_VIRTUAL = &PL_vtbl_envelem
    MG_TYPE = PERL_MAGIC_envelem(e)
    MG_LEN = 4
    MG_PTR = 0x205800 "adam"

This is all on perl 5.8.8, BTW. It may be fixed in later versions.

cmdline handling

srezic on 2009-09-26T13:43:28

Command line arguments come in as raw bytes. So you have to detect the codeset of the user's environment and decode if necessary. Roughly like this:

use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode);
my $codeset = langinfo(CODESET);
for (@ARGV) { $_ = decode $codeset, $_ }

Likewise for environment variable values.
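
A sketch of the environment-variable case, following the same langinfo approach. The helper name decoded_copy and the fallback to UTF-8 when Encode does not recognise the locale's codeset name are assumptions, not part of the code above.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode find_encoding);

# Hypothetical helper: return a copy of a hash of byte strings with
# every value decoded from the given codeset into characters.
sub decoded_copy {
    my ($codeset, %raw) = @_;
    return map { $_ => decode($codeset, $raw{$_}) } keys %raw;
}

my $codeset = langinfo(CODESET);

# Assumption: fall back to UTF-8 if Encode does not know this name.
$codeset = 'UTF-8' unless find_encoding($codeset);

# Decode into a separate hash and leave %ENV itself as raw bytes.
my %env = decoded_copy($codeset, %ENV);
```

Decoding into a copy is a deliberate choice in this sketch: %ENV is what child processes inherit, so it seems safer to keep its values as the bytes the system handed over.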

Some Swedish Unicode

jeremiah on 2009-09-29T20:43:41

Jag önskar dig fortsatt trevlig läsning och ha det så bra!

Translation: I wish you more fun reading and take it easy!

real world example

daxim on 2009-10-23T10:36:12

My recommendation is to avoid \N and \x escapes except for whitespace and combining characters. Literal characters that can be read immediately and copy-pasted anywhere are much more useful.
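
To make the comparison concrete, here is a small sketch showing one string written three ways, plus a combining character where an escape is arguably the more readable choice. It assumes the source file is saved as UTF-8; the sample word is only an illustration.

```perl
use strict;
use warnings;
use utf8;                    # the source file itself is saved as UTF-8
use charnames ':full';       # needed for \N{...} on perls before 5.16

my $literal = 'Ādam';                                       # readable, copy-pastable
my $named   = "\N{LATIN CAPITAL LETTER A WITH MACRON}dam";  # verbose
my $hex     = "\x{100}dam";                                 # opaque

print "all three are the same string\n"
    if $literal eq $named and $named eq $hex;

# A combining character is hard to spot as a literal, so an escape helps.
# This is "A" followed by U+0304: five characters, not four.
my $combined = "A\N{COMBINING MACRON}dam";
print "the decomposed form is a different string\n" if $combined ne $literal;
```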

»Perl« is a proper name and is not translated (I haven't even seen it transliterated where it would be possible), »monger« is also very difficult to translate because of its multiple denotations in English (of course that word was picked deliberately for this reason). Can you substitute something easier?

I have a much better regex example anyway. I gave a talk about this at the last Vienna.pm meeting. The purpose is to break down a long list of country and city names into pages. This code sample is very DWIMmy and demonstrates several features:

  • Perl source code can be written in UTF-8.
  • Regex and its character classes can take literal characters.


use utf8;

# [...]

if ('ja' eq $self->_language) { # gojūon pagination
        %pages = (
                all => { label => '[all]', re => qr/.*/msx },
                0 => { label => '[あ]', re => qr/\A [ぁ-おァ-オ]/msx },
                1 => { label => '[か]', re => qr/\A [か-ごカ-ゴ]/msx },
                2 => { label => '[さ]', re => qr/\A [さ-ぞサ-ゾ]/msx },
                3 => { label => '[た]', re => qr/\A [た-どタ-ド]/msx },
                4 => { label => '[な]', re => qr/\A [な-のナ-ノ]/msx },
                5 => { label => '[は]', re => qr/\A [は-ぽハ-ポ]/msx },
                6 => { label => '[ま]', re => qr/\A [ま-もマ-モ]/msx },
                7 => { label => '[や]', re => qr/\A [ゃ-よャ-ヨ]/msx },
                8 => { label => '[ら]', re => qr/\A [ら-ろラ-ロ]/msx },
                9 => { label => '[わ]', re => qr/\A [ゎ-ゔヮ-ヴ]/msx },
        );
} else { # latin pagination
        %pages = (
                all => { label => '[all]', re => qr/.*/msx },
                0 => { label => '[A-D]', re => qr/\A [A-D]/msx },
                1 => { label => '[E-H]', re => qr/\A [E-H]/msx },
                2 => { label => '[I-L]', re => qr/\A [I-L]/msx },
                3 => { label => '[M-P]', re => qr/\A [M-P]/msx },
                4 => { label => '[Q-T]', re => qr/\A [Q-T]/msx },
                5 => { label => '[U-Z]', re => qr/\A [U-Z]/msx },
        );
}

This is not the complete code to achieve the result. People who are experienced in using i18n and collation will see the exceptions and edge cases at a glance. I left that part out here because I want to concentrate on the topic at hand. Do you want the rest, too?
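
For readers who want to see how such a table might be used, here is a minimal sketch that assigns names to pages. It copies three rows from the Japanese table above; the ordered array, the page_for helper, the '[other]' bucket, and the sample names are assumptions made for illustration, not the original code, and the edge cases mentioned above are not handled.

```perl
use strict;
use warnings;
use utf8;

# Ordered [label, regex] pairs; the patterns are copied from the table above.
my @pages = (
    [ '[あ]' => qr/\A [ぁ-おァ-オ]/msx ],
    [ '[か]' => qr/\A [か-ごカ-ゴ]/msx ],
    [ '[さ]' => qr/\A [さ-ぞサ-ゾ]/msx ],
);

# Hypothetical helper: label of the first page whose regex matches
# the start of the name, or undef if no page matches.
sub page_for {
    my ($name) = @_;
    for my $page (@pages) {
        my ($label, $re) = @$page;
        return $label if $name =~ $re;
    }
    return;
}

# Group some sample names by page; unmatched names go to '[other]'.
my %by_page;
for my $name (qw(イタリア カナダ あおもり さいたま)) {
    my $label = page_for($name);
    push @{ $by_page{ defined $label ? $label : '[other]' } }, $name;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $label (sort keys %by_page) {
    print "$label @{ $by_page{$label} }\n";
}
```

Because each character class lists a hiragana range and the matching katakana range, names written in either script land on the same page.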