why UTF8 is wonderful

Lecar_red on 2005-03-08T22:59:02

I have been writing/supporting a localized and globalized web application that used (still uses) Shift JIS for Japanese character encoding. Our newest web app uses many better technologies (Mason, which I really really really like) and UTF8 for character encoding inside both the middleware and the web app when handling text and filenames. We do a lot of file processing (uploads and downloads). With that amount of filename processing (including stripping off the path or renaming when filenames exceed length limits), I ran into many Shift JIS characters that required special processing to protect them.

We have a couple of basename routines (subs, functions, etc. depending upon language) that detect and guard against Shift JIS slamming. Generally they follow this form:

    ## $path is the raw Shift JIS byte string; $sep is the directory separator
    my $bn;       ## basename string
    my $loc = 0;  ## current byte offset into $path
    while ($loc < length($path)) {
        my $chr = substr($path, $loc, 1);

        ## grab the basename if we match the
        ## directory sep
        if ($chr eq $sep) {
            $bn = substr($path, $loc+1);
            $loc++;
            next;
        }

        ## it's in the ascii range (or half-width katakana, a1-df)
        ## so it's a single byte character, only move forward one
        if ($chr =~ /[\x00-\x7f\xa1-\xdf]/) {
            $loc++;
            next;
        }

        ## first is dbl byte, skip following character which
        ## is part of the dbl character
        $loc += 2;
    }

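This basically walks each character looking for the directory separator, '/' (2F) or '\' (5C), while stepping over the magic hex patterns (815C, 825C, 835C... range) where the 5C is not a separator at all, since Shift JIS uses it as the second byte of a double-byte character. What A Pain in the ass...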
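To make the failure concrete, here is a small sketch (not our production code) comparing a naive "everything after the last backslash" match with a byte walk like the one above. The sub name `sjis_basename` is made up for this example, and the lead-byte ranges (81-9F and E0-FC) are my assumption about which Shift JIS variant is in play. The test path ends in the katakana "so", which Shift JIS encodes as 83 5C.

```perl
use strict;
use warnings;

## Hypothetical helper: walk a raw Shift JIS byte string and
## return whatever follows the last real directory separator.
sub sjis_basename {
    my ($path, $sep) = @_;
    my $bn  = $path;   ## no separator found means the whole path
    my $loc = 0;
    while ($loc < length $path) {
        my $chr = substr($path, $loc, 1);
        if ($chr eq $sep) {
            $bn = substr($path, $loc + 1);
            $loc++;
        }
        elsif ($chr =~ /[\x81-\x9f\xe0-\xfc]/) {
            $loc += 2;   ## lead byte: skip it and its trail byte
        }
        else {
            $loc++;      ## ascii or half-width katakana
        }
    }
    return $bn;
}

## C:\docs\<83 5C>.txt as raw bytes
my $path = "C:\\docs\\\x83\x5C.txt";

## naive: grab everything after the last 5C byte
my ($naive) = $path =~ /([^\\]*)\z/;
my $safe    = sjis_basename($path, "\\");

print "naive: ", unpack("H*", $naive), "\n";   ## 2e747874 (".txt", character lost)
print "safe:  ", unpack("H*", $safe),  "\n";   ## 835c2e747874
```

The naive match treats the trail byte as a separator and hands back just ".txt", which is exactly the kind of damage the walk is there to prevent.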

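But the blessed UTF8 does not require any of that crap. Every byte of a multi-byte UTF8 character is 80 or higher, so a '/' or '\' byte is always a real '/' or '\'. Yeah! And now I can use (at least for Perl) standard modules. See: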

    use File::Basename;

    my $path = shift;

    ## only change for mac or win. (unix ok)
    ## isWin and isMac are our own platform checks, defined elsewhere
    if (isWin()) {
        fileparse_set_fstype("MSWin32");
    } elsif (isMac()) {
        fileparse_set_fstype("MacOS");
    }

    ## default to unix since it rules...
    my $base = basename($path);
    fileparse_set_fstype("Unix"); ## reset it to avoid later issues
    return($base);
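A quick sketch of why the standard module is enough, assuming the path arrives as UTF8 bytes. The sample path is made up, and it uses the same katakana "so" (U+30BD) that is so much trouble in Shift JIS. Encoded as UTF8 it becomes E3 82 BD, with no byte in the ascii range for File::Basename to trip over.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Basename qw(basename fileparse_set_fstype);

## the troublesome character, this time as UTF8 bytes
my $so = encode("UTF-8", "\x{30BD}");

## collect any bytes below 80 (there should be none)
my @ascii_bytes = grep { $_ < 0x80 } unpack("C*", $so);
print "ascii-range bytes: ", scalar(@ascii_bytes), "\n";   ## 0

## hypothetical windows-style path, as UTF8 bytes
my $path = "C:\\docs\\" . $so . ".txt";

my $old  = fileparse_set_fstype("MSWin32");  ## returns the previous type
my $base = basename($path);
fileparse_set_fstype($old);                  ## put it back

print unpack("H*", $base), "\n";   ## e382bd2e747874
```

Saving the value returned by `fileparse_set_fstype` and restoring it is an alternative to hard-coding the reset to "Unix".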

Happiness... until the wrestling with content dispositions and UTF8, another story for another day...