I have been writing and supporting a localized and globalized web application that used (and still uses) Shift JIS for Japanese character encoding. Our newest webapps use much better technologies (Mason, which I really really really like) and UTF-8 for character encoding inside both the middleware and the web app when handling text and filenames. We do a lot of file processing (up and down). With that amount of filename processing (including stripping off paths, or renaming when filenames exceed length limits), I ran into many Shift JIS characters that required special processing to protect them.
We have a couple of basename routines (subs, functions, etc., depending on the language) that detect and guard against chopping a Shift JIS character in half. Generally they follow this form:
    my $loc = 0;      ## byte offset into $path (raw Shift JIS bytes)
    my $bn  = $path;  ## basename string; the whole path if no sep is found
    while ($loc < length($path)) {
        my $chr = substr($path, $loc, 1);

        ## grab the basename if we match the directory sep
        if ($chr eq $sep) {
            $bn = substr($path, $loc + 1);
            $loc++;
            next;
        }

        ## a Shift JIS lead byte (0x81-0x9F, 0xE0-0xFC) starts a dbl byte
        ## character, so also skip the following byte, which is part of it
        if ($chr =~ /[\x81-\x9F\xE0-\xFC]/) {
            $loc += 2;
            next;
        }

        ## ascii or single byte half-width katakana (0xA1-0xDF):
        ## only move forward one byte
        $loc++;
    }
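This basically walks the path one character at a time, so that a byte only counts as a separator when it is not the second half of a double-byte character. The troublemaker is '\' (0x5C): Shift JIS uses 0x5C as the second byte of many double-byte characters (0x815C, 0x835C, ...), so a naive search for the last backslash cuts those characters in half. '/' (0x2F) is below the range Shift JIS allows for a second byte, so it is safe. What a pain in the ass...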
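To make the failure concrete, here is a minimal sketch of what goes wrong. It assumes the path is held as raw Shift JIS bytes; the sample path and the helper names `hex_bytes` and `sjis_basename` are made up for illustration, and `sjis_basename` is just a compact version of the walk described above, not our production code.

```perl
use strict;
use warnings;

# Shift JIS bytes for the katakana "so" (0x83 0x5C) plus ".txt", inside a
# Windows-style directory. Written as escapes so the example does not
# depend on the encoding of this source file.
my $path = "C:\\docs\\" . "\x83\x5C" . ".txt";

# Hypothetical helper: render a byte string as hex for printing.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

# Naive approach: keep everything after the last 0x5C byte.
my ($naive) = $path =~ /([^\\]*)\z/;

# Shift JIS aware approach: skip the trail byte after every lead byte.
sub sjis_basename {
    my ($p, $sep) = @_;
    my ($loc, $start) = (0, 0);
    while ($loc < length $p) {
        my $chr = substr($p, $loc, 1);
        if ($chr =~ /[\x81-\x9F\xE0-\xFC]/) { $loc += 2; next; }  # lead byte
        $start = $loc + 1 if $chr eq $sep;
        $loc++;
    }
    return substr($p, $start);
}

my $aware = sjis_basename($path, "\\");

print "naive: ", hex_bytes($naive), "\n";
print "aware: ", hex_bytes($aware), "\n";
```

The naive version loses the first character of the filename, because the second byte of that character looks exactly like a backslash. The aware version keeps both bytes.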
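But the blessed UTF-8 does not require any of that crap. Every byte of a multi-byte UTF-8 character is 0x80 or higher, so '/' and '\' can never show up inside one. Yeah! And now I can use (at least for Perl) standard modules such as File::Basename. See: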
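    use File::Basename;  ## provides basename() and fileparse_set_fstype()

    ## isWin() and isMac() are assumed to be helper subs defined elsewhere
    sub path_basename {  ## sub name is illustrative
        my $path = shift;

        ## only change for mac or win. (unix ok)
        if (isWin()) {
            fileparse_set_fstype("MSWin32");
        }
        elsif (isMac()) {
            fileparse_set_fstype("MacOS");
        }

        ## default to unix since it rules...
        my $base = basename($path);
        fileparse_set_fstype("Unix");  ## reset it to avoid later issues
        return $base;
    }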
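For comparison, a small self-contained sketch of the same filename in UTF-8 going through File::Basename. The sample path is made up, and restoring the previous fstype (rather than hard-coding "Unix") is just one possible choice here; it relies on fileparse_set_fstype() returning the type that was in effect before the call.

```perl
use strict;
use warnings;
use File::Basename qw(basename fileparse_set_fstype);

# UTF-8 bytes for the katakana "so" (U+30BD): E3 82 BD
my $char = "\xE3\x82\xBD";
my $path = "C:\\docs\\" . $char . ".txt";

# Collect any bytes of the multi-byte character that fall in the ASCII
# range; for UTF-8 this list is always empty.
my @ascii_bytes = grep { ord($_) < 0x80 } split //, $char;

# Parse with Windows rules, then restore whatever fstype was active.
my $previous = fileparse_set_fstype("MSWin32");
my $base     = basename($path);
fileparse_set_fstype($previous);

print "ascii-range bytes inside the character: ", scalar(@ascii_bytes), "\n";
print "basename length in bytes: ", length($base), "\n";
```

No byte of the character can be mistaken for a separator, so the stock module returns the full filename with no special handling.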
Happiness... until the wrestling with Content-Disposition headers and UTF-8, but that is another story for another day.