I have been writing and supporting a localized and globalized web application that used (and still uses) Shift JIS for Japanese character encoding. Our newest web apps use much better technologies (Mason, which I really really really like) and UTF-8 for character encoding inside both the middleware and the web app when handling text and filenames. We do a lot of file processing (up and down). With that amount of filename processing (including stripping off paths or renaming when filenames exceed length limits), I ran into many Shift JIS characters that required special processing to protect them.
We have a couple of basename routines (subs, functions, etc. depending upon language) that detect and guard against Shift JIS slamming. Generally they follow this form:
    ## $path holds the raw Shift JIS bytes, $sep the directory separator
    my $bn;       ## basename string
    my $loc = 0;  ## current byte offset into $path
    while ($loc < length($path)) {
        my $chr = substr($path, $loc, 1);
        ## grab the basename if we match the
        ## directory sep
        if ($chr eq $sep) {
            $bn = substr($path, $loc + 1);
            $loc++;
            next;
        }
        ## it's in the ascii range (or a single byte half-width
        ## katakana, A1-DF) so it's a single byte character,
        ## only move forward one character
        if ($chr =~ /[\x00-\x7f\xa1-\xdf]/) {
            $loc++;
            next;
        }
        ## first is dbl byte, skip following character which
        ## is part of the dbl character
        $loc += 2;
    }
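This basically walks the path byte by byte so that a '\' (5C) that is really the second byte of a double-byte character (815C, 825C, 835C... range) is not mistaken for a directory separator, since Shift JIS uses 5C as part of those characters. ('/' (2F) is never a second byte, so the backslash is the one that bites.) What a pain in the ass...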
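To make the problem concrete, here is a minimal sketch (not our production code) that uses the core Encode module to dump the Shift JIS bytes of two well-known offenders, U+30BD and U+8868. The choice of characters is mine, purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+30BD (katakana SO) and U+8868 (kanji "table") both end in 0x5C
# when encoded as Shift JIS.
for my $cp (0x30BD, 0x8868) {
    my $bytes = encode('shiftjis', chr $cp);
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    # index() on the raw bytes finds a "backslash" inside the character
    my $hit   = index($bytes, "\\");
    printf "U+%04X => %s (backslash byte at offset %d)\n", $cp, $hex, $hit;
}
# prints:
# U+30BD => 83 5C (backslash byte at offset 1)
# U+8868 => 95 5C (backslash byte at offset 1)
```

Any code that splits such a filename on '\' without tracking lead bytes will cut the character in half.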
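But the blessed UTF-8 does not require any of that crap: every byte of a multi-byte UTF-8 character is 0x80 or higher, so a 5C or 2F byte is always a real separator. Yeah! And now I can use (at least for Perl) standard modules such as File::Basename. See: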
    use File::Basename;  ## provides basename() and fileparse_set_fstype()

    my $path = shift;
    ## only change for mac or win. (unix ok)
    if (isWin()) {
        fileparse_set_fstype("MSWin32");
    } elsif (isMac()) {
        fileparse_set_fstype("MacOS");
    }
    ## default to unix since it rules...
    my $base = basename($path);    ## not $b, which sort() uses
    fileparse_set_fstype("Unix");  ## reset it to avoid later issues
    return($base);
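As a quick sanity check of the difference, the sketch below runs File::Basename over the same made-up Windows-style path encoded both ways. The path and variable names are hypothetical; only basename() and fileparse_set_fstype() come from the module.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Basename qw(basename fileparse_set_fstype);

# Hypothetical path: C:\docs\ followed by U+8868 and ".txt"
my $path = "C:\\docs\\\x{8868}.txt";

# fileparse_set_fstype() returns the previous type, so it can be restored
my $old = fileparse_set_fstype('MSWin32');
my $utf8_base = basename(encode('UTF-8', $path));
my $sjis_base = basename(encode('shiftjis', $path));
fileparse_set_fstype($old);

print "UTF-8 basename is intact\n"
    if $utf8_base eq encode('UTF-8', "\x{8868}.txt");
# The 5C trail byte is taken for a separator, so the name is cut short
print "Shift JIS basename was cut to: $sjis_base\n";   # .txt
```

Saving the value returned by fileparse_set_fstype() and restoring it afterwards is an alternative to hard-coding the reset to "Unix".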
Happiness... until the wrestling with content dispositions and UTF-8, another story for another day...