Weird Unicode corruption

mdxi on 2003-12-12T04:56:57

I have since come at this from a different direction and eliminated the problem, but I'm still confused by the behavior I saw:

I am reading in a Unicode text file, tab-delimited. Each line is split on tab into an array, and then some array elements are processed individually. One of them may contain katakana or hiragana text (the on or kun readings of kanji). For my first whack at a database import, I just passed this field through untouched.
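For context, a minimal sketch of that kind of import loop, reading the file through an explicit UTF-8 layer so Perl sees characters rather than raw bytes; the filename and sample line here are illustrative, not the actual data:

```perl
use strict;
use warnings;

# Write a one-line sample dictionary file (kanji TAB readings) so the
# sketch is self-contained; a real import would open the existing file.
my $file = 'sample.txt';
open my $out, '>:encoding(UTF-8)', $file or die "write: $!";
print $out "\x{6F22}\tkan,on\n";
close $out;

# Read it back through the same encoding layer and split each line on tab.
open my $in, '<:encoding(UTF-8)', $file or die "read: $!";
my @rows;
while (my $line = <$in>) {
    chomp $line;
    my @word = split /\t/, $line;    # one array element per field
    push @rows, \@word;
}
close $in;

printf "fields: %d\n", scalar @{ $rows[0] };    # fields: 2
```

With the `:encoding(UTF-8)` layer on both handles, every string in the program is a character string, so later regex work on individual fields never mixes byte and character semantics.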

Later I added this code, which checks to see what script the field is in:

@readings = split /,/, $word[4];
foreach $chunk (@readings) {
    $chunk =~ s/\s+//g;                   # strip any whitespace
    if ($chunk =~ /\p{InKatakana}/) {     # any katakana char => on reading
        $on .= $chunk . ",";
    } else {                              # otherwise treat as kun reading
        $kun .= $chunk . ",";
    }
}
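For comparison, here is the same on/kun split run on input that has been explicitly decoded from UTF-8 bytes to a character string first; the sample readings (katakana "カン" and hiragana "かん") are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';

# Decode raw UTF-8 bytes into a character string before matching \p{...}.
my $field = decode('UTF-8',
    "\xe3\x82\xab\xe3\x83\xb3,\xe3\x81\x8b\xe3\x82\x93");

my ($on, $kun) = ('', '');
for my $chunk (split /,/, $field) {
    $chunk =~ s/\s+//g;                   # strip any whitespace
    if ($chunk =~ /\p{InKatakana}/) {     # any katakana char => on reading
        $on .= $chunk . ',';
    } else {                              # otherwise treat as kun reading
        $kun .= $chunk . ',';
    }
}
print "on=$on kun=$kun\n";
```

Note that `\p{InKatakana}` is a block match and fires if *any* character in the chunk is katakana; `\p{Katakana}` (the script property) is a near-equivalent alternative.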


Nothing really unusual, but this code causes other fields containing non-Roman text to be corrupted. The corruption looks random, but it always hits the same fields on the same lines of the file. It is happening at the Perl level, not the Postgres level: commenting out the code, or backing out to a revision from before it was added, makes the problem go away.

I have absolutely zero idea what the actual issue could be; "Unicode" is the only obvious suspect. As mentioned above, I worked around it by doing the script detection elsewhere, but if anyone knows WHY this happened, I'd sure like to know.
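For anyone debugging this class of problem: one useful first check is whether Perl's internal UTF-8 flag is set consistently on the strings involved, since mixing byte-oriented and character-oriented strings in one program is a common source of apparent corruption. A minimal sketch using the core `utf8::is_utf8` and `Encode` (the sample bytes, hiragana "か", are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xe3\x81\x8b";               # raw UTF-8 bytes, flag off
my $chars = decode('UTF-8', $bytes);      # decoded characters, flag on

printf "bytes flagged: %s\n", utf8::is_utf8($bytes) ? "yes" : "no";  # no
printf "chars flagged: %s\n", utf8::is_utf8($chars) ? "yes" : "no";  # yes
```

If fields read from the file show the flag off while regexes like `/\p{InKatakana}/` are being applied to them, that mismatch is a plausible place to start looking.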