A REALLY deep corner case

ChrisDolan on 2008-05-16T04:47:59

Try out this test program in a Perl prior to 5.8.8:

use Test::More tests => 3;
my $line = "\x{4E00}();" . ' ';
is(length substr($line, 1, 1), 1);
is(length substr($line, 1, 4), 4);
is(length substr($line, 1, 1), 1);


You'd expect that substrings of length 1 are always length 1, right? On my Mac (perl5.8.6) it produces:

1..3
ok 1
ok 2
not ok 3
#   Failed test at utf8_substr.t line 5.
#          got: '4'
#     expected: '1'
# Looks like you failed 1 test of 3.


This should surprise you, unless perhaps you were aware of the UTF-8 length caching bug(s) that haunted much of the 5.8.x series.

The program above is a minimal reduction of a failure in the PPI test suite (see RT#35917, "charsets.t eats all available VM"). The bug is only triggered when all of the following hold:

  • Perl 5.8.6 (and maybe 5.8.7?)
  • PPI above 1.201
  • Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end.
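
If you want to check whether a particular perl is affected without running the whole PPI suite, the three calls from the test program can be wrapped in a small helper. This is only a rough sketch: the name substr_length_is_stable is made up for illustration, and it assumes the bug shows up the same way inside a subroutine as it does at file scope.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PPI or core Perl): repeats the three
# substr calls from the test program and reports whether the lengths
# come back as expected.
sub substr_length_is_stable {
    my $line = "\x{4E00}();" . ' ';
    my @lengths = (
        length(substr($line, 1, 1)),
        length(substr($line, 1, 4)),
        length(substr($line, 1, 1)),
    );
    # A correct perl gives 1, 4, 1; the failing run above gave 4 for the last one.
    return "@lengths" eq "1 4 1";
}

print substr_length_is_stable()
    ? "substr lengths look sane on perl $]\n"
    : "substr lengths are inconsistent on perl $]\n";
```

A test script could call the helper up front and skip or work around the affected code path when it returns false.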


We would probably never have noticed, except that 5.8.6 is the default Perl for Mac OS X 10.4 (i.e., a popular point release) and a side effect of the bug in PPI was an infinite loop with a memory leak.

I'm VERY grateful that the core Perl developers include people smart enough to find and fix subtle bugs in the Unicode implementation like this one.


Yahtzee

jk2addict on 2008-05-16T13:09:18

"Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end." ...on a Tuesday, with a full moon, while sacrificing a chicken, and singing the words to Sweet Home Alabama, while standing on the right foot only...

Re:Yahtzee

Aristotle on 2008-05-16T16:44:38

“SCSI is *not* magic. There are fundamental technical reasons why it is necessary to sacrifice a young goat to your SCSI chain now and then.” —John Woods