This week was the can-of-Unicode-worms-festival week for the Perl 5 porters. Regular expressions were another recurrent topic. Read on for details.
The Big Topic of this week was UTF-8, Unicode, and how Perl deals with it.
This all started with a report about seemingly innocuous UTF-8 failures. Digging deeper, Chip Salzenberg pointed out a flaw in Perl's handling of Unicode strings: conversions from byte strings (with "regular" eight-bit chars) to UTF-8 currently map high-bit characters to Unicode without translation (or, depending on how you look at it, by implicitly assuming the byte strings are in Latin-1). This is potentially wrong, because Perl assumes the C locale by default. Thus upgrading a string to UTF-8 may change the meaning of its contents regarding character classes, case mapping, etc. But this behaviour was chosen in perl 5.8.x for backwards compatibility.
http://groups.google.com/groups?selm=20040310170059.GE2455%40perlsupport.com
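The effect described above can be observed with a small sketch. It assumes a perl running without a locale and without the later unicode_strings feature; the variable names are illustrative only. The same byte compares equal before and after the upgrade, yet its character-class membership changes.

```perl
use strict;
use warnings;

# A byte string holding the single byte 0xE9 (e-acute if read as Latin-1).
my $bytes  = "\xE9";
my $before = ($bytes =~ /\w/) ? 1 : 0;

# Upgrade a copy to Perl's internal UTF-8 representation.
# utf8::upgrade() is always available; no "use utf8" is needed.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $after = ($upgraded =~ /\w/) ? 1 : 0;

print "matches \\w before upgrade: $before\n";
print "matches \\w after upgrade:  $after\n";
print "strings still compare equal: ", ($bytes eq $upgraded ? "yes" : "no"), "\n";
```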
Jarkko Hietaniemi, former 5.8 pumpking and Unicode guru, stepped into the discussion and provided insight. Various solutions were proposed and discussed.
http://groups.google.com/groups?selm=40524044.1090704%40iki.fi
Should upgrading byte strings that contain characters in the range 0x80-0xFF be forbidden, or emit a warning? Autrijus Tang, deciding to speak in code, released a module on CPAN that implements the latter solution, and hopes it will be integrated into the core at some point in the 5.9 development track. This would also require turning the encoding pragma into a lexically scoped one (as locale currently is).
http://groups.google.com/groups?selm=1079275492.4005.8.camel%40localhost http://search.cpan.org/user/autrijus/encoding-warnings-0.03/lib/encoding/warnings.pm
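As a rough illustration of the condition such a warning would have to detect, here is a pure-Perl sketch. The helper needs_implicit_latin1() is hypothetical and is not how encoding::warnings is implemented; it only flags byte strings that contain bytes in the range 0x80-0xFF, which are the strings whose upgrade silently assumes Latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of encoding::warnings): true when a string
# is still a byte string and contains bytes in the range 0x80-0xFF.
sub needs_implicit_latin1 {
    my ($str) = @_;
    return 0 if utf8::is_utf8($str);        # already a character string
    return $str =~ /[\x80-\xFF]/ ? 1 : 0;   # any high-bit byte present?
}

my $ascii   = "plain";       # pure ASCII, upgrade is harmless
my $highbit = "caf\xE9";     # byte string with one high-bit byte
my $chars   = "\x{263A}";    # character string, stored as UTF-8 internally

for my $s ($ascii, $highbit, $chars) {
    my $hex = join ' ', map { sprintf "%02X", ord($_) } split //, $s;
    printf "%-5s %s\n", (needs_implicit_latin1($s) ? "warn" : "ok"), $hex;
}
```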
While we're at it, Nick Ing-Simmons wonders what the proper method is for XS coders to get UTF-8 data (without converting an SV to UTF-8 in place, which is considered a Bad Thing). Sadahiro Tomoyuki provides some answers.
http://groups.google.com/groups?selm=20040307181606.2729.7%40llama.ing-simmons.net
substr() lvalues
Ton Hospel reported (some time ago) bug #24346, concerning the behaviour of the return value of substr() when it is used as an lvalue. He points out, with examples, that the current situation is not satisfactory, because the lvalue acts as a fixed-length window. In some cases this causes surprising action at a distance, making a variable (coming from the result of a substr()) hold a value different from the one that was assigned to it.
Graham Barr fixed this problem. Nicholas is apparently still undecided on whether this should go into perl 5.8, in the absence of any good argument for or against.
http://groups.google.com/groups?selm=rt-24346-66654.1.8290615224722%40rt.perl.org
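A minimal sketch of the kind of situation involved, not taken from the bug report itself: a reference to a substr() lvalue is kept, and a value longer than the original window is assigned through it. The variable names are made up for illustration. What the second line prints depends on whether the perl in use keeps the window at its original length or lets it follow the assignment.

```perl
use strict;
use warnings;

my $text   = "abcdef";
my $window = \substr($text, 1, 2);   # reference to the lvalue covering "bc"

$$window = "XYZW";                   # assign a longer value through the window

print "text:   $text\n";             # the whole replacement lands in $text
print "window: $$window\n";          # a fixed-length window shows only part of it
```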
Hugo reports that Damian reported that use re 'eval' is not seen in patterns interpolated at run-time via /(??{...})/. Yitzchak Scott-Thoennes explains that this comes from the fact that this compile-time pragma setting is no longer seen at run-time (and this is one more reason to rewrite the support for pragmas in the core).
http://groups.google.com/groups?selm=200403100343.i2A3hWP03026%40zen.crypt.org
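For background, the sketch below shows what the pragma normally controls; it does not reproduce the reported bug. A pattern held in a plain string and containing a code block is refused at run-time unless use re 'eval' is in lexical scope. The variable names and the pattern are illustrative.

```perl
use strict;
use warnings;

our $hits = 0;
my $pattern = 'a(?{ $main::hits++ })b';   # code block inside a plain string

# Outside the pragma's scope, interpolating the pattern is a fatal error.
my $without = eval { "ab" =~ /$pattern/; 1 } ? "compiled" : "refused";

# Inside the lexical scope of the pragma, the same pattern is accepted.
my $with = do {
    use re 'eval';
    eval { "ab" =~ /$pattern/; 1 } ? "compiled" : "refused";
};

print "without use re 'eval': $without\n";
print "with use re 'eval':    $with\n";
print "code block executions: $hits\n";
```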
Hugo also reports a case of an incorrect regexp compilation warning (bug #27603) with /(??{...})/ blocks:
http://groups.google.com/groups?selm=rt-3.0.8-27603-81805.2.6610882472044%40perl.org
Jamie Lokier found a bug in the regular expression engine, more precisely in the optimisation pass (bug #27515), leading to a wrong interpretation of the regular expression /^(.*)(?=x)x/. Hugo confirmed that this was a known bug, possibly difficult to fix.
http://groups.google.com/groups?selm=rt-3.0.8-27515-81033.1.09945237479955%40perl.org
Jamie also found that using return() from a /(?{...})/ block may lead to a segmentation fault (bug #27595). Such blocks are considered completely broken by the higher authorities (Dave Mitchell) and will hopefully be reimplemented.
Rafael reports that the source-filter-based Switch module is confused by occurrences of the ($) function prototype in the filtered source. (Bug #27472.)
Chip Salzenberg fixed the line-buffering problem noticed by Stas Bekman last week.
Paul Kramer remarks that one can't change the ownership of a symlink with perl's chown() built-in. Rafael suggests adding lchown() to the POSIX module (which already contains chown()). (Bug #27547.)
Nicholas Clark proposed a load of patches for Storable: fixes for storing restricted hashes, references to undef, plus a space optimization. (Bug #27616.)
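For readers unfamiliar with restricted hashes, the following sketch round-trips one through Storable. It only illustrates the kind of data the patches concern and does not reproduce the specific problems being fixed; the hash contents are made up. Whether the thawed copy is still restricted is printed rather than assumed.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use Hash::Util qw(lock_keys);

my %config = (host => 'localhost', port => 8080);
lock_keys(%config);                  # restrict the hash to its current keys

# Serialize to a string and rebuild a copy from it.
my $copy = thaw(freeze(\%config));

# Try to add a key that the original hash would not accept.
my $added = eval { $copy->{extra} = 1; 1 } ? "allowed" : "refused";

print "host=$copy->{host} port=$copy->{port}\n";
print "adding a new key to the copy was $added\n";
```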
Arthur Bergman released the second development release of Ponie, which seems to be impressive so far.
http://groups.google.com/groups?selm=EBE5CF35-7445-11D8-8D57-000A95A2734C%40nanisky.com
Tels released new versions of his math packages, Math::BigInt v1.70, bignum 0.15, and Math::BigRat 0.12.
This summary was written by Rafael Garcia-Suarez. Weekly summaries are published on http://use.perl.org/ and posted on a mailing list, whose subscription address is perl5-summary-subscribe@perl.org . Comments and corrections are welcome.