(Editorial: Don't frontpage this post, editors. I'm writing it down here to summarize my thoughts, and I want feedback from my trusted readers, NOT flame wars or another giant thread of UTF-8 flag woes.)
I can finally say that, as of just lately, I fully grok Unicode, the UTF-8 flag and all that stuff in Perl. Here is some analysis of how Perl programmers understand Unicode and the UTF-8 flag.
(This post might need more code to demonstrate and visualize what I'm talking about, but I'll leave that as homework for readers, or at least as something for me to do before YAPC::Asia if there's demand for this talk :))
Level 1. "Take that annoying flag off, dude!"
They, typically web application developers, assume all data is encoded in UTF-8. If they encounter garbled characters (a.k.a. Mojibake in Japanese), which they think is a perl bug, they just make an ad-hoc call to:
Encode::_utf8_off($stuff)
utf8::encode($_) if utf8::is_utf8($_)
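As a small sketch of why the flag-dependent idiom is fragile (the helper name adhoc_bytes is made up for this illustration), the code below takes the same one-character text string in perl's two internal representations and shows that the idiom produces different bytes for each:

```perl
use strict;
use warnings;

# Two copies of the same one-character text string, U+00E9.
my $native   = "\x{E9}";    # stored internally as a single latin-1 byte
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);   # same character, stored internally as UTF-8

my $same_text = ($native eq $upgraded) ? 1 : 0;

# Hypothetical helper wrapping the ad-hoc idiom:
# encode only when the internal flag happens to be on.
sub adhoc_bytes {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

my $from_native   = adhoc_bytes($native);     # "\xE9", left as latin-1
my $from_upgraded = adhoc_bytes($upgraded);   # "\xC3\xA9", UTF-8

printf "same text: %d, same bytes: %d\n",
    $same_text, ($from_native eq $from_upgraded) ? 1 : 0;
```

The two inputs compare equal as text, yet the output depends on an internal detail the caller usually cannot see.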
Re:Thanks
miyagawa on 2008-02-20T22:50:16
Thank *you* for the good tutorial.
So, I've been thinking there needs to be some standard way for CPAN modules to declare whether they accept/return strings or bytes. (If they need to handle both.)
For instance, HTML::Parser has an instance method called utf8_mode.
Another example (the one that triggered me to write this entry) is Catalyst's uri_for() method. At some release the developers changed the implementation to accept only strings (UTF-8 flagged or not) in its %query_values hash.
Based on the complaints and patches from Japanese developers, they changed the code to accept both strings OR UTF-8 bytes, by doing utf8::encode() if utf8::is_utf8(). As said in the post, this might break latin-1 strings unless users explicitly upgrade them with utf8::upgrade() before passing them to the method.
I was suggesting that they add another method, like uri_for_bytes, which wouldn't do any utf8::encode() inside the module and would treat everything as bytes. But then another idea struck me: "Hey, perl has a core pragma to say that".
bytes.pm.
Does it sound crazy to change the behavior of these modules by looking at the %^H hash values to see if bytes.pm is enabled? (Maybe we can wrap it as something like bytes::enabled.) I know enabling bytes.pm affects functions like index(), substr() and length() globally, so this might not be what you want. You might just want to pass one argument as bytes, and let the other modules/behaviors stay in Unicode semantics. Maybe some package-scoped bytes pragma?
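A rough sketch of what such a check could look like. One detail differs from the idea as stated: bytes.pm sets a bit in $^H rather than an entry in %^H, and a function can read its caller's compile-time $^H through caller(). The name wants_bytes is hypothetical, and this only shows that detection is possible, not that it is a good interface:

```perl
use strict;
use warnings;
use bytes ();   # load bytes.pm for $bytes::hint_bits without enabling the pragma

# Hypothetical helper: returns 1 if the statement that called it
# was compiled with "use bytes" in effect, 0 otherwise.
sub wants_bytes {
    my $hints = (caller(0))[8] || 0;   # caller's compile-time $^H
    return ($hints & $bytes::hint_bits) ? 1 : 0;
}

my $plain = wants_bytes();             # no "use bytes" here

my $under_bytes;
{
    use bytes;                         # lexically scoped to this block
    $under_bytes = wants_bytes();
}

print "plain: $plain, under use bytes: $under_bytes\n";
```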
Hmm.
Strings or bytes
Juerd on 2008-02-21T02:22:43
Strings or bytes is not the right distinction, because both kinds are strings. I usually call them "text string" and "binary string", or "character string" and "byte string". Sometimes I call the former "Unicode string" to emphasize that all text strings are Unicode strings.
A trap is the UTF-8 string, which is a byte string representing characters, and has "the flag" off (which to perluninewbies is confusing because this flag is called UTF8). Compare this with the result of pack "N*", LIST, which is a byte string representing numbers. You'll note that UTF-32 looks a lot like pack N* in practice :).
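A short sketch of that comparison, assuming the big-endian, BOM-less variant UTF-32BE is what is meant by UTF-32 here. It encodes a text string to UTF-8 and to UTF-32BE, then builds the same bytes with pack from the code point numbers:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A\x{263A}";    # "A" followed by WHITE SMILING FACE

# Byte strings representing characters; the flag is off on both.
my $utf8_bytes  = encode('UTF-8',    $text);
my $utf32_bytes = encode('UTF-32BE', $text);

# A byte string representing numbers: each code point as a 32-bit big-endian integer.
my $packed = pack 'N*', map { ord } split //, $text;

printf "flag on UTF-8 bytes: %d\n", utf8::is_utf8($utf8_bytes) ? 1 : 0;
printf "UTF-32BE equals pack N*: %d\n", ($utf32_bytes eq $packed) ? 1 : 0;
```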
I strongly believe that the behavior of accepting both UTF-8 encoded strings, and SvUTF8 flagged strings, in the same function, is wrong.
I also strongly suggest that any use of "use bytes" and functions in the utf8:: namespace, is misguided. If you want to use bytes, either have a function that deals only with byte input, or have a function that deals only with text input and encode it yourself. The bytes.pm stuff does not encode, it provides a view into perl's internal byte buffer. For text strings, the encoding of this buffer may be either utf8 or latin1.
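To illustrate the difference between a view into the internal buffer and real encoding, here is a minimal sketch. It asks bytes::length for the buffer size of the same text string in both internal representations, and compares that with the length after an explicit encode_utf8:

```perl
use strict;
use warnings;
use bytes ();                  # makes bytes::length available without enabling the pragma
use Encode qw(encode_utf8);

# The same text string, U+00E9, in both internal representations.
my $native   = "\x{E9}";
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);

# Size of perl's internal buffer: depends on the representation.
my @internal = (bytes::length($native), bytes::length($upgraded));

# Size after explicit encoding: depends only on the text.
my @encoded = (length(encode_utf8($native)), length(encode_utf8($upgraded)));

print "internal buffer: @internal\n";   # 1 2
print "encoded UTF-8:   @encoded\n";    # 2 2
```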
For DBIx::Simple I have a similar dilemma. I can easily add automatic decoding/encoding for database values, and would love to do so. But databases can also be used for storing binary values. My current plan is to release a very simple CPAN module:
package BLOB;
use strict;
use warnings;
use Scalar::Util qw(blessed);
use base 'Exporter';
our @EXPORT = qw(is_blob);

# Mark the caller's variable as a BLOB. @_ aliases the arguments,
# so \$_[0] is a reference to the original scalar, not to a copy.
sub mark {
    my $class = shift;
    return bless \$_[0], $class;
}

# True if the variable passed in has been marked as a BLOB.
sub is_blob {
    my $ref = \$_[0];
    return blessed($ref) && $ref->isa('BLOB');
}

1;

=head1 SYNOPSIS

    use BLOB;
    use Encode qw(encode_utf8);

    BLOB->mark($jpeg_data);
    print is_blob($jpeg_data); # 1
    my $bytes = is_blob($foo) ? $foo : encode_utf8($foo);

=cut

(Typed in my browser, untested.)
Then, DBIx::Simple doesn't have to parse SQL and know which columns are blobs: the user can mark a string as a BLOB and I can just skip encoding for those values. PerlIO layers could be told to skip things marked as BLOB when encoding too.
Functions that for some reason need to accept both kinds of string (which can be necessary to support existing stuff, or in heavily abstracted code, but should generally be avoided), can then just tell the user to mark byte strings as blobs before passing them.
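A sketch of what such a function might look like. The names mark_blob, is_blob and to_bytes are stand-ins inlined here so the example runs on its own; they are not the API of any released module. One thing worth noting is that the mark lives on the variable, not on the value, so the function has to inspect $_[0] directly instead of a copy:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Scalar::Util qw(blessed);

# Stand-ins for the BLOB idea, inlined for this illustration only.
sub mark_blob { bless \$_[0], 'BLOB'; return }
sub is_blob   { return ((blessed(\$_[0]) || '') eq 'BLOB') ? 1 : 0 }

# Hypothetical function accepting either kind of string and returning bytes.
# It looks at $_[0] itself: a copy made with "my ($s) = @_" would lose the mark.
sub to_bytes {
    return is_blob($_[0]) ? $_[0] : encode_utf8($_[0]);
}

my $jpeg = "\xFF\xD8\xFF\xE0";   # typical first bytes of a JFIF file
mark_blob($jpeg);
my $name = "caf\x{E9}";          # a text string

my $jpeg_out = to_bytes($jpeg);  # passed through unchanged
my $name_out = to_bytes($name);  # encoded as UTF-8

my $copy         = $jpeg;        # copies the value, not the mark
my $copy_is_blob = is_blob($copy);

printf "jpeg kept: %d, name encoded: %d, copy still marked: %d\n",
    ($jpeg_out eq $jpeg) ? 1 : 0,
    ($name_out eq "caf\xC3\xA9") ? 1 : 0,
    $copy_is_blob;
```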
To take it one step further, it should have a mechanism like encoding::warnings in place to disallow (fatal error would be best, I believe) upgrading a BLOB.
Re:Strings or bytes
miyagawa on 2008-02-21T02:32:59
Hm, just to clarify, I prefer to use characters vs. bytes like you say. If I sometimes use "strings" somewhere, it's just a slip of the keyboard, or I meant Unicode strings instead.
And also, I'm a bit afraid that you misunderstood what I meant by mentioning bytes.pm. I didn't mean we should call "use bytes" in this situation to force string operations to be byte-wise. Not at all.
I meant declaring "use bytes" *might be* a good way for programmers to tell the module authors "Hey I want this module to do whatever string operations as bytes, not as characters".
But obviously that has a side effect and might not be a good thing at all. But I like the idea of having a pragma or something to explicitly say "I DONT WANT UNICODE WISE HERE".
BTW, I find that BLOB module would be useful for other stuff too.
Re:Strings or bytes
Juerd on 2008-02-21T02:59:57
"use bytes;" is lexical: it cannot influence what a module does. I don't know who to thank for this, but I'm happy that at least my code won't be broken at a distance by the numerous uninformed and misinformed people who throw a "use bytes" at their code to replace one kind of (for them) vague behavior with another kind of vague behavior.:)
Experience has shown so far that the only workable way of supporting both byte strings and text strings in your function is to provide two separate functions, or a mechanism to indicate what kind of string you're passing. My BLOB thing would be a standardized way of saying "this is a byte string, not a text string" that is very probably drop-in compatible with existing code.
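For the two-function approach, a minimal sketch using percent-escaping as the example task. The names escape_bytes and escape_text are hypothetical and not taken from any module discussed here. The text version encodes first and then hands off to the byte version, so neither one ever has to guess what it was given:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical: accepts a byte string only and percent-escapes
# every byte outside the URI unreserved set.
sub escape_bytes {
    my ($bytes) = @_;
    utf8::downgrade($bytes);   # dies if given characters above 0xFF
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# Hypothetical: accepts a text string only, encodes it as UTF-8,
# then escapes the resulting bytes.
sub escape_text {
    my ($text) = @_;
    return escape_bytes(encode_utf8($text));
}

my $from_text   = escape_text("caf\x{E9}");       # caf%C3%A9
my $from_utf8   = escape_bytes("caf\xC3\xA9");    # caf%C3%A9
my $from_latin1 = escape_bytes("caf\xE9");        # caf%E9

print "$from_text $from_utf8 $from_latin1\n";
```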
With the BLOB you're effectively saying "I DON'T WANT UNICODE HERE", but you're still dependent on the module author to comply. Fortunately, scanning documentation for the word "BLOB" is easily done :)
Re:Strings or bytes
miyagawa on 2008-02-21T03:23:01
Agreed on both: we should use two different functions to accept characters or bytes, and BLOB.pm would also be useful to DWIM. :)