Unicode URLs, wtf?

jjore on 2010-01-07T05:43:29

Hey internet, ⠸⠙⠱ ⠝⠉⠁⠈ ⠅⠝⠁⠕⠕⠉⠃ ⠝⠆⠏⠍⠞?

A year or more ago I was fixing work's web site to handle Unicode as entered by users into fields. We don't use CGI.pm because....? Well ok, we just don't. It doesn't handle Unicode properly either. Or at least almost no version does. Huh.

If a user types "Coatıcook" you'll probably get the dotless "i" character as either %C4%B1 or %u131, but the CGI.pm supplied with perl won't do something reasonable most of the time.

  • not ok 5.11.3 CGI-3.48
  • not ok 5.10.1 CGI-3.43
  • ok 5.10.0 CGI-3.29
  • not ok 5.8.9 CGI-3.42
  • not ok 5.6.2 CGI-2.752


Wut?

for v in 5.11.3 5.10.1 5.10.0 5.8.9 5.6.2; do
  /opt/perl-$v-64-thr-dbg/bin/perl \
    -le '
      use CGI;

      my $input  = "a=%u2021";
      my $expect = "\x{2021}";
      my $got    = CGI->new( $input )->param( "a" );

      print $expect eq $got ? "ok $] $CGI::VERSION" : "not ok $] $CGI::VERSION"
    '
done


CGI.pm

Hansen on 2010-01-07T11:31:36

CGI.pm decodes the non-standard (and, according to RFC 3986, invalid) %u escape into a UTF-8 octet string, but it doesn't decode that into a Perl Unicode string. I think the current behavior is desirable, since the data can contain any octets in any encoding.
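
To make the octets-versus-characters distinction concrete, here is a minimal sketch that uses only core modules. The helper unescape_to_octets is hypothetical and written for illustration; it mimics the behavior described above and is not CGI.pm's actual code. It turns both %XX and %uXXXX escapes into UTF-8 octets, and a separate, explicit decode step then produces a character string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper for illustration only; not CGI.pm's implementation.
# Replaces %XX with the raw octet and %uXXXX with the UTF-8 octets of
# that code point, in a single pass so nothing is unescaped twice.
sub unescape_to_octets {
    my ($escaped) = @_;
    $escaped =~ s{%(?:u([0-9A-Fa-f]{4})|([0-9A-Fa-f]{2}))}
                 { defined $1 ? encode('UTF-8', chr hex $1) : chr hex $2 }ge;
    return $escaped;
}

my $octets = unescape_to_octets('%u0131');   # same octets as for '%C4%B1'
my $chars  = decode('UTF-8', $octets);       # explicit step: octets -> characters

printf "%d octet(s), %d character(s)\n", length $octets, length $chars;
# prints "2 octet(s), 1 character(s)"
```

Comparing $octets against "\x{131}" fails, while comparing $chars against it succeeds, which would explain a "not ok" in a test that skips the decode step.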

--
chansen

Re: Unicode URLs, wtf?

daxim on 2010-01-07T12:12:04

> %u131

What sort of encoding is that? I mean, I can see it's the Unicode codepoint preceded by %u, but which standard backs this? I've never encountered this before.

Here's my take on it:

use CGI qw();
use Encode qw(decode_utf8);

my $input  = 'a=%C4%B1';
my $expect = "\x{131}";
my $got    = decode_utf8(CGI->new($input)->param('a'));
# as per best practice http://search.cpan.org/perldoc?CGI#-utf8

use Devel::Peek qw(Dump); Dump $expect; Dump $got;

print $expect eq $got
  ? "ok $] $CGI::VERSION"
  : "not ok $] $CGI::VERSION"

__DATA__
SV = PV(0x88bc40) at 0x8c12f8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x8aaad0 "\304\261"\0 [UTF8 "\x{131}"]
  CUR = 2
  LEN = 8
SV = PV(0xac9e60) at 0x8c13e8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0xad5740 "\304\261"\0 [UTF8 "\x{131}"]
  CUR = 2
  LEN = 8
ok 5.010001 3.48

Re: Unicode URLs, wtf?

Hansen on 2010-01-07T12:50:53

It usually comes from broken JavaScript applications that use escape() instead of encodeURI():


escape("\u263A") -> %u263A
encodeURI("\u263A") -> %E2%98%BA
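
For anyone who wants to reproduce the two forms on the Perl side, the following is a rough sketch using only core modules. It assumes a single character above U+00FF in the Basic Multilingual Plane; JavaScript's escape() treats lower code points differently, and this sketch does not cover that. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{263A}";    # WHITE SMILING FACE

# escape()-style: "%u" plus the code point as four hex digits (non-standard).
my $legacy = sprintf '%%u%04X', ord $char;

# encodeURI()-style: percent-encode each octet of the UTF-8 encoding.
my $standard = join '', map { sprintf '%%%02X', ord } split //, encode('UTF-8', $char);

print "$legacy\n";      # %u263A
print "$standard\n";    # %E2%98%BA
```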

--
chansen

:utf-8 ?

Yanick on 2010-01-08T01:18:50

Did you try using 'use CGI qw/ :utf8 /;'? That seems to work the way you want with CGI 3.49 (at least it seems to on my box).

Re: :utf-8 ?

jjore on 2010-01-09T05:14:19

Nope. I'd never noticed the option. My bad!