Encode::DoubleEncodedUTF8

miyagawa on 2007-02-14T09:20:11

I released a new hacky module, Encode::DoubleEncodedUTF8.

This module adds a new fake encoding, "utf-8-de", that automatically finds doubly encoded utf-8 bytes (like \xc3\x83\xc2\xa9 where you meant \xc3\xa9, i.e. "é"). That kind of garbage typically shows up when you concatenate strings with the utf-8 flag on and off and then encode the result. I wouldn't suggest using this module in a production environment (because it might be slow), but it should be really handy for fixing the common mistakes made in perl and Unicode/I18N stuff.
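To make the failure mode concrete, here is a minimal sketch using only the core Encode module. It shows how mixing a flagged string with raw utf-8 bytes produces doubly encoded output, and then repairs it with a regexp. The helper name fix_double_encoded is made up for this illustration, it only handles two-byte sequences, and it is not the module's actual implementation.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same text twice: once as a character string (utf-8 flag on),
# once as raw utf-8 bytes (utf-8 flag off).
my $chars = decode('UTF-8', "caf\xc3\xa9");   # 4 characters, ends in U+00E9
my $bytes = "caf\xc3\xa9";                    # 5 bytes

# Concatenation upgrades $bytes as if it were Latin-1, so \xc3 and \xa9
# become the two characters U+00C3 and U+00A9.
my $mixed = $chars . ' ' . $bytes;

# Encoding for output now encodes those two characters again.
my $out = encode('UTF-8', $mixed);
# $out is "caf\xc3\xa9 caf\xc3\x83\xc2\xa9": the second word is doubly encoded.

# Hypothetical helper (assumption, two-byte sequences only): a doubly encoded
# two-byte character is a lead byte \xc2-\xdf re-encoded as \xc3[\x82-\x9f],
# followed by a continuation byte \x80-\xbf re-encoded as \xc2[\x80-\xbf].
sub fix_double_encoded {
    my ($octets) = @_;
    $octets =~ s{(\xc3[\x82-\x9f]\xc2[\x80-\xbf])}{
        my $match = $1;
        # Undo one layer: utf-8 bytes -> characters -> Latin-1 bytes.
        encode('ISO-8859-1', decode('UTF-8', $match));
    }gex;
    return $octets;
}

my $fixed = fix_double_encoded($out);
# $fixed is "caf\xc3\xa9 caf\xc3\xa9"
```

Note that a pattern like this can't tell a doubly encoded "é" from text that really was meant to contain "Ã©", so treat it as a cleanup tool for known-dodgy data rather than something to run over everything.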

The same methodology could also be applied to PHP/Java sites, since I see the same bugs on Amazon Web Services and YouTube.

UPDATE: I released 0.02. It no longer leans heavily on Encode::encode/decode to find doubly encoded utf-8 bytes, so it's 30 times faster than 0.01.


You...

jesse on 2007-02-14T22:47:56

You are my new hero.

Re:You...

miyagawa on 2007-02-14T23:26:05

Heh, thanks! I fixed the code so that it no longer calls Encode::decode/encode while it looks for dodgy utf-8 bytes, and released it as 0.02 on CPAN. It's all regexp-based now, and 30 times faster :)