Another Data::FormValidator Filter

barbie on 2009-08-10T13:23:21

At YAPC::Europe 2009 last week, I launched the Conference Survey during the final keynote, and almost immediately people began submitting their responses. I'll be posting more about the surveys later in the week, but this post concerns itself with a specific technical aspect.

Smylers, being a rather clever fellow, likes to find the edge cases. He found one such edge case in the survey submissions, and although it wasn't a vulnerability, it was potentially presenting a misleading error to users. The problem arose from the use of what are usually referred to as Microsoft "smart" characters. These are characters that don't conform to standard Unicode character sets, as they use a range that is supposed to be reserved for control characters (see Wikipedia for more details).

Smylers had entered an en-dash and some double quote characters from a Windows machine, and had attempted to submit one of the talk feedback forms. The result was a rather confusing error: the backend of the survey system had deleted the field containing the smart characters, because they fell within a range not accepted as string characters by the validation code, and had flagged it as an input error. The solution was to add a filter to the Data::FormValidator profile that translates the characters into something more sensible before validating the input string. Which is what I did.
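The real fix is a Perl filter plugged into a Data::FormValidator profile, but the "translate before you validate" idea can be sketched in a few lines of Python. Everything here is illustrative: the mapping table, the toy validator, and the sample string are my own, not code from the survey backend.

```python
# Illustrative sketch of the "demoronise before validating" idea.
# The mapping and validator below are hypothetical stand-ins for the
# real Perl filter and validation profile.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}

def demoronise(text: str) -> str:
    """Translate 'smart' punctuation into plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_ASCII))

def validate(text: str) -> bool:
    """Toy validator that, like the survey backend, rejects anything
    outside printable ASCII plus common whitespace."""
    return all(32 <= ord(ch) < 127 or ch in "\r\n\t" for ch in text)

raw = "Smylers\u2019 talk \u2013 \u201cgreat stuff\u201d"
assert not validate(raw)          # rejected as submitted
assert validate(demoronise(raw))  # passes once the filter has run
```

The key point is ordering: the translation runs as a filter before validation, so the user never sees an error for characters the system can trivially fix.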

As a result, Data-FormValidator-Filters-Demoroniser is now winging its way to CPAN. The code has been in the backend system for some time, just not in the right place to pre-validate input strings. As it turned out, it was much easier to abstract it into a new module than to rewrite some of the internal code.

My thanks to Smylers for initially spotting and reporting the bug, the guys behind Data::FormValidator for making it so easy to add the filter, and Dave Wheeler for already implementing many of the translations via his Encode::ZapCP1252 module.


But shouldn't that be done at decoding?

zby on 2009-08-11T08:40:43

I mean at the time of converting byte stream into characters.

Unicode

Smylers on 2009-08-19T11:40:04

Actually they were standard Unicode characters. And I wasn't actually trying to find edge cases; I was just aiming for nice typography and stumbled upon the bug by accident!

For the record, I'd like it to be known I wasn't anywhere near Windows! I was actually using Ubuntu Linux running Gnome. Keyboard preferences lets you define a 'compose' key (I chose Caps Lock, cos that isn't something I ever use) then you can type sequences like Compose --- to get an em dash, or Compose "< to get opening curly quotes; the sequences are reasonably mnemonic.

And those are legitimate Unicode characters. Latin-1 doesn't have them, but then Latin-1 is only an 8-bit encoding so doesn't have most characters. Windows CP1252 caused problems by being kind-of like Latin-1, but with additional characters filling in slots Latin-1 left unused; CP1252 text often got mislabelled as Latin-1, messing things up for non-Windows users.

But all the CP1252 characters are in Unicode, and today you're much better off using the Unicode UTF-8 encoding than either Latin-1 or CP1252, especially on the web.
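Smylers' point about mislabelling can be shown in miniature (a Python sketch of my own, not anything from the survey code): the same byte is a curly quote under CP1252 but an unassigned C1 control character under Latin-1, so text labelled with the wrong encoding decodes to garbage.

```python
# The mislabelling problem in miniature: byte 0x93 is a curly quote
# in CP1252 but a C1 control character in Latin-1.
raw = b"\x93hello\x94"  # CP1252-encoded curly double quotes

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

assert as_cp1252 == "\u201chello\u201d"  # proper curly quotes
assert as_latin1 == "\x93hello\x94"      # control characters, not quotes
```

This is why zby's suggestion of handling it at decode time only works when you actually know the encoding the bytes arrived in; a form submission mislabelled as Latin-1 decodes without error but yields the wrong characters.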

(And apologies for the delayed response; feed backlog built up while away.)