Illegal character 0x1FFFF

jozef on 2010-01-27T20:56:05

$ perl -le 'use warnings; my $x=chr(0x1FFFF)'
Unicode character 0x1ffff is illegal at -e line 1.

XML supports UTF-8, so I check for a valid UTF-8 string and use it in XML if it's valid. Right? No!!!

There are some characters that are perfectly valid in UTF-8 (or even in plain old ASCII) but are illegal in XML. The most obvious one is 0x00. Here is what the W3C XML 1.0 specification says:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
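That production translates almost directly into a Perl character class. A quick sketch (my own, not the module's code) showing that "valid UTF-8" and "valid XML" are different questions:

```perl
use strict;
use warnings;
no warnings 'utf8';   # allow chr() on surrogates and beyond-Unicode codepoints

# The Char production from the spec, as a Perl character class
my $xml_char = qr/\A[\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]\z/;

my %ok = map { $_ => (chr($_) =~ $xml_char ? 1 : 0) }
         0x00, 0x09, 0xD800, 0x1FFFF, 0x110000;

printf "U+%X => %s\n", $_, $ok{$_} ? "allowed" : "rejected"
    for sort { $a <=> $b } keys %ok;
```

0x00 and the surrogate 0xD800 are rejected even though Perl can represent them just fine, while 0x1FFFF (the character from the title, which Perl warns about) is perfectly legal XML.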

I spent some time playing with it and the result is XML::Char->valid(). The dev version of Data::asXML is using it now. If you want, have a look at the test suite and try to break it. :-)


UTF-8

Hansen on 2010-01-27T22:29:27

I'm sorry to disappoint you, but Perl_is_utf8_string can't be used to check for well-formed UTF-8. Perl's utf8 encoding form is a superset of Unicode and ISO/IEC 10646. Perl's encoding form supports codepoints up to 2**64-1 and has no problem with encoded UTF-16 surrogates or any other permanently reserved codepoints.
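To illustrate the point (a sketch of mine, not Hansen's code): Perl will happily build strings out of codepoints that no conformant UTF-8 decoder may accept:

```perl
use strict;
use warnings;
no warnings 'utf8';   # suppress the "illegal character" warnings

my $surrogate = chr(0xD800);        # a UTF-16 surrogate, forbidden in UTF-8
my $beyond    = chr(0x7FFF_FFFF);   # far beyond the Unicode ceiling of U+10FFFF

# Both are perfectly ordinary one-character Perl strings
my $len = length($surrogate) + length($beyond);
print "total chars: $len\n";   # 2
```

So a check built on Perl's internal encoding alone will accept strings that are not well-formed UTF-8 at all.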

Re:UTF-8

jozef on 2010-01-28T07:29:26

Baf, I'm not disappointed. Nothing can be more disappointing than encoding problems...

Hansen, can you have a look at http://github.com/jozef/String-isUTF8/blob/master/t/01_String-isUTF8.t and send a patch with failing tests?

Re:UTF-8

jozef on 2010-01-28T07:37:43

I've just seen your String::UTF8 module. I didn't know it existed; you should have mentioned it in the first place...

Wrong wrong wrong

Aristotle on 2010-01-28T15:26:46

Don’t look at the UTF8 flag. The UTF8 flag does not mean what you think it means. You can have a perfectly valid Unicode string that does not have its UTF8 flag set, and you can have a JPEG image in a string that does have its UTF8 flag set. The UTF8 flag is a lie. It should not have been called the UTF8 flag. There is no flag in Perl that means what you think the UTF8 flag means. Don’t look at the UTF8 flag.
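A small illustration of that point (my sketch): flipping the flag on a copy of a string changes nothing observable at the Perl level, which is exactly why it says nothing about validity:

```perl
use strict;
use warnings;
use utf8 ();   # for utf8::upgrade / utf8::is_utf8

my $plain    = "caf\x{E9}";   # U+00E9; UTF8 flag typically off here
my $upgraded = $plain;
utf8::upgrade($upgraded);     # same characters, UTF8 flag now on

my $same  = ($plain eq $upgraded) ? 1 : 0;
my $flags = (utf8::is_utf8($plain) ? 1 : 0) . "/" . (utf8::is_utf8($upgraded) ? 1 : 0);
print "equal: $same, flags: $flags\n";   # equal: 1, flags: 0/1
```

Two strings, different flag states, indistinguishable to any correct Perl-level code.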

What you want to do is very simple:

sub _is_valid_xml_string {
  $_[0] !~ /[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/
}

That’s it. Anything else is wrong. (No seriously. It’s wrong.)

Re:Wrong wrong wrong

jozef on 2010-01-28T18:06:01

The implementation of XML::Char is in XS code - http://cpansearch.perl.org/src/JKUTEJ/XML-Char-0.01/lib/XML/Char.xs and there is no UTF8 flag checking. In the test http://cpansearch.perl.org/src/JKUTEJ/XML-Char-0.01/t/01_XML-Char.t there are both valid and invalid strings with and without UTF8 flag.

Re:Wrong wrong wrong

Aristotle on 2010-01-29T03:15:21

I misunderstood where the problem is in the code, but it’s still wrong. Since it’s XS, you specifically do need to look at the flag, explicitly:

#!/usr/bin/perl

use strict;
use warnings;
use utf8 ();

use Test::More tests => 2;
use XML::Char;

my $str = "\xC3";
is( XML::Char->valid($str), !!1, "accept U+00C3 with UTF8 flag off" );

utf8::upgrade($str);
is( XML::Char->valid($str), !!1, "accept U+00C3 with UTF8 flag on" );

__END__
1..2
not ok 1 - accept U+00C3 with UTF8 flag off
#   Failed test 'accept U+00C3 with UTF8 flag off'
#   at - line 7.
#          got: '0'
#     expected: '1'
ok 2 - accept U+00C3 with UTF8 flag on
# Looks like you failed 1 test of 2.

The problem is that you’re using utf8_to_uvuni unconditionally. But the PV of a string with SvUTF8 off has a different format than when the flag is on. You should be using utf8_to_uvuni only if the flag is on; otherwise, you should just take one byte at a time from the string and use that directly.
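The same difference is visible from pure Perl (my sketch): the octets backing a single character change with the encoding form, which is why the XS code has to branch on SvUTF8 before decoding:

```perl
use strict;
use warnings;
use utf8 ();

my $chr    = chr(0xE9);   # one character, U+00E9
my $octets = $chr;
utf8::encode($octets);    # the octets a UTF8-flagged PV would hold: 0xC3 0xA9

my ($chars, $bytes) = (length($chr), length($octets));
print "characters: $chars, utf8 octets: $bytes\n";   # characters: 1, utf8 octets: 2
```

Feed the two-octet form to a byte-at-a-time reader, or the one-octet form to utf8_to_uvuni, and you decode garbage either way.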

FWIW, since there are only three ranges and three single codepoints, I wouldn’t use a loop for the conditionals. Just unroll the whole thing.

So add the above code to the module as 02_utf8_flag.t, remove Char.h, and replace Char.xs with the following code. After that, all tests will pass.

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

#include "ppport.h"

static UV
octet_to_uvuni(const U8 *s, STRLEN *retlen)
{
    *retlen = 1;
    return (UV) *s;
}

MODULE = XML::Char    PACKAGE = XML::Char

void
_valid_xml_string(string)
    SV* string;

    PREINIT:
        STRLEN len;
        U8 * bytes;
        int in_range;
        int range_index;

        STRLEN ret_len;
        UV     uniuv;
        UV     (*next_chr)(const U8 *s, STRLEN *retlen);

    PPCODE:
        bytes    = (U8*)SvPV(string, len);
        next_chr = SvUTF8(string) ? &utf8_to_uvuni : &octet_to_uvuni;

        while (len > 0) {
            uniuv = (*next_chr)(bytes, &ret_len);
            bytes += ret_len;
            len   -= ret_len;

            if (
                   ((uniuv <  0x20)   && (uniuv != 0x9) && (uniuv != 0xA) && (uniuv != 0xD))
                || ((uniuv >  0xD7FF) && (uniuv <  0xE000))
                || ((uniuv >  0xFFFD) && (uniuv <  0x10000))
                ||  (uniuv >  0x10FFFF)
            ) XSRETURN_NO;
        }

        XSRETURN_YES;

On an API stylistic note, I really really hate when modules expect me to call functions as methods. How about renaming the XS function to is_valid_xml_string and making it exportable? Then people have the option to either write XML::Char->valid($foo) or exporting it and writing is_valid_xml_string($foo).
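A pure-Perl sketch of that interface (the regex stands in for the XS implementation; the names are the ones proposed above, not anything XML::Char actually ships):

```perl
package XML::Char;
use strict;
use warnings;
use Exporter 'import';
our @EXPORT_OK = ('is_valid_xml_string');

# Plain function, exportable on request
sub is_valid_xml_string {
    $_[0] !~ /[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;
}

# Method form kept so XML::Char->valid($foo) still works
sub valid {
    my $class = shift;
    return is_valid_xml_string($_[0]);
}

package main;
my $as_method   = XML::Char->valid("fine")             ? 1 : 0;
my $as_function = XML::Char::is_valid_xml_string("\0") ? 1 : 0;
print "method: $as_method, function: $as_function\n";   # method: 1, function: 0
```

Callers then pick whichever spelling fits their code, and nothing forces a class-method call for what is really a plain predicate.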

Re:Wrong wrong wrong

jozef on 2010-01-29T09:35:30

thank you!