Regex for UTF-8 octets (from perlunicode)

jtrammell on 2006-08-01T16:04:32

From "perldoc perlunicode":

Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

U+0000..U+007F         00..7F
U+0080..U+07FF         C2..DF    80..BF
U+0800..U+0FFF         E0        A0..BF    80..BF
U+1000..U+CFFF         E1..EC    80..BF    80..BF
U+D000..U+D7FF         ED        80..9F    80..BF
U+D800..U+DFFF         ******* ill-formed *******
U+E000..U+FFFF         EE..EF    80..BF    80..BF
U+10000..U+3FFFF       F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF       F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

And the equivalent regex:
qr{
        (?:
                                                [\x00-\x7f]  #   U+0000 .. U+007F
        |
                                    [\xc2-\xdf] [\x80-\xbf]  #   U+0080 .. U+07FF
        |
                               \xe0 [\xa0-\xbf] [\x80-\xbf]  #   U+0800 .. U+0FFF
        |
                        [\xe1-\xec] [\x80-\xbf] [\x80-\xbf]  #   U+1000 .. U+CFFF
        |
                               \xed [\x80-\x9f] [\x80-\xbf]  #   U+D000 .. U+D7FF
        |
                        [\xee-\xef] [\x80-\xbf] [\x80-\xbf]  #   U+E000 .. U+FFFF
        |
                   \xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf]  #  U+10000 .. U+3FFFF
        |
            [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf]  #  U+40000 .. U+FFFFF
        |
                   \xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]  # U+100000 .. U+10FFFF
        )
}x;

This has proven useful as I search for errant Latin-1 characters embedded in some files.
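
For anyone wanting to do the same kind of search, here is a minimal sketch of one way to use the pattern. It is not necessarily how I ran it against my own files. The helper name bad_byte_offsets is made up for this example, and it assumes the input is a raw byte string (read with binmode or a ":raw" layer, not already decoded). It walks the string with \G and /gc, skips each well-formed character, and records the offset of every byte that cannot start one.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character, following the perlunicode table above.
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
      | [\xc2-\xdf] [\x80-\xbf]
      | \xe0 [\xa0-\xbf] [\x80-\xbf]
      | [\xe1-\xec] [\x80-\xbf] [\x80-\xbf]
      | \xed [\x80-\x9f] [\x80-\xbf]
      | [\xee-\xef] [\x80-\xbf] [\x80-\xbf]
      | \xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf]
      | [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf]
      | \xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
    )
}x;

# Hypothetical helper (not from perlunicode): returns the offsets of all
# bytes in $bytes that do not begin a well-formed UTF-8 character.
# Assumes $bytes holds undecoded octets.
sub bad_byte_offsets {
    my ($bytes) = @_;
    my @offsets;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        # /c keeps pos() in place when the match fails.
        next if $bytes =~ /\G$utf8_char/gc;   # well-formed character, skip it
        push @offsets, pos($bytes);           # stray byte found here
        pos($bytes) = pos($bytes) + 1;        # step over it and resync
    }
    return @offsets;
}

# A Latin-1 e-acute (0xE9) in otherwise ASCII text is flagged at offset 3.
print join(",", bad_byte_offsets("caf\xe9 ok")), "\n";   # prints 3
```

If you only need a yes/no answer for a whole string, anchoring the same pattern as /\A$utf8_char*\z/ should be enough; the offset version is just handier when you want to know where the stray bytes are.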


Regexp::Common

ChrisDolan on 2006-08-02T03:45:05

Nice. You should submit this for inclusion in Regexp::Common.