Partly-compatable regular expressions

avar on 2007-06-23T04:49:15

Since I finished my changes to what will become the pluggable regex API in perl 5.10 I've been working on writing re::engines again, first on Plan 9 and then on finishing PCRE which audrey and yves started (but didn't quite finish).

I based the new PCRE wrapper on the Plan 9 and upgraded the underlying PCRE library to 7.2, and aside from a bug in how split is handled it works for most of the cases where the Perl engine does.

Having wrapped PCRE running Perl's own regex tests under PCRE becomes really easy. There are almost 1300 test for the regex syntax in t/op/regex.t in perl core. Running these under re::engine::PCRE reveals the following incompatibilities (and some bugs) between it and Perl:

(?{}) and (??{}) tests fail (obviously). Getting at least (?{}) to work might be possible with pcre's callout mechanims but I haven't looked closely at that.

A few tests such as "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/ fail because PCRE recurses away and runs into its internal MATCH_LIMIT while recursing. Using pcre_dfa_exec() instead of pcre_exec() yields a match but the DFA routine has different matching semantics.

One test fails because PCRE treats [a-[:digit:]] as an invalid range while Perl takes it to mean a character class matching 'a', '-' or a [:digit:]. Perl is probably being too permissive in this case.

"aba" =~ /^(a(b)?)+$/; say "$1-$2" will yield "a-b" under PCRE but "a-" under Perl. That is, PCRE eats the inner (b) while perl goes with the outer +. Both match the entire string.

Perl accepts curly modifiers on (?!) e.g. /foo(?!bar){2}/ but PCRE doesn't. I couldn't get Perl to do anything useful with that though. /foo(?!bar{2})/ works in both engines and doesn't match "foo" followed by "barbar".

PCRE does not match <!>!>><>>!>!>!> against ^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$ but Perl does. I haven't looked into why.

PCRE does not support (*FAIL) and (*F) which cause the pattern to fail, nor does it support (*ACCEPT).

Three tests try to match \x{85} against \R in an UTF-8 upgraded string ("\305\205") in a pattern that wasn't compiled with PCRE_UTF8. This isn't a PCRE issue but an API usage problem in re::engine::PCRE, the best solution is probably to upgrade all patterns and strings to UTF-8 before calling pcre_compile/pcre_exec.

PCRE accepts numeric keynames such as ^(?'0'ook)$, ^(?<0>ook)$, ^(?<1a>ook)$. These all match the literal string "ook" and set up the named capture "0" or "1a". Perl does not currently accept named buffers that start with a number.

re::engine::PCRE doesn't support multiple named match buffers under the same name while Perl does. At first I thought this was a PCRE limitation but it turns out that I just didn't know about the PCRE_DUPNAMES option:)

So aside from inline eval re::engine::PCRE is pretty much a drop-in replacement for Perl's engine. And perhaps more significantly PCRE's compatability can now be tested (and errors fixed) by running it against Perl's own test suite.

To run Perl's tests on PCRE get blead and re::engine::PCRE 0.10 (coming to a CPAN near you), build it and run:

perl5.9.5 -Mblib t/perl/regexp.t

By default it skips the failing tests, these can currently be enabled by commenting out line 86 in regexp.t:

@pcre_fail{@pcre_fail} = ();