More regex

robin on 2002-04-09T13:59:38

[Note: There is a full account of my recursive regex idea in this article.]

I've found the bug in my PCRE patch, which is partly to do with the way * repetitions are handled. But you don't actually need to use iterative repetitions any more, because you can replace iteration with recursion! /a*/ can be rewritten as /((a(?1))?)/. And if you do that, you sometimes avoid triggering the bug. So you can test for matching XML-style tags like this:

£^(<\w+/>|<(\w+)>([^<>]|(?1)|)(?3))$£
I'll fix the bug soon...

I've also managed to prove that all context-free languages can indeed be expressed. The proof takes the form of an algorithm for turning a context-free grammar into a regex:

  • Eliminate left recursion from the grammar. (This is a standard procedure, but quite complicated.)
  • Write the grammar as a system of equations. For example, the grammar
    • S --> ''
    • S --> '(' T ')' S
    • T --> S
    • T -> 'x' T
    becomes
    • S = '' + '(' T ')' S
    • T = S + 'x' T
  • Now work out the least solution for S in terms of a least fixpoint operator µ. In our example that is
    • µS. ('' + '(' µT.(S+'x'T) ')' S )
  • And translate the µ-expression into a regex:
    /(|\(((?1)|x(?2))\)(?1))/
  • (That regex doesn't actually work yet, because of various bugs. But it ought to.)
Of course, the interesting part is proving that the algorithm really works. I plan to write it up in more detail soon.