The Power of Ruby's Regex Engine

Ovid on 2004-09-09T19:29:13

At last night's Perl Mongers meeting, we again had some Ruby folk around to show us a bit of their language of choice. One thing I found interesting was their regex engine. Apparently it's reentrant. They have the same regex variables we do but those are merely copies of the relevant data. They are apparently not used to determine the state of the engine. Instead, each regex is assigned its own "match" object. As a result, a regex can embed a code block which in turn calls more regexes without blowing up. You can't do this in Perl.
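A minimal sketch of the match-object point, not anything shown at the meeting. It assumes the nearest everyday Ruby analogue of "a code block that calls more regexes" is a gsub block; the method name and sample text are made up for illustration.

```ruby
# Hypothetical helper: upcase every word that contains a vowel.
# The gsub block runs a second, unrelated regex while the outer
# substitution is still in progress.
def shout_vowel_words(text)
  text.gsub(/\w+/) do
    outer = Regexp.last_match          # match object of the outer regex
    inner = /[aeiou]/.match(outer[0])  # inner regex gets its own match object
    # `outer` still refers to the outer match after the inner regex has run.
    inner ? outer[0].upcase : outer[0]
  end
end

puts shout_vowel_words("try my ruby regex")   # prints: try my RUBY REGEX
```

The special variables ($~ and friends) are overwritten by the inner match, but the object held in `outer` is not, which is the behaviour described above.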

As a result, I've been toying with the idea of using Inline::Ruby because one of my pet projects could benefit from this (yes Adrian, I hear you telling me to use the call stack :)

Wouldn't it be a bit ironic that I'd be forced to use Inline::Ruby to take advantage of the power of Ruby's regular expressions?


Ironic?

jplindstrom on 2004-09-09T21:51:05

If not for the kleptomaniac nature of the Perl language we wouldn't have all those nifty features to play with. Including regular expressions.

And the world rightfully became impressed with the Perl regexp engine. Good for them.

Why shouldn't we steal back recursive regexps from Ruby? (modulo implementation difficulties)

Yes, but..

Aristotle on 2004-09-10T15:55:16

It's also one of the slower engines, and at least in Ruby 1.6 its \G doesn't prohibit regex bump-along (it's "start of current match" rather than "end of last match"), which makes it relatively useless for writing complex parsers.

Personally, I'm waiting for Inline::Perl6 ;-)

Re:Yes, but..

Ovid on 2004-09-10T17:14:48

I didn't know about the \G issue, but the slowness doesn't faze me for one simple reason: slow but working versus fast but broken is a sure win in my book. :)

Re:Yes, but..

Aristotle on 2004-09-10T17:34:43

Of course! If you need recursive regexen, fast but broken is obviously useless. Just don't disregard that you can talk about working vs broken only with regard to recursion. For me, Ruby's engine is as useless because of its \G behaviour as Perl's engine is for you because of its non-reentrancy.

Tool for the job and all that I guess.. :-)

Re:Yes, but..

djberg96 on 2004-09-13T15:54:12

I'm curious about this. Do you have an example that demonstrates this? And does it behave the same in Ruby 1.8?

Re:Yes, but..

Aristotle on 2004-09-13T20:13:30

When trying to match abcde with /\Gx?/g, the first match is successful, because no x is found but the question mark allows zero characters to be consumed. This match ends zero characters into the string, that is, at start-of-string. In order to avoid infinite loops on zero-length matches, the engine then retries the match one position down the string.

In Perl, \G means end-of-last-match, and since end-of-last-match was at start-of-string, \G can't possibly match at one character into the string:

$ perl -le'$_="abcde"; s/\Gx?/!/g; print'
!abcde

In Ruby (both 1.6 and 1.8, I found), \G merely means start-of-current-match, which, of course, is satisfiable at that point:

$ ruby1.6 -e'puts "abcde".gsub(/\Gx?/,"!")'
!a!b!c!d!e!
$ ruby1.8 -e'puts "abcde".gsub(/\Gx?/,"!")'
!a!b!c!d!e!

Perl's \G is a powerful tool for writing parsers because the regex engine is prohibited from skipping characters to find a match: you can work your way through a string with a multitude of patterns using /gc (the /c avoids resetting the end-of-last-match on match failure), applied against the same string in turn, without them sabotaging each other.
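For comparison, a rough Ruby sketch of that style of parsing. It does not use \G at all; it swaps in StringScanner from Ruby's standard strscan library, whose scan method only matches at the current position and leaves the position untouched on failure, much like Perl's /gc. The token names and patterns are hypothetical.

```ruby
require 'strscan'

# Hypothetical token patterns, tried in turn at the current position.
TOKEN_PATTERNS = [
  [:number, /\d+/],
  [:ident,  /[a-z_]\w*/i],
  [:op,     /[-+*\/=]/],
  [:space,  /\s+/],
]

def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    kind = nil
    text = nil
    TOKEN_PATTERNS.each do |name, pattern|
      # scan is anchored: it never skips ahead to find a match,
      # and a failed attempt does not move the scan position.
      text = scanner.scan(pattern)
      if text
        kind = name
        break
      end
    end
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless kind
    tokens << [kind, text] unless kind == :space
  end
  tokens
end

p tokenize("x = 42 + y1")
# prints: [[:ident, "x"], [:op, "="], [:number, "42"], [:op, "+"], [:ident, "y1"]]
```

Because no pattern can bump along past unrecognised input, a stray character raises an error instead of being silently skipped.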