Perl5 Regex Engine Abstracted

brian_d_foy on 2006-12-04T15:44:00

With patch Change #29430 the regex engine should now be sufficiently abstracted that it is reasonably feasable to write a plugin to use a different regex engine in perl. The documentation for most of the interface is contained in the perlreguts module although one needs to understand the flags in the regexp.h in order to make it work.

The basic idea is that the existing perl engine and data structures have been ripped in half. The structures which the perl core must interact are now well defined and the parts that are "private" to a given regex engine implementation are now isolated from the core.

A big issue with Perl 5.8 and earlier Perls regex engine pluggability interface was that it was intended mostly for swapping in a DEBUG build of the real engine, with the expectation that any engine in use could cope with the data structures generated by any other engine. Additionally some of the management routines for regexps were harded coded into the core, with the expectation that the core routine could handle any engines data. All of these design issues meant that basically you couldnt plug an arbitrary engine into perl. You would have to apply core patches to do even the meanest implementation.

In the new scheme what Perl needs to know about a pattern, is defined by the struct regexp. What Perl needs to know about a regex engine is defined by the struct regexp_engine. All a plug in needs to do is create a regexp_engine that contains the appropriate callbacks and populate %^H appropriatly when the module is use()d. Perl will then use the compiler routine specified by the engine to create a regexp struct which it will then use as though it was its own. The regexp structure contains a pointer to the regexp_engine structure which created it, so the regexp is completely encapsulated. You can compile a regexp using one engine and pass it through to a routine expecting a normal regexp and everything would work out.

So now all we need to do is get some intrepid hacker to actually put all this to good use and write a plug in for another engine.

And you know what?

The first person that publishes a proper plug in for BleadPerl (soon to be Perl 5.10) will get a METRE OF BEER for doing so....

Ill give a bonus prize (to be determined) if the engine ported is the latest TCL engine.

So lets see how long it takes... :-)

Note: I'm going to publish this challenge elsewhere over the next few days. Thanks to AdamK for the idea of the prize.


0th post^Wentry!

audreyt on 2006-12-04T16:30:07

From #p5p:

17:21 <@audreyt> I need to sleep.
17:21 <@audreyt> and here is my entry:
17:21 <@audreyt> http://perlcabal.org/user/audreyt/tmp/re-engine-y2k-0.01.tar.gz
17:21 <@audreyt> enjoy :)
17:21 <@dmq> i think you win. :-)

Re:0th post^Wentry!

demerphq on 2006-12-04T16:45:23

Ha! Hilarious. Ok, so you win. But I guess I should have been more specific about what I meant.

So to clarify, to qualify for a prize you have to implement a real functional NON-PERL (language or core) regexp engine as a plug in.

Meaning, PCRE, the TCL 4 engine, PGE....

So the show ain't over folks! Theres still beer to be won.

Re:0th post^Wentry!

steph on 2006-12-04T20:19:12

Hi, I wonder what are the other dimensions to that metre...are they

* of capillary size?

* aproximately of a size corresponding to the cross-section of a monster bavarian mug? ...

* not specified? ... which will make us ...well very drunk...(and AdamK poorer) Actually in Madrid where I live more and more people use m when then mean m**2 especially when they talk about appartments)

I remember seeing mentioned in Jeffreys F. book that the latest engine of Henry Spencer (possibly used in >tcl8.4) was an hybrid DFA/NFA and as such did not suffer much when fed pathological entries (of the kind that bring pure NFA engines to their knees). Last time I check it seemed the work had been left unfinished, so I'm not sure what do the tcl people actually use. Is there a special reason why you mention the tcl libraries, besides of looking for a proof of concept for your latest work.

a few months ago I found a reference about SLK parsing and from there got to the "packrat" parsing entry in wikipedia...and became so interested that I actually started to dissect TheDamian's wonderful top-down parser generator to see what I could reuse ...and then real life and $work did not leave me a minute to spare

do you know of that article?, actually I should check audreyt blog's more often as it was mentioned there...as "regex parsing in linear time, woow" or something like that. Is Patrick's PGE using that algorithm?...are they plans to port it to parrot or perl5? I'd like to know as the revival of madrid.pm is under way and maybe I could use that meter of beer trick of yours to lure people in doing some interesting work

cheers and thanks for your work --stephan

Re:0th post^Wentry!

bart on 2006-12-04T20:51:55

Funny that you assume AdamK is paying for this metre of beer, too. That was only last time.

Re:0th post^Wentry!

demerphq on 2006-12-04T21:41:17

Is there a special reason why you mention the tcl libraries, besides of looking for a proof of concept for your latest work.

Because of the hoo-rahing that it got from Friedl. And because to the best of my knoweldge there are only a few C regexp engines up to the job, and the Tcl engine represents the one I know of that is furthest from what Perl itself does.

With regards to the rest, either I dont know, or we will see later. :-)

Re:0th post^Wentry!

belg4mit on 2006-12-04T21:50:46

Well Paul Miller might have a a different idea about that.

Re:0th post^Wentry!

Alias on 2006-12-05T01:53:32

And this folks, is why you need to be VERY specific in challenges for prizes.

No I'm not paying for this one :)

But my standard was to purchase crates/cartons/slabs of "your favourite non-rediculously-expensive beer" (as in whatever you like, but don't screw me over) :) and pile those crates up until they were a metre in height.

Of course, in retrospect I should have said "at least" or "no more than" :) So for stennie's CamelPack prize we approximated by going low and throwing in a six-pack.

But trust me, it's a LOT of beer :)

Bordering on too much actually, but certainly enough to give away free beer when people come to your house, or something...

I claim a winning entry. :-)

audreyt on 2006-12-05T06:22:36

It's on CPAN now as re::engine::PCRE.

You can also download it from http://perlcabal.org/user/audreyt/tmp/re-engine-PCRE-0.01.tar.gz.

Enjoy!

Re:I claim a winning entry. :-)

demerphq on 2006-12-05T14:12:42

Well, it looks like that one did it.

Congratulations Audrey!

Now it needs more tests and some loving care to go from proof of concept to fully usable but I think we have proved the framework is not crippled from the get go.

Thanks a lot, it looks like this is a great leap forward.