re2xs

Matts on 2006-08-11T15:19:29

I wrote a script that takes "regexp:string" pairs (one per line) and converts the regexps to re2c format [*] and builds an XS module out of them. I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.

Here's the code. Please feel free to play with it and let me know if it's useful - particularly if you can patch it to support more of the Perl regexp format.

You need a recent re2c to make it work. And don't expect your XS module to compile quickly if you have a lot of regexps. The 10k regexp test took over 3 hours to compile on my Core Duo 2Ghz.

[*] re2c uses an entirely different format for regexps than perl does. So the core of re2xs is a regexp parser which converts to this other format.


coooool

jmason on 2006-08-11T15:28:33

Thanks Matt -- I'm pretty excited about this code ;)

Example please?

bart on 2006-08-12T00:25:37

I stared at this piece of selfdocumenting code for a while:
        my ($regexp, $reason) = /^(.*):(.*)$/;
        $regexp =~ s/^\.\+//;
        print $re "\t", fixup_re($regexp), "               {return \"$reason\";}\n";
but I still can't figure out
  1. what the "reason" is for
  2. how to use the compiled resulting XS module from a Perl script

Could you please provide a 3 (or so) line data file that we could just compile and run in some little demo script? Just to show the intention of it all.

Re:Example please?

Matts on 2006-08-12T14:06:04

Yeah sorry - it's a hack that I didn't have any time to document.

Input is a file that looks like this:
(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
[0-9a-f]+\.myntet\.ac:generi c
[0-9]+[a-z]\.old\.myntet\.ac:generic
lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu
i.e. "regexp:reason"

Then to run, it's just:
use MyModule;
 
if (my $reason = MyModule::scan($string)) {
    print "Matched: $reason\n";
}

Re:Example please?

bart on 2006-08-12T22:55:52

OK, I tried it... and it works. Kind of — I had to delete the "time" in the system() call as Windows doesn't support it. Anyway, the sample compiled fast. Very fast. You scared me for no reason. :)

I'm somewhat disappointed with what the module can do. I was hoping to have a basis to reimplement URI::Find, thus: something that can find matches anywhere in a random text. There's two major reasons why it can't do that. First: it really is a lexer: it can only match prefixes in a string. To use your example:
(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
matches both "duc123.old.fagotten.ac" and "duc123.old.fagotten.acdc", but not "viaduc123.old.fagotten.ac". Second: it doesn't return what it matched, or even the length of the match, it just returns a "reason", in this case: the string "generic". That's not very useful, even for a lexer.

What I would really love to see, is a search for a substring, using a scheme that may resemble Boyer-Moore for skipping over uninteresting submatches: based on what was seen earlier, you just know some prefixes just can't match, and you can just skip them. Opposed to fixed search string, I'm guessing that this will be anything but trivial — what a regex matches has neither a fixed length nor fixed characters in each position. I have doubts that it's even possible, in the generic case.

I don't expect that re2c would support anything even remotely like it.

In summary: I find it simply amazing work how easy you make it to generate an XS module from code generated using an external tool. As for the limitations to make it really useful, for the applications I'm thinking of, are probably limitations in what re2c can do.

Re:Example please?

jmason on 2006-08-12T23:19:31

you might want to keep an eye on http://svn.apache.org/viewvc/spamassassin/branches/jm_re2c_hacks/rule2xs/ -- I'm hacking away on it for SpamAssassin, and I think with work you could probably find those features working there.

Matt, does it really support [classes], (alt|er|nations), and {quantifiers}? wow, I wasn't even expecting that!! holy crap.

Re:Example please?

Matts on 2006-08-13T00:21:57

To match anywhere in the string prefix with ".*". I haven't quite figured out how to tie to the end of the string yet, but it should be doable with what the code generates.

The limitation on just returning the reason is arbitrary - that's all I needed for the given problem domain, but you can definitely return "what matched" and "where in the string?". That should be a simple matter of programming.

(the long compile times are for when you have LOTS of regexps - I compile over 15k into one module).

update

jmason on 2006-08-25T11:51:56

Matt, you've seen this already -- but for random googlers who come across this page, http://taint.org/2006/08/17/125452a.html may be worth reading.

Basically, I investigated using re2c/re2xs as a speed-up engine for SpamAssassin, without much luck. The problem is that re2c can only track one regexp's state at a time, so overlapping regexps are handled inconsistently; users calling the code need to know in advance if one regexp is subsumed by another, otherwise the subsumed regexp will never match when the subsumer does. So re2c isn't useful for all use cases...

SA?

jmason on 2006-11-06T19:04:55

well, I spoke too soon -- it looks like it's pretty fast nowadays ;) consistently provides about 20% speedup for me, which is nice.

btw, I forgot to ask -- are you OK with licensing that script's code under the ASL for inclusion in SpamAssassin?