I wrote a script that takes "regexp:string" pairs (one per line) and converts the regexps to re2c format [*] and builds an XS module out of them. I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.
Here's the code. Please feel free to play with it and let me know if it's useful - particularly if you can patch it to support more of the Perl regexp format.
You need a recent re2c to make it work. And don't expect your XS module to compile quickly if you have a lot of regexps. The 10k regexp test took over 3 hours to compile on my Core Duo 2Ghz.
[*] re2c uses an entirely different format for regexps than perl does. So the core of re2xs is a regexp parser which converts to this other format.
but I still can't figure outmy ($regexp, $reason) =/^(.*):(.*)$/;
$regexp =~ s/^\.\+//;
print $re "\t", fixup_re($regexp), " {return \"$reason\";}\n";
Could you please provide a 3 (or so) line data file that we could just compile and run in some little demo script? Just to show the intention of it all.
Re:Example please?
Matts on 2006-08-12T14:06:04
Yeah sorry - it's a hack that I didn't have any time to document.
Input is a file that looks like this:i.e. "regexp:reason"(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
[0-9a-f]+\.myntet\.ac:generi c
[0-9]+[a-z]\.old\.myntet\.ac:generic
lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu
Then to run, it's just:use MyModule;
if (my $reason = MyModule::scan($string)) {
print "Matched: $reason\n";
}Re:Example please?
bart on 2006-08-12T22:55:52
OK, I tried it... and it works. Kind of — I had to delete the "time" in the system() call as Windows doesn't support it. Anyway, the sample compiled fast. Very fast. You scared me for no reason.:)
I'm somewhat disappointed with what the module can do. I was hoping to have a basis to reimplement URI::Find, thus: something that can find matches anywhere in a random text. There's two major reasons why it can't do that. First: it really is a lexer: it can only match prefixes in a string. To use your example:matches both "duc123.old.fagotten.ac" and "duc123.old.fagotten.acdc", but not "viaduc123.old.fagotten.ac". Second: it doesn't return what it matched, or even the length of the match, it just returns a "reason", in this case: the string "generic". That's not very useful, even for a lexer.(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
What I would really love to see, is a search for a substring, using a scheme that may resemble Boyer-Moore for skipping over uninteresting submatches: based on what was seen earlier, you just know some prefixes just can't match, and you can just skip them. Opposed to fixed search string, I'm guessing that this will be anything but trivial — what a regex matches has neither a fixed length nor fixed characters in each position. I have doubts that it's even possible, in the generic case.
I don't expect that re2c would support anything even remotely like it.
In summary: I find it simply amazing work how easy you make it to generate an XS module from code generated using an external tool. As for the limitations to make it really useful, for the applications I'm thinking of, are probably limitations in what re2c can do.Re:Example please?
jmason on 2006-08-12T23:19:31
you might want to keep an eye on http://svn.apache.org/viewvc/spamassassin/branches/jm_re2c_hacks/rule2xs/ -- I'm hacking away on it for SpamAssassin, and I think with work you could probably find those features working there.
Matt, does it really support [classes], (alt|er|nations), and {quantifiers}? wow, I wasn't even expecting that!! holy crap.Re:Example please?
Matts on 2006-08-13T00:21:57
To match anywhere in the string prefix with ".*". I haven't quite figured out how to tie to the end of the string yet, but it should be doable with what the code generates.
The limitation on just returning the reason is arbitrary - that's all I needed for the given problem domain, but you can definitely return "what matched" and "where in the string?". That should be a simple matter of programming.
(the long compile times are for when you have LOTS of regexps - I compile over 15k into one module).