Regex::PreSuf with my coffee

Phred on 2005-11-24T18:27:27

I was enjoying my coffee and doing some catch-up work this morning when my sysadmin friend IM'd me, asking if I knew of a way to generate Perl-compatible regular expressions from a set of words. A quick trip to the CPAN revealed Regex::PreSuf. A couple of minutes later I emailed him this script from the command line.

#!/usr/bin/env perl

use strict;
use warnings;

use Regex::PreSuf;

# Put in the words you want to match here
my @words = qw( foo bar blitz );

my $re = presuf( @words );
print "$re\n";

As a bonus, the docs say that the regexes generated are usually faster than using alternation. I can think of a few places in my code to refactor already :)
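For comparison, here is a minimal sketch of the plain alternation that the generated regex is meant to replace, using only core Perl. The variable names ($alt_src, $alt) and the anchoring with ^ and $ are illustrative assumptions, not something from the Regex::PreSuf docs. It makes no claim about speed either way; the core Benchmark module is one way to measure that on your own word list.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same word list as the script above
my @words = qw( foo bar blitz );

# Plain alternation: escape each word, then join the words with '|'
my $alt_src = join '|', map { quotemeta } @words;
my $alt     = qr/^(?:$alt_src)$/;

# Try a few candidate strings against the alternation
for my $candidate (qw( foo blitz bart )) {
    printf "%-6s %s\n", $candidate, $candidate =~ $alt ? 'matches' : 'no match';
}
```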


Don't forget about the alternatives...

bart on 2005-11-24T18:52:36

I've been told these are not just examples of the phenomenon known as "reinventing the wheel": the authors allegedly knew of Regex::PreSuf, and made improved versions. Hopefully. So, it might be worthwhile to actually compare these modules...

Re:

Aristotle on 2005-11-25T02:57:53

Beat me to the punch… Regexp::Assemble is the one I generally use.

Re:Don't forget about the alternatives...

grinder on 2005-11-27T11:01:33

As the author of Regexp::Assemble, let me weigh in:

Yes, I knew about Regex::PreSuf (and it is referenced in the SEE ALSO section of the documentation). R::PS doesn't deal with metacharacters, so something like a\d+b and a\s+d is going to produce a\[ds]+b, which won't even compile.
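A small core-Perl sketch of why that merged pattern is a problem. It takes the merged string exactly as quoted above (it does not call Regex::PreSuf) and compiles it inside eval, so the script reports the outcome either way. Whether a given Perl rejects the pattern outright or accepts it with a different meaning is something the script checks rather than assumes; the variable names are illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The two source patterns, kept apart as an ordinary alternation
my $separate = qr/a\d+b|a\s+d/;

# The merged string quoted above, held as text so a compile failure can be caught
my $merged_src = 'a\[ds]+b';
my $merged     = eval { qr/$merged_src/ };

for my $input ( 'a1b', 'a  d' ) {
    my $want = $input =~ $separate ? 'match' : 'no match';
    my $got  = !defined $merged    ? 'did not compile'
             : $input =~ $merged   ? 'match'
             :                       'no match';
    print "'$input': separate=$want merged=$got\n";
}
```

Either way, the merged form no longer matches the inputs the original two patterns were written for.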

Regexp::List, I knew about, but you'll forgive me if I can't quite recall why I discarded it when I evaluated it. I think it gets exponentially slower as the input list grows.

Regexp::Assemble comes with a number of scripts in the eg/ directory. One of them, assemble, lets you create the regexp from the command line.

Given a file containing the text:

Perl is a language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).

Then you can assemble a regular expression from the words without writing a scrap of code (apart, perhaps, from a one-liner to break the strings up into words):

perl -nle 'print lc $1 while /([a-z'"'"']+)/gi' perl.txt | assemble

Which produces:

(?:t(?:h(?:(?:os)?e|a[nt])|asks|ext|iny|o)|e(?:(?:fficie|lega)nt|xtracting|asy)|i(?:n(?:formation|tended)|(?:t')?s)|p(?:r(?:actical|inting)|erl)|m(?:an(?:agement|y)|inimal)|(?:complet|languag|us)e|b(?:e(?:autiful)?|ased)|a(?:rbitrary|lso|nd)?|s(?:canning|ystem)|r(?:eports|ather)|f(?:iles|rom|or)|o(?:ptimized|n)|good)

You can also tell it to put in zero-width lookahead assertions if you think it would make the pattern match (or fail) faster. Of course, if you know your input text contains no metacharacters, Regex::PreSuf is fine.
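For readers who would rather do the same thing inside a script than through the eg/assemble command, here is a rough sketch. It assumes Regexp::Assemble is installed from CPAN and uses only its new, add and as_string methods; the shortened sample text, the variable names and the ^...$ anchoring are illustrative choices, and the exact pattern printed may differ between module versions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Regexp::Assemble;    # CPAN module, not part of core Perl

# Shortened sample text (assumption: any text would do here)
my $text = 'Perl is a language optimized for scanning arbitrary text files';

my $ra = Regexp::Assemble->new;

# Same word extraction as the one-liner: lowercased runs of letters and apostrophes
$ra->add( lc $1 ) while $text =~ /([a-z']+)/gi;

# Assembled pattern as plain text, then compiled with anchors for whole-word tests
my $pattern = $ra->as_string;
my $re      = qr/^(?:$pattern)$/;

print "$pattern\n";
print "scanning: ", ( 'scanning' =~ $re ? 'matches' : 'no match' ), "\n";
```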

Re:Don't forget about the alternatives...

Phred on 2005-11-27T18:14:13

Thanks for weighing in! This looks like the industrial-strength solution I will put into production. I also need to dive into tries and get a good understanding of them. I like the as_string method for readability here.

Another fun morning with Perl and Coffee!