I was enjoying my coffee and doing some catch-up work this morning when my sysadmin friend IMed me asking if I knew of a way to generate Perl-compatible regular expressions from a set of words. A quick trip to the CPAN revealed Regex::PreSuf. A couple of minutes later I emailed him this script from the command line.
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Regex::PreSuf;

    # Put in the words you want to match here
    my @words = qw( foo bar blitz );

    my $re = presuf( @words );
    print $re;
As a bonus, the docs say that the regexes generated are usually faster than using alternation. I can think of a few places in my code to refactor already :)
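For comparison, here is a minimal sketch of the plain alternation that the generated regexes are meant to improve on. It uses only core Perl. The word list is the one from the script above; the candidate strings and the whole-string anchoring are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same illustrative word list as the script above
my @words = qw( foo bar blitz );

# Escape each word and put longer words first, so a shorter word
# cannot shadow a longer one that shares its prefix
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } @words;

# Compile once; \A and \z anchor the match to the whole string
my $re = qr/\A(?:$alternation)\z/;

# Hypothetical candidates, just to exercise the pattern
for my $candidate (qw( foo blitz baz )) {
    printf "%-5s %s\n", $candidate,
        ( $candidate =~ $re ? 'matches' : 'no match' );
}
```

Timing this against the presuf output with the core Benchmark module on your own word list is the honest way to check the "usually faster" claim for your data.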
Re:
Aristotle on 2005-11-25T02:57:53
Beat me to the punch… Regexp::Assemble is the one I generally use.
Re:Don't forget about the alternatives...
grinder on 2005-11-27T11:01:33
As the author of Regexp::Assemble, let me weigh in:
Yes, I knew about Regex::PreSuf (and it is referenced in the SEE ALSO section of the documentation). R::PS doesn't deal with metacharacters, so something like a\d+b and a\s+b is going to produce a\[ds]+b, which won't even compile.
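As a rough sketch of the difference, the same two patterns can be handed to Regexp::Assemble, which tokenizes metacharacters such as \d+ instead of splitting them into single characters. This assumes Regexp::Assemble is installed from CPAN; the candidate strings and the anchoring are illustrative choices, not part of the module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;

# Single-quoted so the backslashes reach the module untouched
$ra->add( 'a\d+b', 'a\s+b' );

# as_string returns the assembled pattern as plain text
my $pattern = $ra->as_string;
print "$pattern\n";

# Anchor it ourselves so only whole-string matches count
my $re = qr/\A(?:$pattern)\z/;

# Hypothetical candidates: two that should match, one that should not
for my $candidate ( 'a42b', 'a  b', 'a[ds]b' ) {
    printf "%-8s %s\n", $candidate,
        ( $candidate =~ $re ? 'matches' : 'no match' );
}
```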
Regexp::List, I knew about, but you'll forgive me if I can't quite recall why I discarded it when I evaluated it. I think it gets exponentially slower as the input list grows.
Regexp::Assemble comes with a number of scripts in the eg/ directory. One of them, assemble, lets you create the regexp from the command line.
Given a file containing the text:
Perl is a language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).

Then you can assemble a regular expression from the words without writing a scrap of code (apart, perhaps, from a one-liner to break the strings up into words):
    perl -nle 'print lc $1 while /([a-z'"'"']+)/gi' perl.txt | assemble

Which produces:
    (?:t(?:h(?:(?:os)?e|a[nt])|asks|ext|iny|o)|e(?:(?:fficie|lega)nt|xtracting|asy)|i(?:n(?:formation|tended)|(?:t')?s)|p(?:r(?:actical|inting)|erl)|m(?:an(?:agement|y)|inimal)|(?:complet|languag|us)e|b(?:e(?:autiful)?|ased)|a(?:rbitrary|lso|nd)?|s(?:canning|ystem)|r(?:eports|ather)|f(?:iles|rom|or)|o(?:ptimized|n)|good)

You can also tell it to put in zero-width lookahead assertions if you think it would make the pattern match (or fail) faster. Of course, if you know your input text contains no metacharacters, Regex::PreSuf is fine.
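The command-line pipeline above can also be approximated inside a script. This is a sketch only: it assumes Regexp::Assemble is installed, uses just the first sentence of the sample text as input, and is not the code of the eg/assemble script itself.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Assemble;

# Illustrative input: the first sentence of the sample text above
my $text = 'Perl is a language optimized for scanning arbitrary text files';

# Same word-splitting rule as the one-liner: letters and apostrophes, lowercased
my @words = map { lc($_) } $text =~ /([a-z']+)/gi;

my $ra = Regexp::Assemble->new;
$ra->add(@words);

# Fetch the assembled pattern once, as a string, and print it
my $pattern = $ra->as_string;
print "$pattern\n";

# Anchor the pattern so only whole words from the list match
my $re = qr/\A(?:$pattern)\z/;
for my $candidate (qw( scanning perl python )) {
    printf "%-9s %s\n", $candidate,
        ( $candidate =~ $re ? 'matches' : 'no match' );
}
```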
Re:Don't forget about the alternatives...
Phred on 2005-11-27T18:14:13
Thanks for the weigh in! This looks like the industrial strength solution I will put into production. I need to dive into tries also and get a good understanding of those. I like the as_string method for readability here.
Another fun morning with Perl and Coffee!