I've finished an alpha version of my Regexp::Token module and included Regexp::Token::HTML as a "starter kit", if you will. Basically, the module allows you to match arbitrarily defined tokens in addition to characters. Frankly, I don't know how useful people might find it. I've already had two comments from readers of my PerlMonks posting about this, both saying that they don't understand what I'm trying to do. I'm going to have to take some time to write this up more carefully and come up with some comparative examples.
my $p_token = Regexp::Token::HTML->create_token('');
my $p_tag   = Regexp::Token->create($p_token);
$html = <<END_HTML;
testing
end test
END_HTML
my ($result) = $html =~ /((?:$p_tag )+)/;
my $two_tags = q{
};
is($result, $two_tags, '... and we should be able to capture token text');
I'm also getting some weird errors from the module and I need to find out where my undefined errors are coming from. And if anyone is familiar with the pitfalls of forking code, I would love it if you could review what I'm doing and let me know about any dangers.
Update: A slightly updated version of Regexp::Token gets rid of the warnings and passes the tests much better. It also gets rid of an ugly hack and uses (?!) to fail a match.
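For readers who haven't seen the always-failing pattern mentioned in the update, here is a minimal sketch of the idiom, usually written (?!). This is not the module's code: the %token table and the token_pattern helper are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# (?!) is an empty negative lookahead. The empty pattern matches
# everywhere, so its negation can never match.
my $always_fails = qr/(?!)/;

print "no match\n" unless 'anything' =~ $always_fails;

# Hypothetical use: look up a sub-pattern by name and fall back to
# (?!) when the name is unknown, so the enclosing match fails cleanly
# instead of interpolating an undefined value.
my %token = (
    word   => qr/[a-z]+/,
    number => qr/\d+/,
);

sub token_pattern {
    my ($name) = @_;
    return exists $token{$name} ? $token{$name} : qr/(?!)/;
}

my $known   = token_pattern('number');
my $unknown = token_pattern('bogus');

print "known token matched\n"  if     'abc 42' =~ /\s$known/;
print "unknown token failed\n" unless 'abc 42' =~ /\s$unknown/;
```

Falling back to a pattern that cannot match keeps the failure inside the regex engine, which is presumably why it is tidier than a hand-rolled hack.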
The problem with doing the latter is that most things people want to parse are too sophisticated for parsing with regular expressions. The Perl 5 regex engine is more powerful than standard regular expressions and can match higher-level grammars, but I suspect that the regular expressions needed would become nasty and slow.
This is the reason people use hybrid parsers on programming languages. The tokens are matched with regular expressions, which are much faster and easier to understand with simple expressions. Assembling the tokens into a parse tree requires a more powerful parser to match the context-free grammar.
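To make the hybrid approach concrete, here is a rough sketch for a toy arithmetic language. Regular expressions split the input into tokens, and a small hand-written recursive-descent parser assembles them into a tree. The grammar, the token names (NUM, OP) and every sub name are assumptions chosen for the example; none of this comes from Regexp::Token.

```perl
use strict;
use warnings;

# Stage 1: simple regular expressions turn the input into tokens.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    while (1) {
        if    ($src =~ /\G\s+/gc)         { next }                        # skip whitespace
        elsif ($src =~ /\G(\d+)/gc)       { push @tokens, [ NUM => $1 ] }
        elsif ($src =~ /\G([-+*\/()])/gc) { push @tokens, [ OP  => $1 ] }
        elsif ($src =~ /\G\z/gc)          { last }                        # end of input
        else { die "Unexpected character at offset " . (pos($src) || 0) . "\n" }
    }
    return @tokens;
}

# Stage 2: a recursive-descent parser handles the context-free part
# (nested parentheses, operator precedence). Each sub consumes tokens
# from the front of the array reference it is given.
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUM | '(' expr ')'
sub parse_expr {
    my ($t) = @_;
    my $node = parse_term($t);
    while (@$t && $t->[0][0] eq 'OP' && $t->[0][1] =~ /^[-+]$/) {
        my $op  = shift(@$t)->[1];
        my $rhs = parse_term($t);
        $node = [ $op, $node, $rhs ];
    }
    return $node;
}

sub parse_term {
    my ($t) = @_;
    my $node = parse_factor($t);
    while (@$t && $t->[0][0] eq 'OP' && $t->[0][1] =~ m{^[*/]$}) {
        my $op  = shift(@$t)->[1];
        my $rhs = parse_factor($t);
        $node = [ $op, $node, $rhs ];
    }
    return $node;
}

sub parse_factor {
    my ($t) = @_;
    my $tok = shift @$t or die "Unexpected end of input\n";
    return $tok->[1] if $tok->[0] eq 'NUM';
    if ($tok->[1] eq '(') {
        my $node  = parse_expr($t);
        my $close = shift @$t;
        die "Expected ')'\n" unless $close && $close->[1] eq ')';
        return $node;
    }
    die "Unexpected token '$tok->[1]'\n";
}

# Walk the parse tree: leaves are numbers, inner nodes are [op, left, right].
sub evaluate {
    my ($n) = @_;
    return $n unless ref $n;
    my ($op, $l, $r) = ($n->[0], evaluate($n->[1]), evaluate($n->[2]));
    return $op eq '+' ? $l + $r
         : $op eq '-' ? $l - $r
         : $op eq '*' ? $l * $r
         :              $l / $r;
}

my @tokens = tokenize('2 * (3 + 4) - 5');
my $tree   = parse_expr(\@tokens);
die "Trailing tokens\n" if @tokens;
print evaluate($tree), "\n";    # 2 * 7 - 5 = 9
```

The token patterns stay trivial and fast, while the nesting that a plain regular expression cannot express lives entirely in the mutually recursive subs.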