HTML regex engine

Ovid on 2002-12-20T00:24:46

Well, no, I don't have an HTML regex engine, but I needed to do this:

s/$old_html/$new_html/g;

Any programmer who's worked with HTML for more than about three seconds has quickly discovered that this is not a viable option. Is the HTML well-formed? Does it have extra whitespace? Did they quote their attributes properly? Is the attribute case consistent? It's a frustration.
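To make the problem concrete, here is a minimal sketch using made-up snippets. The variable contents are hypothetical and only illustrate the point: two renderings of the same link that differ in case, whitespace, and quoting will not match literally.

```perl
use strict;
use warnings;

# Hypothetical snippets: the same link written two different ways.
my $old_html = '<a href="index.html">Home</a>';
my $page     = qq{<A  HREF='index.html'>Home</A>\n};

# \Q...\E keeps regex metacharacters in $old_html literal,
# so this is a plain substring-style match.
my $count = () = $page =~ /\Q$old_html\E/g;

print "literal matches: $count\n";    # prints "literal matches: 0"
```

A browser treats both forms as the same link, but the substitution never fires.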

Today, after a fair amount of searching and asking questions in the Perlmonks chatterbox, I discovered that I simply couldn't find an adequate tool to do this. Some of the treebuilder tools looked interesting, but the fine-grained control that I needed wasn't there, so I built such a tool. It's not done, but so far I can tell it to match a given document structure against another document structure and, if they match, replace the target HTML with the new HTML. I can ignore attributes entirely, require them to appear in a specific order, or ignore their order, as I wish. I'm now going to start working on the text matching portion. It's been fun. Perhaps this is a CPAN module in the works?

Of course, if my past experiences with use.perl are any indication, someone's going to say "here's what you were looking for". I'd welcome that as I'm curious to see how my tool stacks up.


tidy?

mir on 2002-12-20T00:38:54

As I stated on Perlmonks today ;--): Did you try tidy? http://tidy.sourceforge.net/

Re:tidy?

Ovid on 2002-12-20T01:46:00

I didn't see your reply on Perlmonks! In any event, I'm using HTML::TokeParser::Simple to get around the problems with bad HTML and it's worked fine. My basic method is to parse my sample HTML into a series of tokens that I store in an array. Then I do the same thing with the target HTML and if, at any point, I find a matching token stream, I do the replacement. So far, it's worked out much better than I thought.
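The token-stream approach described above might be sketched roughly like this. It is not the actual tool: the helper names (tokenize, replace_html) and the signature format are made up for illustration, and the sketch assumes attributes are ignored and whitespace in text is normalized.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Turn HTML into a list of { sig, raw } pairs. "sig" is a normalized
# signature used for comparison (undef for tokens we choose to ignore,
# such as whitespace-only text and comments); "raw" is the original text.
sub tokenize {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( \$html );
    my @tokens;
    while ( my $token = $parser->get_token ) {
        my $sig;
        if    ( $token->is_start_tag ) { $sig = 'S:' . $token->get_tag }
        elsif ( $token->is_end_tag )   { $sig = 'E:' . $token->get_tag }
        elsif ( $token->is_text ) {
            ( my $text = $token->as_is ) =~ s/\s+/ /g;
            $text =~ s/^ | $//g;
            $sig = "T:$text" if length $text;
        }
        push @tokens, { sig => $sig, raw => $token->as_is };
    }
    return @tokens;
}

# Replace every run of target tokens whose signatures match the
# signatures of $old_html with the literal string $new_html.
sub replace_html {
    my ( $target, $old_html, $new_html ) = @_;
    my @want = grep { defined $_ } map { $_->{sig} } tokenize($old_html);
    my @have = tokenize($target);
    my $out  = '';
    my $i    = 0;
  TOKEN:
    while ( $i < @have ) {
        if ( @want && defined $have[$i]{sig} ) {
            my ( $j, $k ) = ( $i, 0 );
            while ( $j < @have && $k < @want ) {
                my $sig = $have[$j]{sig};
                if ( defined $sig ) {
                    last if $sig ne $want[$k];
                    $k++;
                }
                $j++;    # ignored tokens inside a run are skipped over
            }
            if ( $k == @want ) {    # the whole pattern matched
                $out .= $new_html;
                $i = $j;
                next TOKEN;
            }
        }
        $out .= $have[$i]{raw};     # no match here: keep the original text
        $i++;
    }
    return $out;
}

# Hypothetical input: the link differs in case, spacing, and quoting.
my $page = qq{<p>Intro</p>\n<A  HREF='index.html'>\n  Home\n</A>\n<p>Bye</p>};
print replace_html(
    $page,
    '<a href="index.html">Home</a>',
    '<a href="home.html">Start</a>',
), "\n";
```

To make attributes significant, one option would be to fold the values from get_attr into the start-tag signature, sorting the attribute names first when their order should not matter.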