Scr.*?nscraping

TorgoX on 2004-12-08T02:08:08

Dear Log,

So when I go to write another HTML-scraper like this, I often start by copying a block of the HTML that I want to capture repetitions of up and down the template-generated page, and I paste it into the STDIN of this little utility. I hit return and control-D, and then out comes a big dumb regexp that loosely matches that piece of input. I take out the bits of text that I know will vary, and I replace them with (.*?) or the like, and voilà, screenscraper.
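For a rough idea of what such a utility might do, here is a minimal sketch. It is not the original code: the name snippet_to_regexp, the sample HTML, and the choice to turn every run of whitespace into \s+ are all assumptions made for illustration. The escaping step reuses the substitution quoted in the comments below.

```perl
use strict;
use warnings;

# Hypothetical helper (not the original utility): turn a pasted HTML
# snippet into the source of a regexp that loosely matches it.
sub snippet_to_regexp {
    my ($snippet) = @_;
    $snippet =~ s/^\s+//;    # trim leading whitespace
    $snippet =~ s/\s+$//;    # trim trailing whitespace
    # Backslash-escape the listed regexp metacharacters.
    $snippet =~ s/([\.\(\)\^\$\@\[\]\*\?\+\{\}\#\\])/\\$1/g;
    # Assumption: "loosely" means any run of whitespace matches any other.
    $snippet =~ s/\s+/\\s+/g;
    return $snippet;
}

# Made-up sample standing in for a block copied from a generated page.
my $sample  = qq{<td class="price">\n   \$19.99 (USD)</td>};
my $pattern = snippet_to_regexp($sample);
print "$pattern\n";

# The hand-editing step: the part that varies is replaced with (.*?).
my $edited  = '<td\s+class="price">\s+(.*?)\s+\(USD\)</td>';
my ($price) = $sample =~ /$edited/;
print "captured: $price\n";
```

To use it the way described above, read the snippet from STDIN instead of the hard-coded sample and print the result.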


That's really cool

samtregar on 2004-12-08T03:05:51

This one is definitely going in the toolbox. Thanks!

-sam

Re:

Aristotle on 2004-12-08T18:45:40

s/([\.\(\)\^\$\@\[\]\*\?\+\{\}\#\\])/\\$1/g;

Is there a difference from $_ = quotemeta $_;?

Re:

TorgoX on 2004-12-08T22:13:05

I think quotemeta is a bit more verbose. Like it seems to quote everything that's not \w.
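A quick way to see the difference, as a sketch: escape_listed is just a name made up here to wrap the substitution above, and the sample strings are invented. One thing it also shows is that the hand-written character class has no | in it, so a snippet containing | would come out unescaped, whereas quotemeta escapes it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub escape_listed {
    my ($text) = @_;
    $text =~ s/([\.\(\)\^\$\@\[\]\*\?\+\{\}\#\\])/\\$1/g;
    return $text;
}

my $html = '<a href="x.html">1 + 1</a>';

# Only the characters in the list get a backslash (here: . and +).
print escape_listed($html), "\n";

# quotemeta backslashes every ASCII character that is not [A-Za-z_0-9],
# so spaces, quotes, angle brackets and slashes are escaped too.
print quotemeta($html), "\n";

# The | character is not in the hand-written list.
print escape_listed('a|b'), "\n";
print quotemeta('a|b'), "\n";
```

Both escaped forms of $html match the original string when used as a pattern; the quotemeta one is just noisier to edit by hand.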

Re:

Aristotle on 2004-12-09T01:06:57

Ah, yes. Actually I just remembered that quotemeta is in fact problematic for this purpose because backslashes are not treated the same in regex quoting context vs double-quote quoting context. Which annoyed me when I was trying to interpolate user data into the s/// pattern in a string to be evaled.