Leave the HTML alone!

brian_d_foy on 2004-01-23T12:13:39

I have a lot of web scrapers that pull down information and display it to me so I don't have to do a lot of pointing-and-clicking and scrolling.

My Mac has not been on the network for a while, so when I tried out my scrapers the other day, most of them failed because the regular expressions no longer matched anything.

Debugging this while I am online at $6/hour costs a bit more than I want to pay, so I have a new trick.

All of these scrapers cache a lot of intermediate results in a hidden directory---so much so that some may think I overdo it a bit. Despite that, I have never saved the original web page.

I thought I could just save the pages from my browser, but the browser saves them as "Web Page Complete", which munges the HTML so that, when I look at a page offline, it finds all of the right supporting files on my computer. The HTML in "Web Page Complete" is not the HTML my regexen see.

I modified these programs to save the real HTML so I can look at it later, but now I am a bit sad because this used to be so easy, and now there are more layers to looking at the plain source.


Save as HTML

bart on 2004-01-23T16:04:05

If you want to save the original contents of web pages, instead of what browsers make of them, the simplest solution is a little script using LWP/LWP::Simple (getstore() is virtually ideal) or, if that's not smart enough, WWW::Mechanize.
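A minimal sketch of that idea, with a hypothetical URL and cache directory standing in for whatever the scraper actually targets: getstore() writes the raw response body straight to disk, byte for byte, with none of the link rewriting a browser's "Web Page Complete" does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);
use POSIX qw(strftime);

# Hypothetical URL and cache directory -- substitute your own.
my $url       = 'http://www.example.com/';
my $cache_dir = "$ENV{HOME}/.scraper_cache";

mkdir $cache_dir unless -d $cache_dir;

# Timestamp the filename so successive fetches don't clobber each other.
my $stamp = strftime('%Y%m%d-%H%M%S', localtime);
my $file  = "$cache_dir/page-$stamp.html";

# getstore() saves the untouched HTML exactly as the server sent it,
# which is exactly what the regexes will see later.
my $status = getstore($url, $file);
die "Fetch failed with status $status\n" unless is_success($status);

print "Saved original HTML to $file\n";
```

If you need cookies, form submission, or link following on top of that, WWW::Mechanize's get() plus save_content() cover the same ground.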

Re:Save as HTML

brian_d_foy on 2004-01-23T20:23:56

Indeed, and that is what I am doing.