Scraping HTML with XML::LibXML

rats on 2005-02-22T02:31:05

Writing a test script to hit a webpage and scrape out enough from the HTML response to verify it is correct...

The first test is to (stop and) start my fake xmlrpc server with the response file I want and confirm it's alive. Hmmm. The RPC::XML t/* tests do lots of that, so let's steal/borrow some code. Hmmm. The net is down (the firewall machine again, probably). minicpan to the rescue. minicpan has saved my bacon so many times I've lost count...

Well that was relatively painless. Randy J Ray writes nice clean intelligible Perl code. My script loads an RPC::XML::Server with the canned methods and forks it to a background process then gets a page from my web app to confirm the xmlrpc server is running correctly.
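The fork-a-canned-server setup might look something like this. This is a minimal sketch, not the actual script: the port, method name, and signature are my own placeholders, and the real version loads its methods from the canned response file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use RPC::XML::Server;

my $port = 9000;    # placeholder port

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: serve the canned methods until killed.
    my $srv = RPC::XML::Server->new(port => $port);
    $srv->add_method({
        name      => 'demo.echo',                       # invented method
        signature => ['string string'],
        code      => sub { my ($srv, $arg) = @_; $arg },
    });
    $srv->server_loop;    # never returns
    exit 0;
}

# Parent: give the child a moment to bind the port, run the
# web-app tests against it, then shut the server down.
sleep 1;
# ... fetch a page from the web app and check it here ...
kill 'TERM', $pid;
waitpid $pid, 0;
```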

Now comes the fun. I hate HTML scraping, but if I have to do it I really like to use XML::LibXML. Aside from being very fast at parsing (which isn't important for this app), it lets me use XPath notation to navigate the DOM tree and, even better, there's xsh to let me try out my XPaths interactively. Yes, it's possible to read the HTML source and keep track by hand of how many levels of table/tr/td deep you are, but why waste hours when xsh lets you do it in minutes?
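Once the XPath is worked out in xsh, the same navigation in a test script is just a findnodes() call. A minimal sketch, with an invented page fragment and path rather than the real app's markup:

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented xhtml fragment standing in for the webapp's output.
my $doc = XML::LibXML->new->parse_string(<<'XHTML');
<html><body>
  <table><tr>
    <td>one</td>
    <td><a href="/expand">expand</a></td>
  </tr></table>
</body></html>
XHTML

# findnodes() takes an XPath expression, so there's no need to
# walk the table/tr/td levels of the DOM by hand.
my ($link) = $doc->findnodes('/html/body/table/tr/td[2]/a');
print $link->getAttribute('href'), "\n";    # prints /expand
```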

Ouch! A small problem. LibXML expects xhtml and crashes all over the place when I ask xsh to parse the HTML output of my webapp. Lucky(!) for me (another reason for choosing CGI::Application) I have already moved all the HTML from the old webapp into HTML::Template templates, so it's really easy to rewrite them as xhtml using Vim. (I discovered after rewriting by hand that tidy has an --asxhtml option that outputs HTML as xhtml. Double d'oh!)

So now I've got clean xhtml output I can use xsh to navigate through the parsed tree and find the fields I expect to see in the page if the webapp is working correctly. The first one I want has an XPath of

/html/body/table/tr[2]/td[2]/form/a[6]
thanks to xsh. Glad I didn't have to work that one out by hand. It's a link to expand the tree. So I'll use WWW::Mechanize to click the link, grab the response, and verify that it returned the required number of tables in the correct order with the correct contents. And then on to the next test...
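A sketch of that click-and-verify step, assuming an invented URL and an illustrative table count (the real test would also check the order and contents of each table):

```perl
use strict;
use warnings;
use WWW::Mechanize;
use XML::LibXML;

my $mech = WWW::Mechanize->new;
$mech->get('http://localhost/myapp');    # placeholder URL
die 'page fetch failed' unless $mech->success;

# Follow the "expand the tree" link - the sixth <a> found at
# /html/body/table/tr[2]/td[2]/form/a[6] above.
$mech->follow_link(n => 6);

# Re-parse the xhtml response and count the tables we expect.
my $doc    = XML::LibXML->new->parse_string($mech->content);
my @tables = $doc->findnodes('//table');
die 'wrong number of tables' unless @tables == 3;    # count is illustrative
```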


Parsing HTML with LibXML

grantm on 2005-02-22T03:26:18

LibXML expects xhtml and crashes all over the place when I ask xsh to parse the HTML output of my webapp

You know you can use parse_html* methods and set ->recover(1) to parse poorly formed HTML, right? I don't know if xsh supports this but if not, it should be easy to hack in.
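For reference, a minimal sketch of that recover-mode idea (the tag-soup input here is invented):

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->recover(1);    # warn about markup errors instead of dying

# Note the unclosed <td> and <tr> - the strict XML parse_string()
# would refuse this, but the HTML parser in recover mode carries on.
my $doc = $parser->parse_html_string(
    '<html><body><table><tr><td>messy</table></body></html>'
);
print $doc->findvalue('//td'), "\n";    # prints messy
```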

Re:Parsing HTML with LibXML

tomhukins on 2005-02-22T14:35:39

xsh does indeed work fine with LibXML's recover mode. Type help recovering and help open in the shell for details.

Re:Parsing HTML with LibXML

rats on 2005-02-22T23:05:07

'Crash' was being too harsh. xsh (i.e. LibXML) actually spits out a warning for each error in the HTML/XML with recover on. I *want* those errors to display because I am able to fix them in the production templates.

But thank you (and grantm) for the heads up.