Who needs SOAP!

Matts on 2002-04-24T13:23:14

XPath is just great at screen scraping, especially when combined with libxml2's xmllint tool for turning HTML into XML...

Here's the current temperature in London:

$ xmllint --html --format http://www.bbc.co.uk/weather/5day.shtml?world=0008 |
  xpath 'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))'


(The above finds all the <tr>s whose text content starts with "Temperature" (of which there are two on that page), then takes the second of those (which is the current temperature), and then does a normalize-space on its string value (which basically strips all the tags).)
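
For anyone who finds the XPath hard to read, here is a rough sketch of the same three steps in Python, using only the standard library. The SAMPLE markup is made up for illustration (it is not the real BBC page), and the helper names normalize_space, string_value and second_temperature_row are hypothetical stand-ins for what XPath does natively.

```python
import xml.etree.ElementTree as ET

# Made-up, already well-formed stand-in for the page (NOT the real BBC markup).
SAMPLE = """
<html><body><table>
  <tr><td>Temperature</td><td>(max) <b>18</b> / 64</td></tr>
  <tr><td>Wind</td><td>SW 10mph</td></tr>
  <tr><td>  Temperature
      </td><td><b>16</b> / 60</td></tr>
</table></body></html>
"""

def normalize_space(text):
    # Same idea as XPath normalize-space(): trim and collapse whitespace runs.
    return " ".join(text.split())

def string_value(elem):
    # Same idea as XPath string(): concatenate all descendant text, dropping tags.
    return "".join(elem.itertext())

def second_temperature_row(root):
    # Step 1: string value of every <tr>, whitespace-normalized, in document order.
    rows = [normalize_space(string_value(tr)) for tr in root.iter("tr")]
    # Step 2: keep the rows whose text starts with "Temperature".
    matches = [r for r in rows if r.startswith("Temperature")]
    # Step 3: XPath [2] is 1-based, so the second match is Python index 1.
    return matches[1]

root = ET.fromstring(SAMPLE)
print(second_temperature_row(root))  # prints: Temperature 16 / 60
```

The XPath one-liner is obviously shorter; the point here is only to spell out what each part of it does.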

I personally think using XPath for screen scraping is a bit easier than other methods of doing the same, and possibly safer too. Plus you can quite nicely apply this technique to all sorts of useful systems.


HTML is not an interface!

mir on 2002-04-24T13:46:53

I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change. In fact they usually change it quite often, as you have to keep web designers busy ;--( OTOH if they set up a SOAP server you might have a better chance of the interface being more stable, or at least of your script complaining about a change, instead of breaking silently. Upgrading should be easier too.

Not that I like SOAP myself, mind you...
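
To make the "breaking silently" worry concrete: a scraper can at least check its own assumptions and complain when the page stops matching them. The following is only a sketch in Python under stated assumptions; the LayoutChanged exception, the current_temperature function and the "exactly two Temperature rows" rule are all hypothetical, taken from the description of the page earlier in this thread rather than from any real API.

```python
import xml.etree.ElementTree as ET

class LayoutChanged(Exception):
    """Raised when the page no longer looks the way the scraper expects."""

def current_temperature(xml_text):
    root = ET.fromstring(xml_text)
    # Whitespace-normalized text of every <tr>, with the tags stripped.
    rows = [" ".join("".join(tr.itertext()).split()) for tr in root.iter("tr")]
    matches = [r for r in rows if r.startswith("Temperature")]
    # Assumption: the page has exactly two such rows and the second is current.
    if len(matches) != 2:
        raise LayoutChanged(f"expected 2 Temperature rows, found {len(matches)}")
    return matches[1]

# A redesigned page with only one matching row now fails loudly.
redesigned = "<table><tr><td>Temperature 16 / 60</td></tr></table>"
try:
    current_temperature(redesigned)
except LayoutChanged as err:
    print("scraper needs attention:", err)
```

This does not make HTML an interface, but it turns a silent wrong answer into a visible error.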

Re:HTML is not an interface!

Matts on 2002-04-24T14:12:13

The problem is SOAP doesn't always exist.

Re:HTML is not an interface!

ziggy on 2002-04-24T14:27:35

"I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change."

I'm not so sure Matt is playing devil's advocate. I think he's got his pragmatist hat placed squarely upon his head.

It would be nice from an ideological point of view if this information were in a constant format (XML-RPC, SOAP, plain XML or even a reasonably static XHTML layout). Realistically, that's not going to happen on a large scale any time soon, regardless of what the SOAP-hype would have you believe.

We can't take screen scrapers out of our toolkit. And we need better screen scrapers.

Re:HTML is not an interface!

darobin on 2002-04-24T15:33:00

Hmmmm, this whole thing is starting to make me wonder if it's not time that I should grab my old XML+CSS bat out of the cupboard and start practising a few swings... wouldn't it indeed be cool if REST style services were both human and computer readable?

Who knows, maybe this time around it won't be just Simon St. Laurent and/or me vs. xml-dev...

I'm not sure what this was meant to do

2shortplanks on 2002-04-24T15:15:29

but this is what I get
http://www.bbc.co.uk/weather/5day.shtml?world=0008:115: error: htmlParseEntityRef: no name
Helvetica" SIZE="2"><a href="/weather/sports/index.shtml" class="index">Sport
                                                                              ^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:122: error: htmlParseEntityRef: expecting ';'
    <FONT FACE="Arial, Helvetica" SIZE="2"><a href="/cgi-bin/forums/cgi?fid=750&pg
                                                                              ^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:136: error: Attribute style redefined
llspacing="0" cellpadding="0" border="0" style="margin:0px;" style="float:left;
                                                                              ^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:213: error: htmlParseEntityRef: expecting ';'
            <LI><A HREF="/cgi-bin/weather/setcookies.pl?0008&setweathercookie"><FO
                                                                          ^
Query didn't return a nodeset. Value: Temperature (&#65533;&#65533;C / &#65533;&#65533;F) 16 / 60
Is this meant to mean 16C / 60F?

Re:I'm not sure what this was meant to do

Matts on 2002-04-24T15:30:51

"Is this meant to mean 16C / 60F?"

Yes.

The other stuff I'm not sure how to turn off in xmllint. Perhaps 2>/dev/null ;-)
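
If the scraping is driven from a script rather than typed at the shell, the same "throw stderr away" idea looks roughly like this in Python. This is a sketch only: CHILD is a hypothetical noisy child process standing in for the real xmllint pipeline, so the example runs without network access or xmllint installed.

```python
import subprocess
import sys

# Hypothetical stand-in for the noisy command: it writes a parser-style
# complaint to stderr and the useful value to stdout.
CHILD = (
    'import sys; '
    'sys.stderr.write("error: htmlParseEntityRef: no name\\n"); '
    'print("Temperature 16 / 60")'
)

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    stdout=subprocess.PIPE,      # keep the useful output
    stderr=subprocess.DEVNULL,   # same effect as 2>/dev/null in the shell
    text=True,
    check=True,
)
print(result.stdout.strip())  # prints: Temperature 16 / 60
```

Discarding stderr also hides real failures, so it may be better to capture it and log it somewhere instead.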

Re:I'm not sure what this was meant to do

ziggy on 2002-04-24T15:46:04

Try using HTML Tidy to convert HTML to XML before passing it through xpath. It'll likely generate some errors, but it tends to be more forgiving with HTML.

xpath...

pault12 on 2002-04-24T17:43:06

hrgab - that's the way I read the internet ;-)



XSLScript (xpath) + Chunks + SQL + perl



XPath is good for trees, but it sucks with 'flat' things (such as mixed content). It can be improved (see BiXpath, which is kinda 'derived' from Perl regexps).



Overall - I agree that XPath is the best existing thing.