XPath is just great at screen scraping, especially when combined with libxml2's xmllint tool for turning HTML into XML...
Here's the current temperature in London:
$ xmllint --html --format 'http://www.bbc.co.uk/weather/5day.shtml?world=0008' | xpath 'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))'
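For anyone who wants to try the idea without hitting the live page, here is a minimal sketch against a local file. The sample.html markup and the scrape_temperature function name are made up for illustration and are not the BBC's real layout. It also assumes an xmllint new enough to have the --xpath option, which stands in for the separate xpath tool, and it sends parser complaints on stderr to /dev/null.

```shell
# Hypothetical stand-in for the forecast page (NOT the real BBC markup).
cat > sample.html <<'EOF'
<html><body><table>
<tr><td>Temperature</td><td>Day</td></tr>
<tr><td>Temperature (C / F)</td> <td>16 / 60</td></tr>
</table></body></html>
EOF

# Print the text of the second table row whose text starts with "Temperature".
# Assumes xmllint supports --xpath; HTML parser warnings on stderr are discarded.
scrape_temperature() {
  xmllint --html --xpath 'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))' "$1" 2>/dev/null
}

scrape_temperature sample.html
```

The [2] picks the second matching row, just as in the one-liner above, so it is exactly the part that breaks first when the page layout changes.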
I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change. In fact they usually change it quite often, as you have to keep web designers busy.
Not that I like SOAP myself, mind you...
Re:HTML is not an interface!
Matts on 2002-04-24T14:12:13
The problem is SOAP doesn't always exist.
Re:HTML is not an interface!
ziggy on 2002-04-24T14:27:35
I'm not so sure Matt is playing devil's advocate. I think he's got his pragmatist hat placed squarely upon his head.
"I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change."
It would be nice from an ideological point of view if this information were in a constant format (XML-RPC, SOAP, plain XML or even a reasonably static XHTML layout). Realistically, that's not going to happen on a large scale any time soon, regardless of what the SOAP-hype would have you believe.
We can't take screen scrapers out of our toolkit. And we need better screen scrapers.
Re:HTML is not an interface!
darobin on 2002-04-24T15:33:00
Hmmmm, this whole thing is starting to make me wonder if it's not time that I should grab my old XML+CSS bat out of the cupboard and start practising a few swings... wouldn't it indeed be cool if REST style services were both human and computer readable?
Who knows, maybe this time around it won't be just Simon St. Laurent and/or me vs. xml-dev...
Is this meant to mean 16C / 60F?
http://www.bbc.co.uk/weather/5day.shtml?world=0008:115: error: htmlParseEntityRef: no name
Helvetica" SIZE="2"><a href="/weather/sports/index.shtml" class="index">Sport
^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:122: error: htmlParseEntityRef: expecting ';'
<FONT FACE="Arial, Helvetica" SIZE="2"><a href="/cgi-bin/forums/cgi?fid=750&pg
^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:136: error: Attribute style redefined
llspacing="0" cellpadding="0" border="0" style="margin:0px;" style="float:left;
^
http://www.bbc.co.uk/weather/5day.shtml?world=0008:213: error: htmlParseEntityRef: expecting ';'
<LI><A HREF="/cgi-bin/weather/setcookies.pl?0008&setweathercookie"><FO
^
Query didn't return a nodeset. Value: Temperature (°C / °F) 16 / 60
Re:I'm not sure what this was meant to do
Matts on 2002-04-24T15:30:51
"Is this meant to mean 16C / 60F?"
Yes.
The other stuff I'm not sure how to turn off in xmllint. Perhaps 2>/dev/null ;-)
Re:I'm not sure what this was meant to do
ziggy on 2002-04-24T15:46:04
Try using HTML Tidy to convert HTML to XML before passing it through xpath. It'll likely generate some errors, but it tends to be more forgiving with HTML.