One of the themes I heard consistently at YAPC was that every XML hacker who learned XPath said their life had changed. I played a little with XPath last night while writing the XML chapter for Cookbook 2ed, and I have to say--they're right. XPath is incredible.
For those of you who don't know, XPath provides a kind of structural regular expression for finding the bits of an XML parse tree that you're interested in. For example, here's the XPath for "the href attributes of a tags":

    //a/@href

The double slash means "anywhere in the tree" (a single slash would mean you're giving a fully-qualified path to a node, starting from the root element) and @ indicates you're talking about an attribute and not an element.
Put in context, here's a program to print all the href attribute values of a tags:

    #!/usr/bin/perl -w
    use XML::LibXML;
    use strict;

    my $parser = XML::LibXML->new;
    my $dom = $parser->parse_fh(\*DATA);
    my @nodes = $dom->findnodes('//a/@href');
    foreach my $node (@nodes) {
        print $node->value, "\n";
    }
    __END__
    The perl.org web site is much sexier than that of Python.

That's so much easier than walking trees! Of course, an event-based parser could have handled this easily too. But when you have more complex requirements for things to extract, XPath even beats SAX for convenience.
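To try the same query on a self-contained document, here is a minimal sketch. The sample markup and both URLs are assumptions made up for illustration, and it uses load_xml on a string rather than reading from DATA.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document: the markup and URLs are assumptions.
my $xml = <<'XML';
<p>The <a href="http://www.perl.org/">perl.org web site</a> is much sexier
than <a href="http://www.python.org/">that of Python</a>.</p>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# findnodes returns attribute nodes here; value gives each one's string value.
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');

print "$_\n" for @hrefs;
```

With this sample it should print the two href values, one per line, in document order.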
For example, let's find the href attributes of a tags whose text contains "perl":

    //a[contains(text(), "perl")]/@href

We can even go back up the tree, for example to find the p elements containing links:

    //a/ancestor::p

XPath expressions get hairy fast, like regular expressions:

    //a[contains(text(), "perl")]/ancestor::p

Like Perl 6 regular expressions, XPath expressions can be padded with whitespace (also similar, there are a few places where whitespace is significant):

    //a [ contains( text(), "perl" ) ] /ancestor::p

The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath. The two pages I tried to parse were the O'Reilly Catalog and the use.perl homepage, and both had badly-formed HTML (overlapping tags, etc.). The W3C HTML validator had conniptions, as did XML::LibXML. I guess for real-world (i.e., broken) HTML, you still have to use regexps and the parsers described in TorgoX's book.
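Going back to the contains() and ancestor:: expressions, here is a small sketch for trying them out. The document, the id attributes, and the URLs are hypothetical and exist only so the results are easy to tell apart. Note that contains() is case-sensitive, so "perl" will not match "Perl".

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: only the first paragraph has a link whose text mentions "perl".
my $xml = <<'XML';
<doc>
  <p id="one">See the <a href="http://www.perl.org/">perl.org web site</a>.</p>
  <p id="two">Or visit <a href="http://www.python.org/">the Python site</a>.</p>
</doc>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# href values of links whose text contains "perl"
my @perl_hrefs = map { $_->value }
    $dom->findnodes('//a[contains(text(), "perl")]/@href');

# every p element that has a link somewhere inside it
my @all_ps = $dom->findnodes('//a/ancestor::p');

# the whitespace-padded form, restricted to "perl" links
my @perl_ps = $dom->findnodes('//a [ contains( text(), "perl" ) ] /ancestor::p');

print "perl links: @perl_hrefs\n";
print "paragraphs with links: ", scalar(@all_ps), "\n";
print "paragraphs with perl links: ",
    join(', ', map { $_->getAttribute('id') } @perl_ps), "\n";
```

The padded and unpadded forms of the expression should select the same nodes; the spacing is there for the reader.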
--Nat
> XPath is incredible.

Yep. Wrap your brain around this hack:

    document(//a/@href[contains(., '.html')])/html/head/title

In the context of an XSLT stylesheet (or something that provides the document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page.
A spider. In one expression.
Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)
> The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath.

Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
Re:XPath
gnat on 2003-07-01T23:15:48
Wow, tidy is great! Thanks for the tip! (Morbus, you getting this for Spidering Hacks? :-)
--Nat
Re:XPath
deltab on 2003-07-02T18:45:51
Dammit, this is the secret I was going to reveal under the heading "When XPath won't work (and how to make it work anyway)".
Re:XPath
johnwcowan on 2004-10-25T03:15:04
You can use TagSoup (http://tagsoup.info), my SAX parser for HTML. I also have a version of Saxon 6 packaged with TagSoup for XSLT-ing arbitrary HTML.
Re:Perl XPath functions
gnat on 2003-07-02T06:38:27
Thanks for volunteering to send me sample code :-)
--Nat
Re:LibXML and HTML
gnat on 2003-07-02T07:57:36
I wasn't sufficiently clear in my original message. I was trying the parse_html_* methods in XML::LibXML and they were whining about broken HTML in the two pages I was playing with. So I said "screw it" and went back to parsing those with HTML::* modules.
--Nat
Re:LibXML and HTML
gav on 2003-07-02T11:34:59
Doh. HTML parsers that can't parse broken HTML aren't that useful :)
Have you tried HTML::TreeBuilder with Class::XPath?
Re:LibXML and HTML
gnat on 2003-07-02T15:45:02
I haven't, but boy that's really cute. I was wondering the other day whether there were more general XPath modules available. You know, with a little optimization (the ability to search a tree once but have multiple possible XPath expressions and associated actions to run at each step), you could use XPath as the basis for your optimizer--write XPath expressions for the things to optimize.
Ah yes, I've known about XPath for three days. Why wouldn't I assume I've had an original thought :-)
--Nat
Re:LibXML and HTML
Matts on 2003-07-02T16:47:39
    $parser->recovery(1)

Fixes that problem.
Re:LibXML and HTML
gnat on 2003-07-02T17:25:28
Well, bollocks :-) I'd even seen that option in the manpage. This is what comes of doing your work at 3am, I guess ... Thanks!
--Nat
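For anyone following the thread, the recovery tip can be sketched roughly like this. The broken HTML string and its URL are made up for illustration. As far as I know, current XML::LibXML documentation spells the option recover (with recover_silently to also suppress the warnings), so that spelling is used here; treat it as an assumption about your installed version.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical broken HTML: overlapping <b>/<i> tags and an unclosed <p>.
my $html = '<html><body><p><b><i>bold italic</b></i>'
         . '<a href="http://www.perl.org/">perl</a></body></html>';

my $parser = XML::LibXML->new;

# Keep parsing past errors without printing warnings;
# recover(1) does the same but reports the problems on STDERR.
$parser->recover_silently(1);

my $doc = $parser->parse_html_string($html);

# The recovered tree can be queried with XPath like any other document.
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');

print "$_\n" for @hrefs;
```

The design choice is just to let libxml2 repair what it can and then query the resulting tree; how faithful that tree is to the page depends on how badly the markup is broken.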