XPath

gnat on 2003-07-01T21:25:54

One of the themes I heard consistently at YAPC was that every XML hacker who learned XPath said their life had changed. I played a little with XPath last night while writing the XML chapter for Cookbook 2ed, and I have to say--they're right. XPath is incredible.

For those of you who don't know, XPath provides a kind of structural regular expression for finding the bits of an XML parse tree that you're interested in. For example, here's the XPath for "the href attributes of a tags":

//a/@href
The double slash means "anywhere in the tree" (a single slash would mean you're giving a fully-qualified path to a node, starting from the root element) and @ indicates you're talking about an attribute and not an element.

Put in context, here's a program to print all the href attribute values of a tags:

#!/usr/bin/perl -w

use XML::LibXML;
use strict;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_fh(\*DATA);
my @nodes = $dom->findnodes('//a/@href');

foreach my $node (@nodes) {
  print $node->value, "\n";
}

__END__
<p>The <a href="http://www.perl.org/">perl.org</a> web site is much
sexier than that of <a href="http://www.python.org/">Python</a>.</p>

That's so much easier than walking trees! Of course, an event-based parser could have handled this easily too. But when you have more complex requirements for things to extract, XPath even beats SAX for convenience.

For example, let's find the href attributes of a tags whose text contains "perl" (note that contains() is case-sensitive):

//a[contains(text(), "perl")]/@href
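Here's a minimal sketch of that predicate in action, using the same XML::LibXML API as the program above; the sample document and its URLs are made up for illustration:

```perl
#!/usr/bin/perl -w
# Sketch: pull out hrefs of links whose text mentions "perl".
use strict;
use XML::LibXML;

my $xml = <<'XML';
<p>Read <a href="http://use.perl.org/">use.perl</a> or
<a href="http://www.python.org/">python.org</a>.</p>
XML

my $dom = XML::LibXML->new->parse_string($xml);

# contains() is case-sensitive, so "perl" would not match "Perl".
my @hrefs = map { $_->value }
            $dom->findnodes('//a[contains(text(), "perl")]/@href');
print "$_\n" for @hrefs;    # only the use.perl link matches
```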
We can even go back up the tree, for example to find the p elements containing links:

//a/ancestor::p
XPath expressions get hairy fast, like regular expressions:
//a[contains(text(), "perl")]/ancestor::p
Like Perl 6 regular expressions, XPath expressions can be padded with whitespace (and, also like them, there are a few places where whitespace is significant):
//a
  [ contains( text(), "perl" ) ]
/ancestor::p
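To show the combined expression doing its thing, here's a small sketch; the id attributes on the sample paragraphs are only there so we can see which one matched:

```perl
#!/usr/bin/perl -w
# Sketch: climb back up from matching links to their enclosing p elements.
use strict;
use XML::LibXML;

my $xml = <<'XML';
<doc>
  <p id="one">See <a href="http://www.perl.org/">perl.org</a>.</p>
  <p id="two">Nothing to see here.</p>
</doc>
XML

my $dom   = XML::LibXML->new->parse_string($xml);
my @paras = $dom->findnodes('//a[contains(text(), "perl")]/ancestor::p');
print $_->getAttribute('id'), "\n" for @paras;    # prints "one"
```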
The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath. The two pages I tried to parse were the O'Reilly Catalog and the use.perl homepage, and both had badly-formed HTML (overlapping tags, etc.). The W3C HTML validator had conniptions, as did XML::LibXML. I guess for real-world (i.e., broken) HTML, you still have to use regexps and the parsers described in TorgoX's book.

--Nat


XPath

ziggy on 2003-07-01T21:45:06

XPath is incredible.
Yep. Wrap your brain around this hack:
document(//a/@href[contains(., '.html')])/html/head/title
In the context of an XSLT stylesheet (or something that provides the document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page.

A spider. In one expression.

Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)
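Sketched out as a full stylesheet, that might look like the following. This is a hypothetical reconstruction, not ziggy's actual code, and it assumes an XSLT 1.0 processor whose document() can resolve and fetch the hrefs (which must also point at well-formed pages):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- document() on a node-set fetches and parses every URL in it -->
    <xsl:for-each select="document(//a/@href[contains(., '.html')])">
      <!-- inside the loop, "/" is the root of each fetched page -->
      <xsl:value-of select="/html/head/title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```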

The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath.
Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
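The tidy step might look something like this; flag spellings vary a little between HTML Tidy releases, so check your local man page:

```shell
# Sketch: let tidy turn tag soup into well-formed XML first,
# then point XML::LibXML at the result.
#   -q      quiet (no summary chatter)
#   -asxml  emit XHTML, i.e. something an XML parser will accept
tidy -q -asxml broken.html > clean.xml
```

Note that tidy exits non-zero when it merely issues warnings, so don't treat its exit status as fatal in a pipeline.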

Re:XPath

gnat on 2003-07-01T23:15:48

Wow, tidy is great! Thanks for the tip!

(Morbus, you getting this for Spidering Hacks? :-)

--Nat

Re:XPath

deltab on 2003-07-02T18:45:51

Dammit, this is the secret I was going to reveal under the heading "When XPath won't work (and how to make it work anyway)".

Re:XPath

johnwcowan on 2004-10-25T03:15:04

You can use TagSoup (http://tagsoup.info), my SAX parser for HTML. I also have a version of Saxon 6 packaged with TagSoup for XSLT-ing arbitrary HTML.

Perl XPath functions

garron on 2003-07-01T22:16:23

Nat -

Don't forget to talk about XML::LibXSLT's ability to write and register XPath extension functions written in Perl. :-)

Of course 1.53 has memory bugs, but if you get Matt's CVS copy, you can have Perl callbacks from XSLT. This is incredibly useful; say you want to access Apache request objects from XSLT, using closures, in a handler().

    $xslt->register_function($urn, 'get_request', sub { &get_request($self,@_) } );

Write get_request() to handle the arguments of an XPath function (which can be strings or XML), and return strings or XML fragments from the subroutine to the XSLT XPath function. I use it to get parameters, set cookies, log in users, etc. It's transparent as far as the XSLT goes.
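For what it's worth, here's a minimal, self-contained sketch of the registration pattern. The urn and the greet() function are invented stand-ins for get_request(), and register_function() is written as the class-method call from the XML::LibXSLT docs:

```perl
#!/usr/bin/perl -w
# Sketch: register a Perl sub as an XPath extension function,
# then call it from a stylesheet via its namespace prefix.
use strict;
use XML::LibXML;
use XML::LibXSLT;

my $urn = 'urn:example:my-extensions';    # made-up namespace URN
XML::LibXSLT->register_function($urn, 'greet',
    sub { my ($who) = @_; return "hello, $who" });

my $style = XML::LibXML->load_xml(string => <<'XSLT');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="urn:example:my-extensions">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="my:greet(string(/doc))"/>
  </xsl:template>
</xsl:stylesheet>
XSLT

my $xslt   = XML::LibXSLT->new;
my $sheet  = $xslt->parse_stylesheet($style);
my $result = $sheet->transform(XML::LibXML->load_xml(string => '<doc>world</doc>'));
print $sheet->output_string($result), "\n";    # hello, world
```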

Re:Perl XPath functions

gnat on 2003-07-02T06:38:27

Thanks for volunteering to send me sample code :-)

--Nat

LibXML and HTML

gav on 2003-07-01T22:52:18

See the parse_html_* methods in LibXML.

Re:LibXML and HTML

gnat on 2003-07-02T07:57:36

I wasn't sufficiently clear in my original message. I was trying the parse_html_* methods in XML::LibXML and they were whining about broken HTML in the two pages I was playing with. So I said "screw it" and went back to parsing those with HTML::* modules.

--Nat

Re:LibXML and HTML

gav on 2003-07-02T11:34:59

Doh. HTML parsers that can't parse broken HTML aren't that useful :)

Have you tried HTML::TreeBuilder with Class::XPath?

Re:LibXML and HTML

gnat on 2003-07-02T15:45:02

I haven't, but boy that's really cute. I was wondering the other day whether there were more general XPath modules available. You know, with a little optimization (the ability to search a tree once but have multiple possible XPath expressions and associated actions to run at each step), you could use XPath as the basis for your optimizer--write XPath expressions for the things to optimize.

Ah yes, I've known about XPath for three days. Why wouldn't I assume I've had an original thought :-)

--Nat

Re:LibXML and HTML

Matts on 2003-07-02T16:47:39

$parser->recovery(1)
Fixes that problem.
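For reference, a hedged sketch with the modern spelling: recent XML::LibXML exposes the same libxml2 switch as the recover option to load_html (recover => 2 also silences the warnings):

```perl
#!/usr/bin/perl -w
# Sketch: parse deliberately broken HTML with libxml2's recovering
# HTML parser.  recover => 2 recovers *and* suppresses the warnings.
use strict;
use XML::LibXML;

my $dom = XML::LibXML->load_html(
    string  => '<p>unclosed <b>bold<p>next paragraph',
    recover => 2,
);
print $dom->findvalue('//b'), "\n";    # bold
```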

Re:LibXML and HTML

gnat on 2003-07-02T17:25:28

Well, bollocks :-) I'd even seen that option in the manpage. This is what comes of doing your work at 3am, I guess ...

Thanks!

--Nat