HTML Tree (DOM) + XPath = Element. The other way round?

miyagawa on 2007-08-10T05:25:39

Modules like HTML::TreeBuilder::XPath and HTML::Selector::XPath are very useful for extracting content from an HTML DOM tree using XPath expressions or CSS selectors. These modules do the following:

HTML DOM Tree + XPath expression => The element you want

Is there a way to do this the other way round? I mean,

HTML DOM Tree + The element you want => XPath expression

I know a Mozilla extension lets you do this with a GUI, but it's well known that the generated XPath is kinda bogus because it adds extra tbody etc., and it's useless when you don't use the Gecko engine.

The module would share its concept with Template::Extract, which creates TT templates from the stash variables and the generated output.

If anyone knows of prior work on this, let me know. Otherwise I'll begin writing a module for it, to make using Web::Scraper much easier. It'd be nice to add to my YAPC::EU talk.

And yes, all problems regarding my flight and hotel in Vienna seem to be sorted and I'll be in. Yay!


Possibly:

Aristotle on 2007-08-10T09:09:36

Are you asking about XML::LibXML::Node’s nodePath method?

Re:Possibly:

dakkar on 2007-08-10T09:20:44

More or less what I was going to suggest.

Keep in mind that 'nodePath' will return something like:

/html/body/div[3]/table/tr[2]/td[5]

Which, while correct, might not be the most flexible specification... maybe you really wanted:

/html/body/div[h2='The table']/table/tr[td[1]='this row']/td[position()=count(../../tr[1]/td[.='this column']/preceding-sibling::td)+1]
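
For illustration, here is a rough sketch of how a purely positional path of the first kind can be derived from an element. It is written in Python with the standard library's xml.etree.ElementTree as a stand-in for a real HTML tree, not with the Perl modules discussed here, and the function name node_path and the sample document are made up for the example. It is not how nodePath is implemented, only the general idea: walk from the element up to the root and record each step's position among same-named siblings.

```python
import xml.etree.ElementTree as ET


def node_path(root, target):
    """Return an absolute positional path for target (hypothetical helper)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        step = node.tag
        if parent is not None:
            same_tag = [c for c in parent if c.tag == node.tag]
            # An index is only needed when siblings share the tag name.
            if len(same_tag) > 1:
                index = next(i for i, c in enumerate(same_tag, 1) if c is node)
                step += "[%d]" % index
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


# Made-up sample document, well-formed so the XML parser accepts it.
doc = ET.fromstring(
    "<html><body><div/><div/><div><table>"
    "<tr><td>a</td></tr><tr><td>b</td><td>c</td></tr>"
    "</table></div></body></html>"
)
cell = doc.findall(".//td")[2]   # the cell containing "c"
print(node_path(doc, cell))      # /html/body/div[3]/table/tr[2]/td[2]
```

The result is exact but brittle: any extra div or row inserted earlier in the page shifts the indexes.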

Re:Possibly:

miyagawa on 2007-08-10T09:36:23

Exactly. That's what I don't like about the Mozilla extension approach, too.

I might want the module to generate multiple possible XPath expressions so that the user can pick the one that makes the most reliable scraper.
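
A minimal sketch of that idea, again in Python with xml.etree.ElementTree standing in for an HTML tree. The helper name candidate_paths and the sample markup are assumptions for the example, and only two kinds of candidate are produced: the fully positional path, and a path anchored at each ancestor (or the element itself) that carries an id attribute. It also assumes id values contain no single quotes.

```python
import xml.etree.ElementTree as ET


def candidate_paths(root, target):
    """List a few alternative paths to target (hypothetical helper)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def step(node):
        # Tag name, plus a 1-based index when same-tag siblings exist.
        parent = parent_of.get(node)
        if parent is None:
            return node.tag
        same = [c for c in parent if c.tag == node.tag]
        if len(same) == 1:
            return node.tag
        index = 1 + next(i for i, c in enumerate(same) if c is node)
        return "%s[%d]" % (node.tag, index)

    # Chain of nodes from the target up to the root.
    chain = []
    node = target
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)

    # Candidate 1: fully positional, starting at the root.
    candidates = ["/" + "/".join(step(n) for n in reversed(chain))]
    # Further candidates: start at any node in the chain that has an id.
    for depth, anchor in enumerate(chain):
        if anchor.get("id"):
            head = "//%s[@id='%s']" % (anchor.tag, anchor.get("id"))
            tail = [step(n) for n in reversed(chain[:depth])]
            candidates.append("/".join([head] + tail))
    return candidates


page = ET.fromstring(
    '<html><body><div id="main"><ul><li>x</li><li>y</li></ul></div>'
    "<div/></body></html>"
)
item = page.findall(".//li")[1]
paths = candidate_paths(page, item)
for path in paths:
    print(path)
```

Each candidate could then be run against the tree to confirm it still selects only the intended element before it is offered to the user.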

Re:Possibly:

Aristotle on 2007-08-10T10:55:47

You’ll run into a combinatorial explosion for even a relatively short path. There are a huge number of ways to address a single element.

I guess what you want, given your comparison with Template::Extract, is a way to accept multiple nodes and then ask for the strictest possible XPath expression (including shared attribute values on any ancestor elements, etc.) that matches them all.

Hmm, that would be cool.
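
One possible reading of "strictest expression that matches them all", sketched in Python with xml.etree.ElementTree as a stand-in for an HTML tree. The name generalize and the sample table are invented for the example, and the approach is an assumption rather than an existing module's behaviour: line up the ancestor chains of the example nodes, keep a positional index only where every example agrees on it, and keep an attribute only where every example has it with the same value. It assumes all examples sit at the same depth and that attribute values contain no single quotes.

```python
import xml.etree.ElementTree as ET


def generalize(root, targets):
    """Build one path matching every element in targets (hypothetical)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def chain(node):
        # Nodes from the root down to the node itself.
        nodes = []
        while node is not None:
            nodes.append(node)
            node = parent_of.get(node)
        return nodes[::-1]

    def position(node):
        # (1-based index among same-tag siblings, number of such siblings)
        parent = parent_of.get(node)
        if parent is None:
            return 1, 1
        same = [c for c in parent if c.tag == node.tag]
        return 1 + next(i for i, c in enumerate(same) if c is node), len(same)

    chains = [chain(t) for t in targets]
    if len({len(c) for c in chains}) != 1:
        raise ValueError("examples sit at different depths")

    steps = []
    for level in zip(*chains):
        if len({n.tag for n in level}) != 1:
            raise ValueError("examples disagree on a tag name")
        step = level[0].tag
        positions = [position(n) for n in level]
        indexes = {index for index, _ in positions}
        # Keep the index only if all examples agree and it is not redundant.
        # It goes before the attribute tests so it still counts all siblings.
        if len(indexes) == 1 and any(count > 1 for _, count in positions):
            step += "[%d]" % indexes.pop()
        # Keep attributes every example carries with an identical value.
        shared = set(level[0].attrib.items())
        for n in level[1:]:
            shared &= set(n.attrib.items())
        for name, value in sorted(shared):
            step += "[@%s='%s']" % (name, value)
        steps.append(step)
    return "/" + "/".join(steps)


table_doc = ET.fromstring(
    '<html><body><table class="data">'
    '<tr class="row"><td>a</td><td class="price">1</td></tr>'
    '<tr class="row"><td>b</td><td class="price">2</td></tr>'
    '<tr class="total"><td>sum</td><td>3</td></tr>'
    "</table></body></html>"
)
examples = table_doc.findall(".//td[@class='price']")
print(generalize(table_doc, examples))
```

Because the row index differs between the two examples it is dropped, while the shared class values and the agreed column position are kept, so the "total" row is excluded.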

Re:Possibly:

miyagawa on 2007-08-10T09:34:56

Yeah, this is quite similar to what I have in mind, except it's libxml based (I want one for HTML::Tree for some reason). But it'll definitely be helpful. Thank you!