HTML::Selector::XPath

miyagawa on 2006-10-02T16:54:32

Following the discussion in "CSS Selector in Perl", I wrote a quick Perl module, HTML::Selector::XPath, which is now available at http://svn.bulknews.net/repos/public/HTML-Selector-XPath/trunk.

The code is based on the JavaScript code at http://dev.rubyonrails.org/ticket/5171, which looks a little buggy, so I modified it slightly using a more accurate table. (Thanks, Aristotle!)

See 02_html.t in the test suite for how to use this module together with HTML::TreeBuilder::XPath (yeah, I plan to release a glue module for H::TB::XPath anyway) to extract content from (X)HTML using XPath expressions. Now your scraping code is hopefully free from nasty regexps!
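
As a rough illustration of that combination (not a copy of 02_html.t), here is a minimal sketch. The sample markup and the li.item selector are made up for the example, and it assumes both modules are installed.

```perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Made-up sample document for the example.
my $html = <<'HTML';
<html><body>
<ul>
  <li class="item">foo</li>
  <li class="item">bar</li>
  <li>baz</li>
</ul>
</body></html>
HTML

# Translate the CSS selector into an XPath expression.
my $xpath = HTML::Selector::XPath->new('li.item')->to_xpath;

# Parse the HTML into a tree that can be queried with XPath.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Collect the text of every node matched by the translated selector.
my @items = map { $_->as_text } $tree->findnodes($xpath);
print "$_\n" for @items;

# HTML::TreeBuilder trees should be freed explicitly.
$tree->delete;
```

The selector stays readable CSS in the scraping code, and only the generated XPath string is handed to the tree.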

I'll upload this module to CPAN shortly, but give it a shot in the meantime if you're interested.


CSS3 support

Aristotle on 2006-10-03T11:07:16

Hmm, it seems entirely possible to express all of CSS 3 in terms of XPath 1.0; no XPath 2.0 required.

I just haven’t gotten to it, honestly because I was too lazy. CSS 3 has many more syntax elements than CSS 2, and the new ones are much more complex, so it’s not quite the same kind of 5-minute job.

Re:CSS3 support

miyagawa on 2006-10-03T11:12:34

Really? That sounds great. I was translating the CSS 3 :not() selector but couldn't find a way to map it to XPath 1.0 without using :not(). Maybe I'm missing something obvious?

Re:CSS3 support

Aristotle on 2006-10-03T11:41:50

Seems to me that a [not(subexpr)] predicate should work. The only trick is to get any references to the context node right in subexpr, I suppose by using self::* or something.

Actually, now that you have written the module I may get around to it sooner, since there are working unit tests in there…

Re:CSS3 support

miyagawa on 2006-10-03T13:45:45

Aha, cool. I've now fixed the handling of the :not() pseudo-class so that it maps to [not()], and that works. See the updated unit tests to confirm. Thanks!

Re:CSS3 support

Aristotle on 2006-10-03T14:36:50

Add a case for *:not(p) and see if that works. The correct translation should be *[not(self::p)], I think.
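
A minimal sketch of that translation, to make the self:: trick concrete. Both helper functions are hypothetical and are not how HTML::Selector::XPath parses selectors. They only handle a single element name (or *) followed by one :not() that contains an element name, an id, or a class.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the simple selector inside :not(...) into an
# XPath 1.0 test evaluated against the context node.
sub negated_test {
    my ($simple) = @_;
    return "self::$1"  if $simple =~ /^([A-Za-z][\w-]*)$/;    # p
    return "\@id='$1'" if $simple =~ /^#([\w-]+)$/;           # #main
    return "contains(concat(' ', normalize-space(\@class), ' '), ' $1 ')"
        if $simple =~ /^\.([\w-]+)$/;                         # .note
    die "unsupported selector inside :not(): $simple\n";
}

# Hypothetical helper: translate "<element-or-*>:not(<simple selector>)"
# into a descendant-axis XPath expression.
sub not_selector_to_xpath {
    my ($selector) = @_;
    my ($element, $negated) =
        $selector =~ /^(\*|[A-Za-z][\w-]*):not\(\s*([^()\s]+)\s*\)$/
        or die "expected something like *:not(p), got: $selector\n";
    return "//$element\[not(" . negated_test($negated) . ")]";
}

print not_selector_to_xpath($_), "\n"
    for '*:not(p)', 'div:not(.note)', 'li:not(#main)';
```

An element name inside :not() has to become a self:: test because the predicate is evaluated on the node already selected. Attribute tests such as @id need no such adjustment.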

Re:CSS3 support

miyagawa on 2006-10-03T14:55:21

It doesn't work, at least for now.

To support that I'd have to rewrite the parser algorithm somehow, which I'll do when I decide to implement complete CSS 3 selector support. For now it'll croak.

Why not XML::LibXML in HTML mode?

merlyn on 2006-10-03T14:42:58

I've used XML::LibXML in HTML mode. It'll be far faster, and when you wrap it with the xsh language, it's even better (and xsh version 2.0 is getting some very neat features).

Re:Why not XML::LibXML in HTML mode?

miyagawa on 2006-10-03T14:46:57

Yeah, XML::LibXML in HTML mode would work, too.

I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling HTML that isn't well balanced. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous :)
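
For comparison, a minimal sketch of the XML::LibXML route. It assumes an XML::LibXML release that provides load_html with the recover and suppress_errors options. The sample markup, with its unclosed li elements, is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample: the <li> elements are deliberately left unclosed.
my $html = '<html><body><ul>'
         . '<li class="item">foo<li class="item">bar'
         . '</ul></body></html>';

# recover => 1 asks libxml2 to keep going past markup errors;
# suppress_errors => 1 keeps its complaints off STDERR.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# The resulting document can be queried with plain XPath.
my @items = map { $_->textContent } $doc->findnodes('//li[@class="item"]');
print "$_\n" for @items;
```

How much broken markup this tolerates depends on libxml2 itself, so it is worth trying against the kind of pages you actually scrape.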

Re:Why not XML::LibXML in HTML mode?

Aristotle on 2006-10-03T21:05:06

libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.

I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.

TagSoup also works. (Someone should port that one to Perl and/or C…)