Following the discussion in "CSS Selector in Perl", I wrote a quick Perl module, HTML::Selector::XPath, which is now available at http://svn.bulknews.net/repos/public/HTML-Selector-XPath/trunk.
The code is based on the JavaScript code available at http://dev.rubyonrails.org/ticket/5171, which looks a little buggy, and was slightly modified using the more accurate table on
See the test suite 02_html.t for how to use this module combined with HTML::TreeBuilder::XPath (yeah, I plan to release a glue module for H::TB::XPath anyway) to extract content from (X)HTML using XPath expressions. Now your scraping code is hopefully free from nasty regexps!
I'll upload this module to CPAN shortly but give it a shot if you're interested.
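The combination described above might look roughly like the following. This is a minimal sketch, not the contents of 02_html.t: the HTML snippet and the selector are made up for illustration, and it assumes both HTML::Selector::XPath and HTML::TreeBuilder::XPath are installed.

```perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Made-up input document, for illustration only.
my $html = <<'HTML';
<html><body>
<ul id="nav"><li class="item">Home</li><li class="item">About</li></ul>
<p class="item">Not a list item</p>
</body></html>
HTML

# Translate a CSS selector into an XPath expression.
my $xpath = HTML::Selector::XPath->new('ul#nav li.item')->to_xpath;

# Parse the HTML into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Collect the text of every element matched by the selector.
my @texts = map { $_->as_text } $tree->findnodes($xpath);
print "$_\n" for @texts;

# HTML::TreeBuilder trees should be freed explicitly.
$tree->delete;
```

The selector only describes what to match; all the tree walking is left to the XPath engine.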
Hmm, it seems entirely possible to express all of CSS 3 in terms of XPath 1.0; no XPath 2.0 required.
I just haven’t gotten to it, honestly because I was too lazy. CSS 3 has many more syntax elements than CSS 2, and the new ones are much more complex, so it’s not quite the same kind of 5-minute job.
Re:CSS3 support
miyagawa on 2006-10-03T11:12:34
Really? That sounds great. I was translating the :not() CSS 3 selector but couldn't find how to map it to XPath 1.0 without using :not(). Maybe I'm missing something obvious?
Re:CSS3 support
Aristotle on 2006-10-03T11:41:50
Seems to me that a [not(subexpr)] predicate should work. The only trick is to get any references to the context node right in subexpr, I suppose by using self::* or something.
Actually, now that you have written the module I may get around to it sooner, since there are working unit tests in there…
Re:CSS3 support
miyagawa on 2006-10-03T13:45:45
Aha, cool. I have now fixed the handling of the :not() pseudo-class so that it maps to [not()], which worked. See the updated unit test to confirm. Thanks!
Re:CSS3 support
Aristotle on 2006-10-03T14:36:50
Add a case for *:not(p) and see if that works. The correct translation should be *[not(self::p)], I think.
Re:CSS3 support
miyagawa on 2006-10-03T14:55:21
It doesn't work, at least for now.
To support that I would have to rewrite the parser algorithm somehow, which will happen when I decide to do complete CSS 3 selector support. For now it'll croak.
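For reference, the translation suggested above can be sketched in plain Perl. The helper below is hypothetical and is not how HTML::Selector::XPath works internally; it only handles selectors of the exact form E:not(F), where E and F are element names or *.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Selector::XPath.
# Handles only "E:not(F)" with E and F being an element name or "*".
sub not_selector_to_xpath {
    my ($selector) = @_;
    my ($element, $negated) = $selector =~ /\A(\*|\w+):not\((\*|\w+)\)\z/
        or die "unsupported selector: $selector";
    # self::F tests the context node itself, so not(self::F) keeps
    # every candidate element that is not an F.
    return sprintf '//%s[not(self::%s)]', $element, $negated;
}

print not_selector_to_xpath('*:not(p)'), "\n";   # //*[not(self::p)]
```

The self:: axis is what makes this expressible in XPath 1.0: the negated type test is applied to the candidate node itself rather than to its children.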
Re:Why not XML::LibXML in HTML mode?
miyagawa on 2006-10-03T14:46:57
Yeah, XML::LibXML in HTML mode would work, too.
I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling non-well-balanced HTML. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous :)
Re:Why not XML::LibXML in HTML mode?
Aristotle on 2006-10-03T21:05:06
libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.
I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.
TagSoup also works. (Someone should port that one to Perl and/or C…)