CSS selector in Perl

miyagawa on 2006-09-23T07:30:03

The Ruby library scrAPI looks promising. It lets you write scraper code using CSS selectors, like:

scraper = Scraper.define do
  process 'span.title > a:first-child', :title => :text, :url => '@href'
  process 'ul.list-circle > li:first-child > a', :category => :text
  result :title, :url, :category
end

html = open(url).read
scraper.scrape(html)

In Plagger's EntryFullText module and the like, we use regular expressions and/or XPath to extract this kind of information, and I think adding CSS selectors would be neat too.

Is there already a Perl module on CPAN that does something similar? I searched for one but couldn't find any. CSS.pm doesn't do this.

We could write a CSS - XPath translator

dakkar on 2006-09-23T12:55:55

You know, CSS selectors are just a different syntax for a subset of XPath, so we could write a translator from CSS to XPath. I might even take a shot at that...

Re:We could write a CSS - XPath translator

miyagawa on 2006-09-23T12:59:43
Yeah, that sounds about right. But does XPath support alternatives and siblings, like "h1 + h2" (an h2 that immediately follows an h1)?

Re:We could write a CSS - XPath translator

Aristotle on 2006-09-23T15:08:56

Yes.

h1/following-sibling::*[1]/self::h2

Re:We could write a CSS - XPath translator

miyagawa on 2006-09-23T15:18:50
Neato. If XPath can do everything that CSS2 selectors can, translating (or compiling) the CSS selector expression to XPath is da way to go. The benefit of CSS selectors is that they're much easier to write than XPath.

Googling "CSS selector to XPath" gives me very few results:
http://groups.google.com/group/behaviour/browse_thread/thread/246782199cea5ce9/a2530a4abe5b12fd?lnk=gst&rnum=1#a2530a4abe5b12fd
http://www.joehewitt.com/blog/2006-03-20.php

Re:We could write a CSS - XPath translator

Aristotle on 2006-09-23T15:42:57

It should not be very hard. There are not many selectors in CSS2 and they just need to be translated once. Maybe I should write up the equivalents.
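
To give a rough idea of how small such a translator could be, here is a minimal sketch in Ruby (the language of the scrAPI example above). It is not taken from scrAPI or from any CPAN module; the function names css_to_xpath and compound_to_xpath are made up for illustration. It assumes only a small subset of CSS2: type and universal selectors, .class, #id, :first-child, and the descendant, child (>) and adjacent-sibling (+) combinators. Attribute selectors, selector groups (commas) and everything else are left out.

```ruby
# Hypothetical sketch: translate a small subset of CSS2 selectors to XPath.
# The function names are illustrative, not part of any existing library.

# Turn one compound selector such as "span.title" or "li:first-child"
# into an element name plus a string of XPath predicates.
def compound_to_xpath(compound)
  name_pattern = /\A(?:\*|[A-Za-z][\w\-]*)/
  name = compound[name_pattern] || '*'
  rest = compound.sub(name_pattern, '')
  unless rest =~ /\A(?:[.#:][\w\-]+)*\z/
    raise ArgumentError, "unsupported selector syntax: #{compound}"
  end

  predicates = []
  rest.scan(/([.#:])([\w\-]+)/) do |kind, value|
    case kind
    when '.'
      # Match one class name inside a space-separated class attribute.
      predicates << "[contains(concat(' ', normalize-space(@class), ' '), ' #{value} ')]"
    when '#'
      predicates << "[@id='#{value}']"
    when ':'
      raise ArgumentError, "unsupported pseudo-class :#{value}" unless value == 'first-child'
      predicates << '[not(preceding-sibling::*)]'
    end
  end
  [name, predicates.join]
end

# Translate a whole selector, one compound and one combinator at a time.
def css_to_xpath(selector)
  # Put spaces around ">" and "+" so they become tokens of their own.
  tokens = selector.strip.gsub(/\s*([>+])\s*/, ' \1 ').split(/\s+/)
  parts = []
  combinator = ' ' # a plain space means "descendant"
  tokens.each do |token|
    if token == '>' || token == '+'
      combinator = token
      next
    end
    name, predicates = compound_to_xpath(token)
    step = case combinator
           when ' ' then "//#{name}"
           when '>' then "/#{name}"
           when '+' then "/following-sibling::*[1]/self::#{name}"
           end
    parts << step + predicates
    combinator = ' '
  end
  parts.join
end

puts css_to_xpath('h1 + h2')
puts css_to_xpath('span.title > a:first-child')
```

The "+" case reuses the following-sibling expression given earlier in this thread. The :first-child case uses not(preceding-sibling::*) instead of a positional [1], because after a // step a positional predicate would count among matching names only, not among all children.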

Re:We could write a CSS - XPath translator

Aristotle on 2006-09-24T01:07:35

Here you go: How to map CSS selectors to XPath queries.

Re:We could write a CSS - XPath translator

miyagawa on 2006-09-24T10:24:12
That's a great one. Thank you!

What about CSS 3 Selectors (Pseudo classes)? Looks like html/selector.rb implements some of those, e.g. :root, :empty, :only-child etc.
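
For what it's worth, the three pseudo-classes named here seem to have reasonably direct XPath 1.0 counterparts. The table below is only a sketch of one possible mapping, in Ruby; it is not how html/selector.rb does it, and the names CSS3_PSEUDO_TO_XPATH and pseudo_to_xpath are invented for this example. Edge cases (for instance whether the root element should count as an only child) are glossed over.

```ruby
# Hypothetical mapping from three CSS3 structural pseudo-classes to
# XPath 1.0 predicates. An assumption for illustration, not scrAPI's code.
CSS3_PSEUDO_TO_XPATH = {
  # :root is the element with no parent element
  'root'       => '[not(parent::*)]',
  # :empty has no element children and no text nodes
  'empty'      => '[not(*) and not(text())]',
  # :only-child has no element siblings on either side
  'only-child' => '[not(preceding-sibling::*) and not(following-sibling::*)]'
}.freeze

# Build an XPath query for "element:pseudo", searched anywhere in the document.
def pseudo_to_xpath(element, pseudo)
  predicate = CSS3_PSEUDO_TO_XPATH.fetch(pseudo) do
    raise ArgumentError, "unsupported pseudo-class :#{pseudo}"
  end
  "//#{element}#{predicate}"
end

puts pseudo_to_xpath('p', 'empty')
puts pseudo_to_xpath('li', 'only-child')
```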

Porting html/selector.rb to perl

miyagawa on 2006-09-23T13:05:26

http://labnotes.org/svn/public/ruby/scrapi/lib/html/selector.rb is the Ruby code, which is really clean, and it'd be pretty straightforward (but boring) to port to Perl.

Re:Porting html/selector.rb to perl

bart on 2006-09-24T08:50:44
Can anybody recommend a quick tutorial to Ruby? I find this piece of source code extremely hard to read. That's because I'm not getting some of the basics in Ruby, of course.