Here is some more example code that demonstrates how to use XPath and CSS selectors for screen scraping without resorting to nasty regular expressions.
The task: "Search search.cpan.org for XML and extract 1) how many modules there are and 2) the links to the PODs, with module names."
Here you go:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;

binmode STDOUT, ":utf8";

my $mech = WWW::Mechanize->new;
$mech->get("http://search.cpan.org/search?query=XML&mode=all");

my $count = $mech->xpath(q|//div[@class='t4']/small/b[3]|);
print "Count: ", $count->content->[0], "\n";

my @links = $mech->selector("p > a:first-child");
for (@links) {
    print "Module: ", $_->content->[0]->content->[0], "\n";
    print "Link: ", $_->attr('href'), "\n";
}

sub WWW::Mechanize::selector {
    my($mech, $selector) = @_;
    $mech->xpath(HTML::Selector::XPath->new($selector)->to_xpath);
}

sub WWW::Mechanize::xpath {
    my($mech, $xpath) = @_;

    my @ct = $mech->response->header('Content-Type');

    my $content;
    if ($ct[0] && $ct[0] =~ /charset=([\w\-]+)/) {
        $content = decode($1, $mech->content);
    } else {
        $content = decode_utf8($mech->content);
    }

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;

    my @nodes = $tree->findnodes($xpath);
    return wantarray ? @nodes : $nodes[0];
}
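The two helper subs above boil down to "translate the selector to XPath, then call findnodes". Here is a minimal offline sketch of just that step, with no network access. The HTML snippet and its hrefs are invented for illustration and are not the real search.cpan.org markup; the exact XPath string that to_xpath produces may differ between module versions, so it is printed rather than assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, loosely shaped like a result list (not real site output).
my $html = <<'HTML';
<html><body>
<p><a href="/dist/Foo-Bar/">Foo::Bar</a> <a href="/author/X/">x</a></p>
<p><a href="/dist/Baz/">Baz</a></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Translate the CSS selector into an XPath expression.
my $xpath = HTML::Selector::XPath->new("p > a:first-child")->to_xpath;
print "XPath: $xpath\n";

# In list context findnodes returns the matching elements.
my @links = $tree->findnodes($xpath);
for my $link (@links) {
    print "Module: ", $link->as_text, "\n";
    print "Link: ", $link->attr('href'), "\n";
}

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```

Only the first link inside each paragraph should match, which is the same trick the script above relies on to skip the other links in each search result.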
If that's possible, I would be totally happy to include CSS selectors in HTML::TreeBuilder::XPath (and actually even in XML::XPathEngine). I would love the module to auto-detect which query language is used, but I don't think that's possible, as the syntaxes overlap.
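To make the overlap concrete: a bare element name such as "a" or "*" is both a valid CSS selector and a valid relative XPath expression, so the string alone cannot settle which language was meant. Here is a rough sketch of a syntax-only guesser; guess_query_language is a hypothetical helper, not part of any module, and its patterns are hints rather than a real parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: guess the query language from syntax alone.
# Returns 'xpath', 'css', or 'ambiguous'. Heuristic only, not a parser.
sub guess_query_language {
    my ($query) = @_;
    # A leading slash, an axis (::), an [@attr] test, or ../ hints at XPath.
    return 'xpath' if $query =~ m{^/|::|\[@|\.\./};
    # A child combinator, .class/#id shorthand, or :pseudo-class hints at CSS.
    return 'css' if $query =~ m{>|[#.][A-Za-z_]|:[a-z]};
    # Nothing distinctive: the query is plausible in both languages.
    return 'ambiguous';
}

for my $query ('//div[@class="t4"]/small/b[3]', 'p > a:first-child', 'a', '*', 'div') {
    printf "%-32s %s\n", $query, guess_query_language($query);
}
```

The two queries from the script are easy to tell apart, but "a", "*" and "div" come back as ambiguous, and those are exactly the cases where CSS and XPath select different sets of nodes.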
Re:Excellent idea
ripplesearch on 2006-10-04T00:06:28
Where does CSS::SAC fit into this discussion?
Thanks,
Christopher
Re:Excellent idea
miyagawa on 2006-10-04T02:32:13
Hm, I haven't looked at CSS::SAC. Looks like it's a SAX-style parser for CSS? My code just uses CSS selectors as a replacement for XPath, and it could probably make use of a CSS selector parser to be complete.
Re:Excellent idea
miyagawa on 2006-10-05T06:05:07
Uh, I used Google Code Search to find the probably duplicated work done in CSS::SAC, in January 2005.
It looks like CSS::SAC on CPAN has not been updated for a long time (the last update was in September 2004), so having a separate, pure-Perl implementation (independent of any other CPAN module) would not be a bad thing, though.
Re:Excellent idea
ripplesearch on 2006-10-05T06:30:59
Indeed. I just thought I'd point it out, as I have been looking for something in Perl as good as ScrAPI; I don't have the cycles to write one and haven't found one yet (CSS::SAC being the closest). However, if we can build something better, I am happy. :-)
Christopher
Re:Excellent idea
miyagawa on 2006-10-04T02:29:36
That's totally possible with just a few lines of code, and yeah, auto-detecting whether a query is a CSS selector or XPath would be impossible. I'm not sure whether including the feature in H::TB::XPath is the right thing to do. Maybe it is.
Re:Excellent idea
mir on 2006-10-04T09:05:43
I hadn't looked at this at all, but I see that your HTML::Selector::XPath is indeed most of what's needed. Nice job.
I have to think about it, but at the very least I will add something to the docs about using HTML::Selector::XPath in order to use CSS selectors with the XML/HTML modules.