Screen scraping w/ XPath and CSS Selector in Action

miyagawa on 2006-10-03T09:56:31

Here is more example code that demonstrates how to use XPath and CSS Selector to do screen scraping without using nasty regular expressions.

The task is "Search search.cpan.org for XML and extract 1) how many modules there are and 2) the links to the PODs along with the module names"

There you go:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;

binmode STDOUT, ":utf8";

my $mech = WWW::Mechanize->new;
$mech->get("http://search.cpan.org/search?query=XML&mode=all");

my $count = $mech->xpath(q|//div[@class='t4']/small/b[3]|);
print "Count: ", $count->content->[0], "\n";

my @links = $mech->selector("p > a:first-child");
for (@links) {
    print "Module: ", $_->content->[0]->content->[0], "\n";
    print "Link: ", $_->attr('href'), "\n";
}


# Monkey patch: run a CSS selector by translating it to XPath first.
sub WWW::Mechanize::selector {
    my($mech, $selector) = @_;
    $mech->xpath(HTML::Selector::XPath->new($selector)->to_xpath);
}

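# Monkey patch: decode the current page, parse it, and evaluate an XPath
# expression. Returns all matches in list context, the first one in scalar.
sub WWW::Mechanize::xpath {
    my($mech, $xpath) = @_;

    my @ct = $mech->response->header('Content-Type');

    # Use the charset from the Content-Type header, else assume UTF-8.
    my $content;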
    if ($ct[0] && $ct[0] =~ /charset=([\w\-]+)/) {
        $content = decode($1, $mech->content);
    } else {
        $content = decode_utf8($mech->content);
    }

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;

    my @nodes = $tree->findnodes($xpath);
    return wantarray ? @nodes : $nodes[0];
}

Pretty simple and maintainable, but a couple of things:

1) WWW::Mechanize::selector and ::xpath would be pretty useful. The code does a monkey patch, but it sounds better to create a WWW::Mechanize plugin that hooks in HTML::TreeBuilder(::XPath) -- UPDATE: it's now implemented as WWW::Mechanize::TreeBuilder on CPAN

2) Using CSS selectors with HTML::TreeBuilder::XPath would be a win even if you don't use WWW::Mechanize. Probably 2 lines of code for a new module?

3) Guessing the charset from the HTTP response header could be split out into a separate module (HTTP::Response::GuessCharset?). This could be made more robust using what we use in Plagger::Util::decode_content, which detects the charset even from the meta tag and the XML declaration, and by using Encode::Detect.

4) I hate HTML::Element's content->[0] and content_list stuff. All I want is just "content of the children as HTML (or text)" and "attributes as a hash reference" ($elem->all_external_attr does this).
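
As a rough illustration of point 3, here is a minimal sketch of charset guessing that uses only core Perl. The names guess_charset and decode_body are hypothetical, the lookup order (header, then XML declaration, then meta tag) and the UTF-8 fallback are assumptions, and this is not what Plagger::Util::decode_content actually does. The Encode::Detect step is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: return the first usable charset name found in the
# Content-Type header, the XML declaration, or a <meta> tag (in that order).
sub guess_charset {
    my ($content_type, $bytes) = @_;
    my $head = substr($bytes, 0, 4096);   # declarations live near the top

    my @candidates;
    push @candidates, $1
        if defined $content_type && $content_type =~ /charset=["']?([\w\-]+)/i;
    push @candidates, $1
        if $head =~ /^\s*<\?xml[^>]*\bencoding=["']([\w\-]+)["']/i;
    push @candidates, $1
        if $head =~ /<meta\b[^>]*\bcharset=["']?([\w\-]+)/i;

    # Keep the first name that Encode actually recognizes.
    for my $name (@candidates) {
        return $name if find_encoding($name);
    }
    return 'UTF-8';   # assumed fallback, same as the script above
}

# Hypothetical helper: decode raw bytes into a Perl character string.
sub decode_body {
    my ($content_type, $bytes) = @_;
    return decode(guess_charset($content_type, $bytes), $bytes);
}
```

Something like decode_body(scalar $mech->response->header('Content-Type'), $raw) could then stand in for the if/else inside WWW::Mechanize::xpath, where $raw is assumed to hold the undecoded response body.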

Excellent idea

mir on 2006-10-03T10:06:12

If that's possible, I would be totally happy to include CSS selectors in HTML::TreeBuilder::XPath (and actually even in XML::XPathEngine). I would love the module to auto-detect which query language is used, but I don't think that's possible, as the syntaxes overlap.

Re:Excellent idea

ripplesearch on 2006-10-04T00:06:28
Where does CSS::SAC fit into this discussion?

Thanks,

Christopher

Re:Excellent idea

miyagawa on 2006-10-04T02:32:13
Hm, I haven't looked at CSS::SAC. Looks like it's a SAX-style parser for CSS? My code just uses CSS selectors as a replacement for XPath, and it could probably make use of a CSS selector parser to be complete.

Re:Excellent idea

miyagawa on 2006-10-05T06:05:07
Uh, I used Google Code Search and found the probably duplicated work, done in CSS::SAC in January 2005.

It looks like CSS::SAC on CPAN has not been updated for a long time (the last update was September 2004), so having a separate, pure-Perl implementation (independent of any other CPAN module) would not be a bad thing, though.

Re:Excellent idea

ripplesearch on 2006-10-05T06:30:59
Indeed, I just thought I'd point it out, as I have been looking for something in Perl as good as ScrAPI. I don't have the cycles to write one and haven't found it yet (CSS::SAC being the closest). However, if we can build something better, I am happy. :-)

Christopher

Re:Excellent idea

miyagawa on 2006-10-04T02:29:36
That's totally possible with just a few lines of code, and yeah, automatically telling CSS selectors apart from XPath would be impossible. I'm not sure including the feature in H::TB::XPath is the right thing to do. Maybe it is.
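
A minimal sketch of what those few lines might look like, assuming HTML::Selector::XPath and HTML::TreeBuilder::XPath are both installed. The helper name find_by_selector and the sample HTML are made up for illustration; neither module ships such a function.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Hypothetical helper, not part of either module: run a CSS selector
# against an HTML::TreeBuilder::XPath tree by translating it to XPath first.
sub find_by_selector {
    my ($tree, $selector) = @_;
    my $xpath = HTML::Selector::XPath->new($selector)->to_xpath;
    return $tree->findnodes($xpath);
}

# Made-up sample markup, parsed without WWW::Mechanize.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse('<html><body><p><a href="/dist/XML-Simple/">XML::Simple</a></p></body></html>');
$tree->eof;

for my $link (find_by_selector($tree, 'p > a')) {
    print $link->as_text, " => ", $link->attr('href'), "\n";
}

$tree->delete;   # free the tree once the nodes are no longer needed
```

The caller still has to say which language a query is written in; the helper only handles the CSS side and leaves plain findnodes for XPath.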

Re:Excellent idea

mir on 2006-10-04T09:05:43

I hadn't looked at this at all, but I see that your HTML::Selector::XPath is indeed most of what's needed. Nice job.

I have to think about it, but at the very least I will add something to the docs about using HTML::Selector::XPath in order to use CSS selectors with XML/HTML modules.

amazing

jmason on 2006-10-05T10:38:26

This is very nice! Thanks!

I've been scraping HTML for a while (since sitescooper), and XPath is definitely the right way to do it, I think.