Web::Scraper 0.14

miyagawa on 2007-09-15T00:19:30

Web::Scraper 0.14 has been released with a couple of neat features.

First of all, I incorporated HTML::Tagset's linkElements hash into the '@attr' accessor of elements, so if you do this:

$s = scraper { process "a", "links[]" => '@href' };
$s->scrape(URI->new("http://www.example.com/"));

the 'href' values are automatically converted to absolute URIs using http://www.example.com/ as the base URI, even if they are relative, because a@href is known to be a link attribute.
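As a rough sketch of what this amounts to (not the module's actual code), the lookup and conversion look something like the following. It assumes the URI and HTML::Tagset modules are installed; the sample href values are made up for illustration.

```perl
use strict;
use warnings;
use URI;
use HTML::Tagset;

my $base = URI->new("http://www.example.com/");

# HTML::Tagset maps a tag name to the attributes that hold links.
my %is_link = map { $_ => 1 } @{ $HTML::Tagset::linkElements{a} || [] };

# Made-up href values, for illustration only.
my @hrefs = ("/about", "news/today.html", "http://other.example.org/");

my @links;
if ($is_link{href}) {
    # Resolve each value against the base URI; absolute ones stay as they are.
    @links = map { URI->new_abs($_, $base)->as_string } @hrefs;
}
print "$_\n" for @links;
```

Relative values pick up the scheme and host from the base, while a value that is already absolute passes through unchanged.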

Prior to 0.14 you had to write:

my $base = URI->new("http://www.example.com/");
$s = scraper {
     process "a", 
         "links[]" => sub { URI->new_abs($_->attr('href'), $base) };
};
$s->scrape($base);

but you don't need to do that anymore. The same thing happens for every attribute known to hold a link, such as img@src and script@src. If you use $s->scrape(\$html) after retrieving the $html content from somewhere else, you can pass the base URI as the second parameter to scrape(), like:

$mech = WWW::Mechanize->new;
$mech->get(...);

my $s = scraper { ... };
$s->scrape(\$mech->content, $mech->uri);
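For a version that can be tried without fetching anything, here is a minimal sketch that scrapes an inline HTML string. The HTML and the base URI are made up for illustration, and it assumes Web::Scraper 0.14 or later is installed.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up HTML and base URI, for illustration only.
my $html = '<html><body><a href="/about">About</a><img src="logo.png"></body></html>';
my $base = "http://www.example.com/dir/page.html";

my $s = scraper {
    process "a",   "links[]"  => '@href';
    process "img", "images[]" => '@src';
};

# Pass the HTML as a scalar reference and the base URI as the second argument.
my $res = $s->scrape(\$html, $base);

print "$_\n" for @{ $res->{links} }, @{ $res->{images} };
```

Both a@href and img@src should come back as absolute URIs resolved against the base.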

Note that if the HTML content has a 'base' tag, the conversion to absolute URIs might fail. In that case, you might want to use HTML::ResolveLink from CPAN to fix up the HTML before feeding it into Web::Scraper.
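A minimal sketch of that fix-up step, assuming HTML::ResolveLink is installed and using its new(base => ...) constructor and resolve() method. The HTML string and base URI are made up for illustration.

```perl
use strict;
use warnings;
use HTML::ResolveLink;

# Made-up HTML and base URI, for illustration only.
my $html = '<a href="/about">About</a>';

my $resolver = HTML::ResolveLink->new(base => "http://www.example.com/");

# resolve() returns a copy of the HTML with relative links rewritten.
my $fixed = $resolver->resolve($html);

print $fixed, "\n";
```

The resulting $fixed string can then be passed to scrape() as \$fixed.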

Second, I added a handy shortcut 'HTML' and its alias 'RAW', to get the HTML data inside the matched tag. As seen in Web::Scraper hack #2, the text node inside script and style tags can't be retrieved using 'TEXT' because they're not technically text. The 'HTML' shortcut is basically a shortcut to $_->as_HTML, but it strips the outermost tag (the matched tag itself), so it's more useful.

So the code in hack #2 can now be as simple as:

my $s = scraper {
    process "script", "code" => 'RAW';
};
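To compare the 'TEXT' and 'HTML' shortcuts side by side, here is a small sketch on a made-up HTML fragment (the "box" id and the key names are just for illustration). It assumes Web::Scraper 0.14 or later is installed.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up HTML, for illustration only.
my $html = '<html><body><div id="box">Hello <b>world</b></div></body></html>';

my $s = scraper {
    # One process call can fill several keys from the same matched element.
    process "#box", "text" => 'TEXT', "html" => 'HTML';
};
my $res = $s->scrape(\$html);

print "TEXT: $res->{text}\n";   # text content only, markup dropped
print "HTML: $res->{html}\n";   # inner markup, without the enclosing div
```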