Web::Scraper 0.14 is released with a couple of neat new features.
First of all, I incorporated HTML::Tagset's linkElements hash into the '@attr' accessor of elements, so if you do this:
my $s = scraper { process "a", "links[]" => '@href' };
my $res = $s->scrape(URI->new("http://www.example.com/"));
then, because a@href is known to be a link attribute, the values are automatically converted to absolute URIs using http://www.example.com/ as the base URI, even if the value of 'href' is relative.
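To make "converted to absolute" concrete, here is a minimal sketch of the resolution itself using URI->new_abs from the URI module. The base URI and the sample href values are made up for illustration, and the helper name absolutize is hypothetical; it only mimics the kind of result you get back, not necessarily how Web::Scraper does it internally.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve one attribute value against a base URI
# and return the absolute URI as a plain string.
sub absolutize {
    my ($value, $base) = @_;
    return URI->new_abs($value, $base)->as_string;
}

# Made-up base URI for illustration.
my $base = URI->new("http://www.example.com/docs/index.html");

# Relative, root-relative, parent-relative, and already-absolute values.
for my $href ("about.html", "/img/logo.png", "../top.html", "http://other.example.org/x") {
    print "$href => ", absolutize($href, $base), "\n";
}
```

An href that is already absolute comes back unchanged, so applying the resolution to every link attribute is harmless.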
Prior to 0.14 you had to write:
my $base = URI->new("http://www.example.com/");
my $s = scraper {
    process "a",
        "links[]" => sub { URI->new_abs($_->attr('href'), $base) };
};
$s->scrape($base);
but you don't need to do that anymore. The same thing happens to all attributes known to be links, like img@src, script@src etc. If you call $s->scrape(\$html) after retrieving the $html content from somewhere else, you can pass the base URI as the second parameter to scrape(), like:
my $mech = WWW::Mechanize->new;
$mech->get(...);
my $s = scraper { ... };
$s->scrape(\$mech->content, $mech->uri);
Note that if the HTML content has a 'base' tag, the URI absolutization might fail. In that case, you might want to use HTML::ResolveLink from CPAN to fix up the HTML before feeding it into Web::Scraper.
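A rough sketch of that fix-up step might look like the following. The page URI and the HTML snippet are made up for illustration, and exactly how HTML::ResolveLink treats the 'base' tag is an assumption here; check its documentation before relying on it.

```perl
use strict;
use warnings;
use HTML::ResolveLink;
use Web::Scraper;

# Hypothetical page URI and content, for illustration only.
my $page_uri = 'http://www.example.com/blog/index.html';
my $html = '<html><head><base href="http://www.example.com/files/"></head>'
         . '<body><a href="report.html">Report</a></body></html>';

# Rewrite the relative links in the HTML into absolute ones first.
my $resolver = HTML::ResolveLink->new(base => $page_uri);
my $fixed    = $resolver->resolve($html);

# Then scrape the fixed-up HTML as usual, passing the page URI as the base.
my $s   = scraper { process "a", "links[]" => '@href' };
my $res = $s->scrape(\$fixed, $page_uri);

print "$_\n" for @{ $res->{links} || [] };
```

Since the links are already absolute by the time Web::Scraper sees them, its own conversion has nothing left to resolve.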
Second, I added a handy shortcut 'HTML' and its alias 'RAW', to get the HTML data inside the matched tag. As seen in Web::Scraper hack #2, the text node inside script and style tags can't be retrieved using 'TEXT' because they're not technically text. The 'HTML' shortcut is basically a shortcut for $_->as_HTML, but it strips the outermost tag (the matched tag itself), so it's more useful.
So the code in hack #2 can now be as simple as:
my $s = scraper {
    process "script", "code" => 'RAW';
};
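To see the difference between the two shortcuts on ordinary markup, here is a small sketch that pulls both 'TEXT' and 'HTML' from the same element. The HTML snippet and the key names (text, html) are made up for illustration, and the exact serialization of the 'HTML' value may differ slightly from the input.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up document for illustration.
my $html = '<html><body><div id="intro">Hello <b>world</b></div></body></html>';

my $s = scraper {
    process "div#intro", text => 'TEXT';   # text content only, markup dropped
    process "div#intro", html => 'HTML';   # inner HTML, without the <div> itself
};

my $res = $s->scrape(\$html);
print "TEXT: $res->{text}\n";
print "HTML: $res->{html}\n";
```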