Web::Scraper hacks #1: Extract links linking to images

miyagawa on 2007-09-03T16:59:23

I'm trying to post some neat cookbook recipes for Web::Scraper on this journal. They'll eventually be incorporated into the module documentation as something like Web::Scraper::Cookbook, but I'll post them here for now, since it's easy to update and gives each one a permalink.

The easiest way to keep up with these hacks would be to subscribe to the RSS feed of this journal, or look at my del.icio.us links tagged 'webscraper' (which has an RSS feed too).

Want to contribute your own hacks? Tag them 'webscraper' on del.icio.us so I can follow them.

Yesterday I played with What cameraphone do they use?, which extracts photo files from blog sites, and used the following code to extract the image links.

# extract A links that have an IMG inside
my $s = scraper {
    process "a>img", "links[]" => sub { $_->parent->attr('href') };
};


With "a>img" CSS selector, you'll get 'img' tags that follows 'a' tags, then call $_->parent to get its parent tag to retrieve the 'href' attribute.

> echo '<a href="foo.jpg"><img src="foo.jpg"></a>' | scraper
scraper> process "a>img", "links[]" => sub { $_->parent->attr('href') }
scraper> y
---
links:
  - foo.jpg
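
For reference, here is a minimal self-contained sketch of the same recipe as a standalone script (http://example.com/ is just a placeholder; point it at any page with image links):

#!/usr/bin/env perl
use strict;
use warnings;
use URI;
use Web::Scraper;

# extract the href of every A tag that directly contains an IMG
my $s = scraper {
    process "a>img", "links[]" => sub { $_->parent->attr('href') };
};

# the URL is a placeholder
my $res = $s->scrape( URI->new("http://example.com/") );
print "$_\n" for @{ $res->{links} || [] };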


To be more accurate, so that it won't pick up A links that don't actually link to .jpg files, you can write a slightly more complex XPath expression:

process q{//a[contains(@href,'.jpg')]/img},
  'links[]' => sub { $_->parent->attr('href') };


The contains() XPath function makes sure that the href attribute actually contains ".jpg" somewhere, so it won't pick up A tags linking to HTML files and the like.
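
Note that contains() matches ".jpg" anywhere in the href, so it would also pick up something like foo.jpg.html. XPath 1.0 (which is what HTML::TreeBuilder::XPath, the engine underneath Web::Scraper, speaks) has no ends-with() function, but assuming the standard XPath 1.0 string functions are available you can emulate it with substring() and string-length(). A sketch:

# emulate ends-with(): true only if the last 4 characters of @href are ".jpg"
process q{//a[substring(@href, string-length(@href) - 3) = '.jpg']/img},
    'links[]' => sub { $_->parent->attr('href') };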


Better living through superior XPath

Aristotle on 2007-09-03T19:21:38

Note that part of the improved expressiveness of XPath over CSS is that you don’t need to match on the deepest node you are trying to extract; you can always climb back up the tree, or use assertions:

# Tree navigation:
process '//a[contains(@href,".jpg")]/img/..', 'links[]' => '@href';

# Any valid XPath is also valid in an assertion:
process '//a[img][contains(@href,".jpg")]', 'links[]' => '@href';

The assertion is clearly the more straightforward way to say what you want here. It says “match all a element nodes anywhere, for which, given that node as context, there is an img element node child, and for which, given that node as context, there is an href attribute child whose value contains .jpg.” So you match just the links you want, and then you can use the Web::Scraper shortcut notation to pick up the href attributes.

Note that I mean “any valid XPath” quite literally: you can nest assertions, too, to as many levels as you like/need. F.ex., you might want to say this:

# Match thumbnail image links based on CSS classes
process '//a[img[@class="thumb"]][contains(@href,".jpg")]', 'links[]' => '@href';

But if you need to match mostly on the tree below the element you want, it might be more readable to use tree navigation instead. Rather than using a long, complex assertion, possibly with nested subassertions, you just match the entire desired tree shape, then walk back up out of it. In my experience the need for this is rare, but still, it helps to remember that XPath allows you to walk the tree in any direction you want.
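
F.ex., here is a sketch of that style, matching deep on the subtree first and then climbing back out with the parent:: axis (the “thumb” class name is made up):

# match the IMG deep in the tree, then step back out to the enclosing A
process '//img[@class="thumb"]/parent::a[contains(@href,".jpg")]',
    'links[]' => '@href';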

XPath is much more than just a punctuation-laden version of CSS. :-)

Re:Better living through superior XPath

miyagawa on 2007-09-03T21:01:40

Oh yes, and that's why I prefer CSS selectors!

My brain doesn't have enough space to remember things like the complete XPath syntax that I rarely use. I guess I should just print out an XPath cheat sheet, though.

Thanks for the "superior" XPath pointer anyway. That works and that's exactly why I keep the XPath support in Web::Scraper :)

Re:Better living through superior XPath

Aristotle on 2007-09-03T22:59:58

Ah, hehe. For me I guess it’s much like with the dereferencing punctuation in Perl: it has a few consistent rules that compose cleanly. So it doesn’t take up any space in my head at all. To each his own. :-)

scRUBYt!

Relipuj on 2007-09-04T15:50:09

Hello,

Just in case you don't know about this fantastic scraping tool: http://scrubyt.org/getting-started-with-scrubyt/

I'm sure there are a lot of ideas in that application that you could include in your module.

Regards,
Relipuj.

Re:scRUBYt!

miyagawa on 2007-09-04T15:57:49

Yes, I've taken a look at it as well, and found the scrAPI API easier to implement. Making the Web::Scraper backend completely OO and providing different DSL dialects on top of it is a big TODO :)

Re:scRUBYt!

Relipuj on 2007-09-04T16:27:21

This was just a suggestion ;-) Being an occasional scripter (and probably a bad one), I realize it's a big thing to do (or probably I can't even realize how big ;-).

Personally I don't care too much about the DSL or the OO interface. An imported function is perfectly OK.

What I love about it is that you just give it hints about what you want (the "APPLE M9801LL..." and the "71.99" in the given example), and it guesses, generally correctly, what you want to extract...

But I'd guess that is a lot of work too.

Your module is really great. I've been playing with it for a few hours now, and it's cool to use the scrAPI style from Perl :-)

Not really on-topic

titivillus on 2007-09-12T14:21:46

But after reading your slides, I got religion real quick on Web::Scraper. I even presented it to my Perl Mongers group. Thanks!

Re:Not really on-topic

miyagawa on 2007-09-12T17:45:27

Oh, that's awesome. Which Perl Mongers?

Re:Not really on-topic

titivillus on 2007-09-12T18:03:36

Purdue Perl Mongers in West Lafayette, IN.

The sad part is that I got it working and was writing and testing the thing in the two hours before the meeting, so I don't really have my head around the syntax yet. The sadder part is that my machine became unstable after I left, so I couldn't SSH in to look at example code and demo it. The saddest part was that there were no new members, because the promotion machine got bolloxed.

The happy part is that I essentially get a do-over because of all that.