Web::Scraper is released, the Perl port of Scrapi.rb

miyagawa on 2007-05-09T03:07:13

Today I've been thinking about what to talk about at YAPC::EU (and at OSCON, if they're short of Perl talks; I'm not sure), and ended up spending a few hours hacking on a web-content scraping module built as a Domain Specific Language.

With help from people on the IRC channel and from obra, who gave a nice talk about DSLs in Perl at YAPC::Asia, I whipped up a really small Web::Scraper module.

This is basically a Perl port of Ruby's scrapi toolkit, and its API is intended to be similar to the Ruby one. So you can write a script that parses Twitter's friend list and extracts image URLs for them like this:

use URI;
use Web::Scraper;

my $nick = shift || "miyagawa";
my $uri = URI->new("http://twitter.com/$nick");

my $twitter = scraper {
    process 'a[rel="contact"]', 'friends[]' => scraper {
        process 'a', url => '@href', name => '@title';
        process 'img', src => '@src';
    };
    result 'friends';
};

my $friends = $twitter->scrape($uri);

use YAML;
warn Dump $friends;


I haven't looked at any of scrapi.rb's internal code, but I looked at several examples on the web and confirmed that those scripts run with only slight modifications. The module is a very small amount of code, just 100 lines or so, with some fun Perl hacking using local(), goto and function prototypes.
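
For anyone curious how local(), goto and function prototypes can combine into a block-style DSL, here is a minimal sketch. It is not Web::Scraper's actual source: the names recipe, step, link_step and $Current are hypothetical, made up for illustration, and the sketch only collects rules rather than parsing any HTML.

```perl
use strict;
use warnings;

# Hypothetical mini-DSL for illustration only; not Web::Scraper's real code.
our $Current;    # rule list of the recipe block that is currently running

# The (&) prototype lets callers write: recipe { ... };
sub recipe(&) {
    my $block = shift;
    return sub {
        # local() gives this run its own rule list, visible to any step()
        # called inside $block, and restores the old value on exit.
        local $Current = [];
        $block->();
        return [@$Current];
    };
}

# Records one rule in whichever recipe block is currently running.
sub step {
    my ($selector, %fields) = @_;
    die "step() called outside of a recipe block\n" unless $Current;
    push @$Current, { selector => $selector, fields => {%fields} };
    return;
}

# goto &step replaces this call frame, so step() receives the modified @_
# as if it had been called directly.
sub link_step {
    unshift @_, 'a';
    goto &step;
}

my $rules = recipe {
    step 'img', src => '@src';
    link_step url => '@href';
};

my $collected = $rules->();
printf "%s => %s\n", $_->{selector}, join(',', sort keys %{ $_->{fields} })
    for @$collected;
```

Running the returned code reference collects two rules, one for 'img' and one for 'a'. Using a dynamically scoped variable is one way to let the inner calls find their enclosing block without passing an object around, which is what keeps the block syntax so compact.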

It's still alpha quality and the API is likely to change a lot, but enjoy, and give me feedback!


Nice API design

markjugg on 2007-05-09T13:38:09

Nice API design. It's compact and intuitive.

Where are obra's slides

mr_bean on 2007-05-11T01:30:33

I looked at the video and thought of listening to the soundtrack http://tokyo2007.yapcasia.org/sessions/2007/02/abusing_domain_specific_langua.html, but now I'm looking for the book, i.e. the slides, and can't find them.

It's ironic that obra just posted on Wednesday, May 9, 2007, at http://obra.livejournal.com/ about how everything you DIDN'T want advertised on the Internet becomes public knowledge, but the things you do want advertised don't.

Re:Where are obra's slides

miyagawa on 2007-05-11T07:30:35

The slides are linked from http://tokyo2007.yapcasia.org/wiki/?SlidesFromTalks, particularly: http://svn.jifty.org/svn/jifty.org/jifty/trunk/doc/talks/yapc.asia.2007.txt