Jacek Artymiak posted some random Python code to parse the O'Reilly product index.
I've been writing screen scrapers off and on for years. I've read enough Python code to understand this program. Yet the style of this program really irks me. It's an example of something that works but is difficult to read. Without the brief introduction, and the URL embedded within the code, I would have no idea what this program did, and I probably wouldn't care. All I could identify from his program is that it examines an HTML page that contains <tr>, <td>, and <a> tags and lots of occurrences of the string http://www.oreilly.com/catalog/
To prove to myself that I'm not being anti-Pythonic, I wrote a screen scraper to do something similar in Perl. It took me about ten minutes, mostly because HTML::TableContentParser is such a kickass module, and partly because I can never remember the format of the data it returns. :-) Most screen scrapers I write these days use Data::Dumper while in development to (a) remind me what HTML::TableContentParser returns, and (b) show where the content I want to examine is stored.
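As a rough illustration of that Data::Dumper habit, here is a minimal sketch. It does not call HTML::TableContentParser at all. Instead, $tables is a hand-built stand-in that assumes the same nesting the scraper relies on (a list of tables, each with rows, each row with cells, each cell with a data field). The book title, ISBN, and catalog URL in it are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for parsed output: a reference to a list of tables,
# each holding rows, each row holding cells. All values here are made up.
my $tables = [
    {
        rows => [
            { cells => [ { data => 'Title' }, { data => 'ISBN' } ] },
            { cells => [
                { data => ' <a href="http://www.oreilly.com/catalog/example/">Example Book</a> ' },
                { data => '0-000-00000-0' },
            ] },
        ],
    },
];

$Data::Dumper::Indent   = 1;   # compact, fixed-width indentation
$Data::Dumper::Sortkeys = 1;   # same key order on every run

# (a) Dump everything to recall the overall shape of the structure.
print Dumper($tables);

# (b) Then narrow the dump down to the piece you actually want.
my $first_cell = $tables->[-1]{rows}[1]{cells}[0]{data};
print Dumper($first_cell);
```

Once the path to the interesting cell is known, the Dumper calls can be deleted and the same path used directly in the scraper.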
Here's my version. I've left the comments in, because that's how I wrote the code for myself. I think the intent of this program is much easier to divine based on the code shown below.
#!/usr/bin/perl5.8.0 -w

use strict;
use LWP::Simple;
use HTML::TableContentParser;

## Fetch the index once; later runs reuse the local copy
getstore("http://www.oreilly.com/catalog/prdindex.html", "prdindex.html")
    unless -e "prdindex.html";

open(my $index, "<", "prdindex.html")
    or die "Can't open prdindex.html: $!";
$/ = undef;  ## Slurp the whole file in one read

my $p = HTML::TableContentParser->new;

## The catalog is the last table on the page
my $catalog = $p->parse(<$index>)->[-1]->{rows};

shift(@$catalog); ## Remove the header row

my @fields = qw(title isbn price online_version examples);
my @books;

foreach my $row (@$catalog) {
    my %book;

    ## One field per cell, with surrounding whitespace trimmed
    @book{@fields} =
        map {s/^\s+//; s/\s+$//; $_}
        map {$_->{data}}
        @{$row->{cells}};

    ## Clean up the data some more
    @book{qw(titleurl title)} = $book{title} =~ m/href="(.*?)">(.*?)<\/a>/;
    ($book{examples}) = $book{examples} =~ m/href="(.*?)\s*"/;
    ($book{online_version}) = $book{online_version} =~ m/href="(.*?)"/;

    ## Drop the optional fields when the cell had no link
    delete $book{examples} unless $book{examples};
    delete $book{online_version} unless $book{online_version};

    push(@books, \%book);
}
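To see the per-row cleanup without fetching anything, here is a small offline sketch of the same hash-slice and regex steps applied to a single row. The contents of @cells are hypothetical (invented title, ISBN, price, and URLs), and the title pattern assumes the link text ends at a closing </a> tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @fields = qw(title isbn price online_version examples);

# Hypothetical cell contents for one catalog row (made up, not real data).
my @cells = (
    ' <a href="http://www.oreilly.com/catalog/example/">Example Book</a> ',
    ' 0-000-00000-0 ',
    ' $0.00 ',
    '',
    ' <a href="http://examples.example.com/example/ ">Examples</a> ',
);

my %book;

# Hash slice: fill all five fields at once, trimming whitespace on the way.
@book{@fields} = map { my $c = $_; $c =~ s/^\s+//; $c =~ s/\s+$//; $c } @cells;

# Split the title cell into the link target and the link text.
@book{qw(titleurl title)} = $book{title} =~ m{href="(.*?)">(.*?)</a>};

# Reduce the link cells to bare URLs; a failed match leaves undef.
($book{examples})       = $book{examples}       =~ m/href="(.*?)\s*"/;
($book{online_version}) = $book{online_version} =~ m/href="(.*?)"/;

# Drop the optional fields that turned out to have no link.
for my $key (qw(examples online_version)) {
    delete $book{$key} unless $book{$key};
}

print "$_: $book{$_}\n" for sort keys %book;
```

The empty online_version cell fails its match and is deleted, so only rows that really link to an online version end up with that key.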
My scraper is definitely "me-ware": it works for me, and I make no guarantees about not changing it if I need something different out of it. You might find it useful too.
- Barrie