ActiveState announced the winners of their Perl haiku contest. The funny thing was that I didn't know any of the names shown, and there sure seemed like a lot of duplicates. Plus, there was a Dishonorable Mention for this entry:
Unreadable code,
Why would anyone use it?
Learn a better way.
Thanks for the sample code. I tried it, and it worked on Cygwin (beats ActiveState, and I don't have a *nix box online) after a bit of buggerising around with CPAN modules.
One question, though: why is it better to use regexes to extract data rather than an HTML parser?
I ask because I'm currently using regexes to extract data from Google, and writing and debugging them takes up most of the time.
Re:tried it ... q on re's
petdance on 2004-02-15T05:28:37
It's better in this case because I wanted a quick (5 minutes, remember) and dirty solution to my problem. If you're extracting data from Google, you may want to look at the link functions in WWW::Mechanize anyway. It does a lot of the parsing for you.
I needed to know who subscribed to and unsubscribed from the Groo mailing list in the old days. The online archives list every mail received by the list owner between 1995 and 1997, including subscription and unsubscription notifications.
Here's the hack:
#!/usr/bin/perl
use strict;
use WWW::Mechanize;

$|++;    # autoflush, so results appear as they are found

my $bot = WWW::Mechanize->new;
$bot->get('http://www.groo.com/mail/');

my @links =
  $bot->find_all_links( text_regex => qr/^(?:UN)?SUBSCRIBE groo-l$/ );

my ( $sub, $uns ) = ( 0, 0 );
for (reverse @links) {
    $bot->get( $_->url );
    $bot->content =~ /<em>Date<\/em>: (.*)/;
    my $date = $1;
    my ( $who, $act, $num );
    $who = $1, $act = "SUB", $num = ++$sub
      if $bot->content =~ /(.*) has been added to groo-l/;
    $who = $1, $act = "UNS", $num = ++$uns
      if $bot->content =~ /(.*) has unsubscribed from groo-l/;

    # decode the HTML entities found in the addresses
    $who =~ s/&lt;/</g;
    $who =~ s/&gt;/>/g;
    $who =~ s/&amp;/&/g;

    print "$num\t$act\t$who\t$date\n";
    $bot->back;
}
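The entity-decoding step at the end of the loop can be exercised on its own. Here is a minimal sketch (the helper name and sample address are made up for illustration); note that `&amp;` must be decoded last, or an escaped entity like `&amp;lt;` would wrongly collapse all the way to `<`:

```perl
use strict;
use warnings;

# Decode the three HTML entities the scraper cares about.
# Order matters: &amp; goes last so that doubly-escaped
# text such as "&amp;lt;" decodes to "&lt;", not "<".
sub decode_entities_min {
    my ($s) = @_;
    $s =~ s/&lt;/</g;
    $s =~ s/&gt;/>/g;
    $s =~ s/&amp;/&/g;
    return $s;
}

print decode_entities_min('Sam &lt;scf2#acpub.duke.edu&gt;'), "\n";
# Sam <scf2#acpub.duke.edu>
```

For anything beyond these three entities, HTML::Entities from CPAN does the job properly.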
Here's an excerpt of the result (s/@/#/g'ed to protect the innocent):
1 SUB Ruben Javier Arellano <rubena#unixg.ubc.ca> Thu, 28 Sep 1995 21:58:41 -0700
2 SUB John-Alex Berglund <johnalex#oslonett.no> Sat, 30 Sep 1995 20:00:51 -0700
3 SUB Sam <scf2#acpub.duke.edu> Sun, 1 Oct 1995 15:41:10 -0700
4 SUB iago#mail.utexas.edu Sun, 1 Oct 1995 16:02:10 -0700
Here's the question:
Wouldn't it be useful to be able to set a maximum depth for the page stack? That way, when someone writes a bot that runs for a long time and never calls back(), the script doesn't eat up all the memory.
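The idea is simple enough to sketch outside WWW::Mechanize: push onto the stack, then drop the oldest entry whenever the configured depth is exceeded. The helper name here is illustrative, not part of Mechanize:

```perl
use strict;
use warnings;

# Push a page onto a history stack, discarding the oldest
# entry when the stack grows past $depth.
# A negative $depth means "unlimited".
sub push_bounded {
    my ( $stack, $depth, $page ) = @_;
    push @$stack, $page;
    shift @$stack if $depth >= 0 and @$stack > $depth;
    return $stack;
}

my @history;
push_bounded( \@history, 2, $_ ) for qw(page1 page2 page3);
print "@history\n";    # page2 page3
```

With a depth of 2, page1 is discarded as soon as page3 arrives, so memory stays bounded no matter how long the bot runs.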
And while I'm at it, here's the patch:
--- WWW-Mechanize-0.72/lib/WWW/Mechanize.pm 2004-01-13 05:36:36.000000000 +0100
+++ WWW-Mechanize/lib/WWW/Mechanize.pm 2004-02-16 13:17:30.000000000 +0100
@@ -238,6 +238,11 @@
Don't complain on warnings. Setting C<< quiet => 1 >> is the same as
calling C<< $agent->quiet(1) >>. Default is off.
+=item * C<< stack_depth => $value >>
+
+Sets the depth of the page stack that keeps track of all the downloaded
+pages. Default is -1 (infinite).
+
=back
=cut
@@ -255,6 +260,7 @@
onwarn => \&WWW::Mechanize::_warn,
onerror => \&WWW::Mechanize::_die,
quiet => 0,
+ stack_depth => -1,
);
my %passed_parms = @_;
@@ -1134,6 +1140,21 @@
return $self->{quiet};
}
+=head2 $mech->stack_depth($value)
+
+Get or set the page stack depth. Older pages are discarded first.
+
+A negative value means "keep all the pages".
+
+=cut
+
+sub stack_depth {
+ my $self = shift;
+ my $old = $self->{stack_depth};
+ $self->{stack_depth} = shift if @_;
+ return $old;
+}
+
=head1 Overridden L<LWP::UserAgent> methods
=head2 $mech->redirect_ok()
@@ -1402,6 +1423,8 @@
$self->{page_stack} = [];
push( @$save_stack, $self->clone );
+ shift @$save_stack if $self->stack_depth >= 0
+ and @$save_stack > $self->stack_depth;
$self->{page_stack} = $save_stack;
}
Re:Another hack, a question and a patch
petdance on 2004-02-16T15:25:48
Beautiful. Can I please get you to forward that bug-www-mechanize at rt.cpan.org?