Perl haiku contest, and today's 5-minute hack

petdance on 2004-02-14T19:38:34

ActiveState announced the winners of their Perl haiku contest. The funny thing was that I didn't know any of the names shown, and there sure seemed to be a lot of duplicates. Plus, there was a Dishonorable Mention for this entry:

Unreadable code,
Why would anyone use it?
Learn a better way.


Here's why I use it: Because I can write a program to summarize the winners on the web page in 5 minutes.

use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );

$mech->get( "http://aspn.activestate.com/ASPN/Perl/Haiku/AboutPerl" );

my @names = ( $mech->content =~ /Name: (.+?)</g );
my %count;
$count{$_}++ for @names;

for my $key ( sort { $count{$b} <=> $count{$a} || lc $a cmp lc $b } keys %count ) {
    printf "%3d: %s\n", $count{$key}, $key;
}



tried it ... q on re's

goon on 2004-02-15T05:03:48

Thanks for the sample code. Tried it. Worked on Cygwin (beats ActiveState, and I don't have a *nix box online) after a bit of buggerising around with CPAN modules.

One question, though: why is it better to use regexes to extract data rather than an HTML parser?

I ask because I'm currently using regexes to extract data from Google, and writing and debugging them takes up most of the time.

Re:tried it ... q on re's

petdance on 2004-02-15T05:28:37

It's better in this case because I wanted a quick (5 minutes, remember) and dirty solution to my problem.

If you're extracting data from Google, you may want to look at the link functions in WWW::Mechanize, anyway. It does a lot of the parsing for you.
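
For what it's worth, here's a minimal sketch of that approach. The search URL and the url_regex filter below are only placeholders, not anything tuned to Google's actual markup; the point is that Mech hands you parsed WWW::Mechanize::Link objects, so there is no hand-written HTML regex at all:

#!/usr/bin/perl
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Placeholder query -- point this at whatever results page you scrape.
$mech->get( 'http://www.google.com/search?q=www+mechanize' );

# find_all_links() works on the already-parsed page; the url_regex
# here is just a coarse filter that keeps absolute http: links.
my @links = $mech->find_all_links( url_regex => qr/^http:/ );

for my $link (@links) {
    printf "%s\t%s\n", $link->url, $link->text;
}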

Another hack, a question and a patch

BooK on 2004-02-16T13:30:44

I needed to know who subscribed to and unsubscribed from the Groo mailing list in the old days. The online archives list every mail received by the list owner between 1995 and 1997, including subscription and unsubscription notifications.

Here's the hack:

#!/usr/bin/perl
use strict;
use WWW::Mechanize;

$|++;    # unbuffer output so results show up as they are fetched
my $bot = WWW::Mechanize->new;
$bot->get('http://www.groo.com/mail/');

# Every subscription/unsubscription notification is linked from the index.
my @links =
  $bot->find_all_links( text_regex => qr/^(?:UN)?SUBSCRIBE groo-l$/ );

my ( $sub, $uns ) = ( 0, 0 );

# Walk the links in reverse so the output runs oldest to newest.
for (reverse @links) {
    $bot->get( $_->url );
    $bot->content =~ /<em>Date<\/em>: (.*)/;
    my $date = $1;
    my ( $who, $act, $num );
    $who = $1, $act = "SUB", $num = ++$sub
      if $bot->content =~ /(.*) has been added to groo-l/;
    $who = $1, $act = "UNS", $num = ++$uns
      if $bot->content =~ /(.*) has unsubscribed from groo-l/;

    # Undo the HTML entity escaping in the archived mail.
    $who =~ s/&lt;/</g;
    $who =~ s/&gt;/>/g;
    $who =~ s/&amp;/&/g;

    print "$num\t$act\t$who\t$date\n";
    $bot->back;    # return to the archive index
}
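
As an aside, those three substitutions are the usual hand-rolled entity decoding. If you'd rather not maintain them, HTML::Entities (it ships in the HTML-Parser distribution that WWW::Mechanize already pulls in) can do the same thing in one call; a quick sketch:

use HTML::Entities qw( decode_entities );

# In void context decode_entities() modifies its argument in place,
# turning &lt;, &gt;, &amp; and friends back into plain characters.
decode_entities( $who );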

Here's an excerpt of the result (s/@/#/g'ed to protect the innocent):

1    SUB    Ruben Javier Arellano <rubena#unixg.ubc.ca>    Thu, 28 Sep 1995 21:58:41 -0700
2    SUB    John-Alex Berglund <johnalex#oslonett.no>    Sat, 30 Sep 1995 20:00:51 -0700
3    SUB    Sam <scf2#acpub.duke.edu>    Sun, 1 Oct 1995 15:41:10 -0700
4    SUB    iago#mail.utexas.edu    Sun, 1 Oct 1995 16:02:10 -0700

Here's the question:

Wouldn't it be useful to be able to set a maximum depth for the page stack? That way, when someone writes a bot that runs for a long time and never calls back(), the script doesn't eat up all the memory.

And while I'm at it, here's the patch:

--- WWW-Mechanize-0.72/lib/WWW/Mechanize.pm     2004-01-13 05:36:36.000000000 +0100
+++ WWW-Mechanize/lib/WWW/Mechanize.pm  2004-02-16 13:17:30.000000000 +0100
@@ -238,6 +238,11 @@
Don't complain on warnings.  Setting C<< quiet => 1 >> is the same as
calling C<< $agent->quiet(1) >>.  Default is off.

+=item * C<< stack_depth => $value >>
+
+Sets the depth of the page stack that keeps track of all the downloaded
+pages. Default is -1 (infinite).
+
=back

=cut
@@ -255,6 +260,7 @@
         onwarn      => \&WWW::Mechanize::_warn,
         onerror     => \&WWW::Mechanize::_die,
         quiet       => 0,
+        stack_depth => -1,
     );

     my %passed_parms = @_;
@@ -1134,6 +1140,21 @@
     return $self->{quiet};
}

+=head2 $mech->stack_depth($value)
+
+Get or set the page stack depth. Older pages are discarded first.
+
+A negative value means "keep all the pages".
+
+=cut
+
+sub stack_depth {
+    my $self = shift;
+    my $old  = $self->{stack_depth};
+    $self->{stack_depth} = shift if @_;
+    return $old;
+}
+
=head1 Overridden L<LWP::UserAgent> methods

=head2 $mech->redirect_ok()
@@ -1402,6 +1423,8 @@
         $self->{page_stack} = [];

         push( @$save_stack, $self->clone );
+        shift @$save_stack if $self->stack_depth >= 0
+                              and  @$save_stack > $self->stack_depth;

         $self->{page_stack} = $save_stack;
     }
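
To make the intent concrete, here's a minimal sketch of how a long-running bot would use the proposed option once the patch is applied. Note that stack_depth is what the patch above adds; it isn't in the stock 0.72 API, and the URL is just the Groo archive from the hack earlier:

#!/usr/bin/perl
use strict;
use WWW::Mechanize;

# stack_depth comes from the patch above; it is not part of the
# stock WWW::Mechanize 0.72 release.
my $bot = WWW::Mechanize->new( stack_depth => 10 );

$bot->get('http://www.groo.com/mail/');
my @links = $bot->links;

# This loop never calls back(), so without a depth limit every page
# it fetches would pile up on the history stack; with stack_depth
# set to 10, only the ten most recent pages are kept and memory use
# stays flat no matter how long it runs.
for my $link (@links) {
    $bot->get( $link->url );
    # ... process $bot->content here ...
}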

Re:Another hack, a question and a patch

petdance on 2004-02-16T15:25:48

Beautiful. Can I please get you to forward that to bug-www-mechanize at rt.cpan.org?