evolving - Monday 17th November, 2003

richardc on 2003-11-18T01:10:06

Overly and repeatedly dumb.

So today I figured it'd be as good a time as any to try and speed up Timesink some. For those not keenly stalking me or my pet projects: Timesink is my web-based RSS aggregator, which I wrote a while back when migrating away from a dying Mac and NetNewsWire.

Anyway, a couple of parts of it are kinda slow: one is the "what's unseen" calculation in the web frontend, and the other is the scraper.

Now the unseen code looks like this:

  package Timesink::DBI::Subscriber;

  sub unseen {
      my $self = shift;
      my $feed = shift;
      my ($sub) = Timesink::DBI::Subscription->search({ feed       => $feed,
                                                        subscriber => $self });
      my %seen = map { $_->item => 1 } $sub->seen;
      return grep { !$seen{ $_ } } $feed->items;
  }

That is: given a subscriber, find out how many items in a given feed are unseen by that subscriber. That you get the actual objects back is somewhat a side effect, as the web interface only really cares about the final count.

Okay, I think, after determining with Devel::Profiler that this was the slow spot: time to rewrite that as a quick SQL query.

Small flaw in that plan: my live instance runs on mysql 4.0.16, which doesn't do sub-selects, and it won't get upgraded to 4.1 until debian unstable does that for me, so I was stuck painfully rewriting the query to emulate one. After about half an hour of that, I admitted defeat and added this comment:

  +# XXX I be the slowest routine in Christendom.  a sub-select would
  +# probably help, if mysql 4.0 wasn't lame.
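For the record, the sort of query I was reaching for looks something like this on mysql 4.1 or anything else with sub-selects. The table and column names here are guesses loosely based on the Class::DBI classes above, not Timesink's real schema:

```perl
use DBI;

# Sketch only: $user, $pass, $feed_id and $subscription_id stand in
# for real configuration, and the items/seen tables are hypothetical.
my $dbh = DBI->connect( "dbi:mysql:timesink", $user, $pass,
                        { RaiseError => 1 } );

# One round trip: count the feed's items that have no matching row
# in the seen table for this subscription.
my ($unseen) = $dbh->selectrow_array( q{
    SELECT COUNT(*)
      FROM items
     WHERE items.feed = ?
       AND items.id NOT IN ( SELECT seen.item
                               FROM seen
                              WHERE seen.subscription = ? )
}, undef, $feed_id, $subscription_id );
```

The counting happens on the database's side, instead of two result sets getting dragged into Perl and compared by hand.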

So, moving on to the second step: work on the speed of the scraper.

Now the scraper itself is fairly quick; it's mostly just waiting on upstream servers to hand out RSS documents to parse. So if I could parallelise the downloading, it would take less wall-clock time, even if it isn't really doing any less work.

Now I could see two ways around that: something finicky with LWP::Parallel, or the brute-force forking of Proc::Queue.

Given that I didn't want to rewrite LWP::Simple's mirror routine, I decided to plump for forking; just so long as I remembered to disconnect the dbh before forking I'd be fine, or so I reasoned. So this was my first stab:

Before:

  for my $feed (@feeds) {
      my $rss = $self->get_rss( $feed ) or next;
      $self->scrape_feed( $feed, $rss );
  }

After:

  for my $feed (@feeds) {
      my $pid = fork;
      die "couldn't fork: $!" unless defined $pid;
      if ($pid == 0) {
          my $rss = $self->get_rss( $feed ) or next;
          $self->scrape_feed( $feed, $rss );
          exit;
      }
  }
  1 while wait != -1; # reap the kids

Spot the deliberate mistake? Well even if you did, I didn't for a time. Then Rafael asked me why I was grabbing his RSS 30 times a minute.

So I scratched my head, and eventually saw my mistake. Back in the old single-process model, if get_rss didn't return new RSS, that was your clue to check the *next* feed. Once I'd moved that into a multi-process model, the job of the child on a failed fetch is not to try again but to exit gracefully; with next, the child instead carried on around the loop, forking children of its own for every remaining feed. The fix was as simple as:

  -         my $rss = $self->get_rss( $feed ) or next;
  +         my $rss = $self->get_rss( $feed ) or exit;
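With that folded in, plus the dbh hygiene from earlier, the whole loop comes out something like this. It's a sketch: the db_Main call is my assumption about where Class::DBI (via Ima::DBI) keeps its handle, and Class::DBI reconnects lazily the next time anyone wants the database:

```perl
# Disconnect the shared handle before forking, so parent and children
# don't all trample one mysql connection; each reconnects on demand.
Timesink::DBI->db_Main->disconnect;

for my $feed (@feeds) {
    my $pid = fork;
    die "couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # In the child, a failed fetch means exit, not next;
        # next sends the child off around the loop to fork
        # children of its own.
        my $rss = $self->get_rss( $feed ) or exit;
        $self->scrape_feed( $feed, $rss );
        exit;
    }
}
1 while wait != -1; # reap the kids
```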

Case solved I thought, and went off to watch teevee.

Of course it doesn't end there. But there are bonus points if you can guess my next (and hopefully final) mistake of the evening.

Yes, that's right: I'd forgotten to install the fixed version of the module, so the next time the script ran it picked up the old DoS-happy version and looped all over again. D'oh.

---

Nothing for months, and then two modules come along in one day.

IO::Automatic and Parse::Debian::Packages pretty much sprang to my fingers unbidden today: the latter a side effect of adding debian support to Leon's cool new Module::Packaged module, the former a TT-like trick extracted from some code I found myself banging on.
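For the curious, Parse::Debian::Packages is about as small as it sounds; usage runs along these lines (a sketch from memory of the synopsis, with a local Packages file assumed to be on disk):

```perl
use Parse::Debian::Packages;

# Feed the parser a filehandle on a debian Packages file.
open my $fh, "<", "Packages" or die "can't open Packages: $!";
my $parser = Parse::Debian::Packages->new( $fh );

# next returns one stanza at a time as a hash of field => value.
while ( my %package = $parser->next ) {
    print "$package{Package} $package{Version}\n";
}
```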

Enjoy.