Web::Scraper with filters, and thought about Text filters

miyagawa on 2007-10-04T08:20:33

A developer release of Web::Scraper with "filters" support has been pushed to CPAN. Let me explain for a bit how this filters stuff is useful.

Since an early version, Web::Scraper has had a callback mechanism, which is pretty neat: you can extract "data" out of HTML, not limited to plain strings.

For instance, if you have HTML like

  <span class="entry-date">2007-10-04T01:09:44-0800</span>


you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    DateTime::Format::W3CDTF->parse_string(shift->as_text);
  };
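
For context, here's a minimal sketch of how that callback sits inside a complete scraper; the URL and the print line are made up for illustration:

  use Web::Scraper;
  use URI;
  use DateTime::Format::W3CDTF;

  my $entry = scraper {
    process ".entry-date", "date" => sub {
      DateTime::Format::W3CDTF->parse_datetime(shift->as_text);
    };
  };

  # hypothetical page; ->{date} is now a DateTime object, not a string
  my $res = $entry->scrape( URI->new("http://example.com/blog/") );
  print $res->{date}->ymd;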


and with 'filters' you can make this reusable and stackable, like:

  package Web::Scraper::Filter::W3CDTFDate;
  use base qw( Web::Scraper::Filter );
  use DateTime::Format::W3CDTF;

  sub filter {
    DateTime::Format::W3CDTF->parse_datetime($_[1]);
  }

  1;


and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];


If the .entry-date text contains erroneous spaces, you can do:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];


This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters to share.
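
As an example of the kind of small, shareable filter such a distribution could collect, here's a hypothetical Web::Scraper::Filter::DecodeEntities that just wraps HTML::Entities (the name is made up):

  package Web::Scraper::Filter::DecodeEntities;
  use base qw( Web::Scraper::Filter );
  use HTML::Entities;

  # string in, string out: turn "&amp;" into "&", "&lt;" into "<", etc.
  sub filter {
    my($self, $value) = @_;
    return decode_entities($value);
  }

  1;

and then anyone can write:

  process "td.summary", summary => [ 'TEXT', 'DecodeEntities' ];

in their scraper.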

However, I have another, more ideal solution in mind.

The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.

And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously the number of combinations of text filters and text processing systems grows multiplicatively.

For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful within TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.
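
To make the duplication concrete, here's the same trivial whitespace-trimming logic written twice, once as a TT filter plugin and once as a Web::Scraper filter; both package names are made up:

  # Template Toolkit flavor
  package Template::Plugin::Filter::Trim;
  use base qw( Template::Plugin::Filter );

  sub init {
    my $self = shift;
    $self->install_filter('trim');
    return $self;
  }

  sub filter {
    my($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
  }

  # Web::Scraper flavor: same logic, different base class and glue
  package Web::Scraper::Filter::Trim;
  use base qw( Web::Scraper::Filter );

  sub filter {
    my($self, $value) = @_;
    $value =~ s/^\s+|\s+$//g;
    return $value;
  }

  1;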

Doesn't this suck?

I want a common text filter API that takes its input as a string and returns its output as a string. For complex filters like a wiki-to-text engine, it would be better to have configuration options as well.

  use Text::Filter::Common;
  my $filter = Text::Filter::Common->new($name, $config);
  my $output = $filter->filter($input, $option);


So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something, implementing a filter function that probably uses $self->config to configure the filter object.
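
Here's a rough sketch of what I have in mind; everything below, including the URIEscape example, is made up and just illustrates the shape of the API:

  package Text::Filter::Common;
  use strict;

  # factory: load Text::Filter::Common::$name and return an instance
  sub new {
    my($class, $name, $config) = @_;
    my $subclass = "Text::Filter::Common::$name";
    eval "require $subclass" or die "Can't load $subclass: $@";
    return $subclass->new($config);
  }

  package Text::Filter::Common::Base;
  use strict;

  sub new {
    my($class, $config) = @_;
    bless { config => $config || {} }, $class;
  }

  sub config { $_[0]->{config} }

  # subclasses implement this: string in, string out
  sub filter { die "override filter() in a subclass" }

  package Text::Filter::Common::URIEscape;
  use base qw( Text::Filter::Common::Base );
  use URI::Escape ();

  sub filter {
    my($self, $input, $option) = @_;
    return URI::Escape::uri_escape($input);
  }

  1;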

Then we can write adapter interfaces for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of re-porting each text filter to a bunch of different modules.
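
For example, the Web::Scraper side of such an adapter could be as thin as this (again, a hypothetical sketch; how the filter name and config get passed in is an open question, so they're hardcoded here):

  package Web::Scraper::Filter::Common;
  use base qw( Web::Scraper::Filter );
  use Text::Filter::Common;

  sub filter {
    my($self, $value) = @_;
    # hardcoded for the sketch; ideally the name and config
    # would come in as filter arguments
    my $common = Text::Filter::Common->new('URIEscape', {});
    return $common->filter($value);
  }

  1;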

Looks like the Text::Filter namespace is already taken, and even though it seems close to what I want, it supports both reading and writing, which is more than I want.

Thoughts?