A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain a bit about why this filters stuff is useful.
Since its early versions, Web::Scraper has had a callback mechanism, which is pretty neat: you can extract "data" out of HTML, not just strings.
For instance, if you have an HTML element with the class entry-date whose text is
2007-10-04T01:09:44-08:00
you can get the DateTime object that the string represents, like:
process ".entry-date", "date" => sub {
    # the callback receives the matched element; parse its text as W3CDTF
    DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
};
and with 'filters' you can make this reusable and stackable, like:
package Web::Scraper::Filter::W3CDTFDate;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
sub filter {
    # $_[0] is the filter object, $_[1] is the text being filtered
    DateTime::Format::W3CDTF->new->parse_datetime($_[1]);
}
1;
and then:
process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];
If the .entry-date text contains erroneous leading or trailing spaces, you can do:
process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];
This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.
So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.
However, I have another, more ideal solution in mind.
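To make "stackable" a bit more concrete, here is a minimal sketch of how a filter chain can be applied. This is not Web::Scraper's actual implementation: apply_filters, the %FILTERS registry and the DateParts filter are names made up for illustration, and inline callbacks are assumed to return the new value. It only uses core Perl.

```perl
use strict;
use warnings;

# Hypothetical registry of named filters (illustration only).
my %FILTERS = (
    # Split a W3CDTF-style timestamp into a hash of date fields.
    DateParts => sub {
        my $s = shift;
        my %d;
        @d{qw(year month day)} = $s =~ /^(\d{4})-(\d\d)-(\d\d)/
            or die "not a date: $s";
        return \%d;
    },
);

# Run $value through each filter in order. A filter is either a name
# looked up in %FILTERS or an inline code reference. Each filter gets
# the previous filter's result, so the last one may return any data.
sub apply_filters {
    my ($value, @filters) = @_;
    for my $f (@filters) {
        my $code = (ref($f) eq 'CODE' ? $f : $FILTERS{$f})
            or die "unknown filter: $f";
        $value = $code->($value);
    }
    return $value;
}

# An inline callback trims the text, then a named filter parses it.
my $date = apply_filters(
    "  2007-10-04T01:09:44-08:00  ",
    sub { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s },
    'DateParts',
);
print "$date->{year}/$date->{month}/$date->{day}\n";
```

Without the trimming callback in front, DateParts would die on the leading spaces, which is the point of being able to stack filters.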
The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.
And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously, the number of combinations of text filters and text processing systems explodes, because every filter needs its own adapter for every framework.
For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.
Doesn't this suck?
I want a common text filter API that takes its input as a string and returns its output as a string, too. For complex filters like a wiki-to-text engine, it would probably be better to have a configuration option as well.
use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);
So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something and implements a filter function that probably takes $self->config to configure the filter object.
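Here is a rough sketch of what that could look like, just to make the idea tangible. None of these modules exist: the class layout, the Base64 filter and its eol config key are all assumptions. The packages are defined inline so the example runs as a single file; a real factory would require the filter module instead.

```perl
use strict;
use warnings;

# Hypothetical base class: keeps the config and defines the interface.
package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { return $_[0]{config} }
sub filter { die ref(shift) . " must implement filter()" }

# Hypothetical filter: Base64-encode the input using core MIME::Base64.
package Text::Filter::Common::Base64;
use parent -norequire, 'Text::Filter::Common::Base';
use MIME::Base64 ();
sub filter {
    my ($self, $input) = @_;
    # 'eol' is an invented config key; the default is no line ending.
    my $eol = defined $self->config->{eol} ? $self->config->{eol} : '';
    return MIME::Base64::encode_base64($input, $eol);
}

# Hypothetical factory: maps a short name to a filter class.
package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $impl = "Text::Filter::Common::$name";
    die "no such filter: $name" unless $impl->can('filter');
    return $impl->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Base64');
print $filter->filter("hello"), "\n";    # aGVsbG8=
```

String in, string out, with the config kept on the object: that is all a framework adapter would need to know about.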
Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of porting each text filter to a bunch of different modules.
It looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, and that's more than I want.
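An adapter could be as small as a closure. The sketch below assumes only that a filter object has a filter($input) method; My::ReverseFilter and as_callback are made-up names standing in for a real filter and a real adapter.

```perl
use strict;
use warnings;

# Stand-in for any filter object exposing filter($input); hypothetical.
package My::ReverseFilter;
sub new    { return bless {}, shift }
sub filter { my ($self, $input) = @_; return scalar reverse $input }

package main;

# Hypothetical adapter: wrap a filter object in a plain callback that
# takes a string and returns a string, the shape that callback-based
# frameworks tend to accept.
sub as_callback {
    my ($filter) = @_;
    return sub { return $filter->filter($_[0]) };
}

my $cb = as_callback(My::ReverseFilter->new);
print $cb->("abc"), "\n";    # cba
```

A code reference like $cb could then presumably go into a Web::Scraper filter list the same way the inline sub { ... } callback does, and each other framework would need one similar adapter rather than one port per filter.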
Thoughts?