Web::Scraper with filters, and thoughts about Text filters

brian_d_foy on 2007-10-04T20:37:00

A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me take a moment to explain why filters are useful.

Since an early version, Web::Scraper has had a pretty neat callback mechanism, so you can extract "data" out of HTML, not just strings.

For instance, if you have HTML like this:

<span class="entry-date">2007-10-04T01:09:44-08:00</span>

you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    DateTime::Format::W3CDTF->parse_string(shift->as_text);
  };

and with 'filters' you can make this reusable and stackable, like:

package Web::Scraper::Filter::W3CDTFDate;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
 
sub filter {
    DateTime::Format::W3CDTF->parse_string($_[1]);
}
1;

and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];

If the .entry-date text contains erroneous spaces, you can do:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];

This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.
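
To make "stackable" concrete, here is a minimal sketch in plain Perl of running a value through a list of filters, left to right. It is not Web::Scraper's actual implementation; the apply_filters helper and the two example filters are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Web::Scraper): feed the output of each
# filter into the next one, in the order given.
sub apply_filters {
    my ($value, @filters) = @_;
    for my $filter (@filters) {
        $value = $filter->($value);
    }
    return $value;
}

# Two illustrative filters, written as plain code refs.
my $trim  = sub { my $s = $_[0]; $s =~ s/^\s+|\s+$//g; return $s };
my $upper = sub { return uc $_[0] };

my $result = apply_filters("  2007-10-04t01:09:44z ", $trim, $upper);
print "$result\n";   # 2007-10-04T01:09:44Z
```

Because every filter takes one value and returns one value, the order of the list is the only thing the caller has to think about.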

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.

However, I have another, more ideal solution in mind.

The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter, to name a few.

And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base, to name a few. Obviously the number of combinations multiplies: each new text filter needs a separate adapter for each of these text processing systems.

For instance, TT has a gazillion filter plugins on CPAN (under Template::Plugin::*) that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.

Doesn't this suck?

I want a common text filter API that takes a string as input and returns a string as output. For complex filters like a wiki-to-text engine, it would probably be better to accept configuration options as well.

use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);

So Text::Filter::Common is a factory module. Each text filter is a subclass of Text::Filter::Common::Base or something similar, and implements a filter method that probably reads $self->config to configure the filter object.
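
A rough sketch of how that factory and base class could fit together follows. None of these modules exist; the class layout follows the names proposed above, and the Truncate filter with its length option is a made-up example just to show $self->config in use.

```perl
use strict;
use warnings;

# Hypothetical base class: keeps the config and defines the interface.
package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { return $_[0]{config} }
sub filter { die ref($_[0]) . " must implement filter()\n" }

# Hypothetical example filter: keep only the first N characters.
package Text::Filter::Common::Truncate;
our @ISA = ('Text::Filter::Common::Base');
sub filter {
    my ($self, $input) = @_;
    my $length = $self->config->{length} // 10;
    return substr($input, 0, $length);
}

# Hypothetical factory: maps a short name to a filter class.
package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $impl = "Text::Filter::Common::$name";
    die "Unknown filter '$name'\n" unless $impl->can('filter');
    return $impl->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Truncate', { length => 4 });
my $output = $filter->filter('2007-10-04');
print "$output\n";   # 2007
```

A real version would presumably load the filter class on demand instead of expecting it to be defined already, but the calling code would look the same.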

Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template-Toolkit, and avoid the duplicated effort of re-porting each text filter to a bunch of different modules.
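
For engines that accept a plain callback, such an adapter could be very small. The sketch below assumes only that a common filter is an object with a filter($input) method; My::Filter::Upper and as_callback are hypothetical names, and a real adapter for a specific engine would have to follow that engine's own plugin conventions.

```perl
use strict;
use warnings;

# Hypothetical common filter: any object with a filter($input) method.
package My::Filter::Upper;
sub new    { return bless {}, $_[0] }
sub filter { my ($self, $input) = @_; return uc $input }

package main;

# Hypothetical adapter: wrap a filter object in a code ref. The callback
# uses its first argument if one is given, and falls back to $_ otherwise.
sub as_callback {
    my ($filter) = @_;
    return sub { return $filter->filter(@_ ? $_[0] : $_) };
}

my $callback = as_callback(My::Filter::Upper->new);
my $shouted  = $callback->('hello');
print "$shouted\n";   # HELLO
```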

It looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, and that's more than I need.

Thoughts?


Text::Pipe?

hanekomu on 2007-10-04T08:48:38

I like the "more ideal" solution of having separate text filters. Since Text::Filter is taken, how about Text::Pipe? After all, the factory method should be able to give you not just one filter, but several filters piped together.

And I wouldn't put the factory in a ::Common module; just call it Text::Pipe::Factory. It generates "pipe segments" that are Text::Pipe::* objects, all of which are derived from Text::Pipe::Base.

Several pipe segments, piped together, could themselves be pipe segments.

Text::Pipe::* objects could have '|' overloaded so you can combine them in a TT-like syntax.
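
Overloading '|' is straightforward with the core overload pragma. The following is only a sketch of the idea: Text::Pipe::Sketch is a hypothetical class name, and each segment simply wraps a code ref.

```perl
use strict;
use warnings;

package Text::Pipe::Sketch;   # hypothetical name, for illustration only

use overload '|' => \&compose;

sub new {
    my ($class, $code) = @_;
    return bless { code => $code }, $class;
}

sub filter {
    my ($self, $input) = @_;
    return $self->{code}->($input);
}

# $left | $right builds a new segment that runs $left first, then $right.
sub compose {
    my ($left, $right, $swapped) = @_;
    ($left, $right) = ($right, $left) if $swapped;
    return Text::Pipe::Sketch->new(
        sub { return $right->filter($left->filter($_[0])) }
    );
}

package main;

my $trim  = Text::Pipe::Sketch->new(sub { my $s = $_[0]; $s =~ s/^\s+|\s+$//g; return $s });
my $upper = Text::Pipe::Sketch->new(sub { return uc $_[0] });

my $pipe   = $trim | $upper;
my $piped  = $pipe->filter("  hello  ");
print "$piped\n";   # HELLO
```

Since the result of '|' is itself a segment, longer chains such as $a | $b | $c work without any extra code.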

Then again, maybe you don't want to limit yourself to piping text; how about arbitrary data structures? E.g., one pipe segment could take an array and reduce(). But maybe that's going too far. (I had such an idea many years ago but didn't follow up on it.) Piping text is fine.

Your example regex that deals with erroneous spaces would itself be a pipe segment, something like Text::Pipe::Trim.

Re:Text::Pipe?

miyagawa on 2007-10-04T09:02:55

I don't care much about names, but I disagree with letting Text::Pipe itself hold several stacked filters: because all filters share the same single filter interface, it doesn't need to.

Creating a stacked pipe is easy by creating a new Pipe stacker object, like:

use Text::Pipe::Stackable;
use Text::Pipe;
 
my $pipe1 = Text::Pipe->new('foo');
my $pipe2 = Text::Pipe->new('bar');
my $pipe3 = Text::Pipe->new('baz');
 
my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);
 
my $output = $stacked_pipe->filter($input);

I don't care much about the class structure either, but it needs to be easy, and to require little enough code, that more developers are able to write a new adapter for a new text filtering engine.

But well, it seems like a bike-shed discussion to me. The detailed API could be improved anytime once the development starts. The important thing is to know if it's a good thing or completely useless.

I'm also interested in writing a pipe for arbitrary data structures, like a reduce() or trim() that works on an array ref. Go look at the Test::Base::Filter module that INGY created a while ago. It has several filter functions that operate on both strings and arrays.
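
As a rough illustration of a filter that handles both cases, here is a hypothetical trim() that accepts either a string or an array ref of strings. It is not taken from Test::Base::Filter; it only shows the shape such a function could have.

```perl
use strict;
use warnings;

# Hypothetical filter: trims a single string, or every element of an
# array ref, and returns the same kind of thing it was given.
sub trim {
    my ($input) = @_;
    return [ map { trim($_) } @$input ] if ref $input eq 'ARRAY';
    (my $copy = $input) =~ s/^\s+|\s+$//g;
    return $copy;
}

my $single = trim("  foo ");
my $many   = trim([ " a", "b ", " c " ]);

print "$single\n";               # foo
print join(',', @$many), "\n";   # a,b,c
```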

Re:Text::Pipe?

hanekomu on 2007-10-04T09:13:35

Agreed re bike-shed discussion; one more point though:

        my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);

Yes, that's a better design pattern. In that case, Text::Pipe::Stackable->new() should be able to take both individual segments and other Text::Pipe::Stackable objects (for a kind of recursive construction).

That is, stacked pipes should - to the user - be indistinguishable from individual pipe segments. Each is just a black box that has an input and an output.
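
A minimal sketch of that recursive construction, assuming every segment exposes the same filter($input) method. My::Pipe and My::Pipe::Stackable are hypothetical stand-ins for the classes discussed in this thread, not an existing API.

```perl
use strict;
use warnings;

# Hypothetical single segment: wraps one code ref.
package My::Pipe;
sub new    { my ($class, $code) = @_; return bless { code => $code }, $class }
sub filter { my ($self, $input) = @_; return $self->{code}->($input) }

# Hypothetical stacker: holds anything that has a filter() method,
# including other stackers.
package My::Pipe::Stackable;
sub new {
    my ($class, @segments) = @_;
    return bless { segments => [@segments] }, $class;
}
sub filter {
    my ($self, $input) = @_;
    $input = $_->filter($input) for @{ $self->{segments} };
    return $input;
}

package main;

my $trim    = My::Pipe->new(sub { my $s = $_[0]; $s =~ s/^\s+|\s+$//g; return $s });
my $lower   = My::Pipe->new(sub { return lc $_[0] });
my $exclaim = My::Pipe->new(sub { return $_[0] . '!' });

my $inner  = My::Pipe::Stackable->new($trim, $lower);
my $outer  = My::Pipe::Stackable->new($inner, $exclaim);   # a stack inside a stack
my $nested = $outer->filter("  Hello ");
print "$nested\n";   # hello!
```

The stacker never checks what kind of object it holds, which is what lets a stacked pipe stand in for a single segment.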

Or, in the case of multiplexers, several outputs. Or with reductors, several inputs. Whatever. :)

Re:Text::Pipe?

miyagawa on 2007-10-04T09:17:00

Yeah, it's called the Composite design pattern, and it's also a Decorator.

You mean like Formatter

perigrin on 2007-10-06T10:06:31

Kjetil had this same issue several years ago with his Wiki software. His solution was http://search.cpan.org/user/kjetilk/Formatter-0.95/

I think it's a good idea

draven on 2007-10-08T21:48:00

I've been thinking similar thoughts while working on the formatter chain of MojoMojo. I think there might be two distinct types of formatters though: formatters that can work on streams, and formatters that work on distinct pieces of content.

I also think there might be some benefit to providing some more pluggable basic formatters, like HTML or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.