I've been playing with iterators at work this week. It's a great way to create stackable filters for a stream of input.
What I'm trying to accomplish is write a pattern matcher that will identify certain common multi-line patterns in an input file -- such as a diff between two XML files. I've already done a decent amount of post-processing, and don't want to keep munging my post-processors to make all of the meaningless differences conform.
One set of files I'm dealing with is about 250,000 lines of XML output, about 7.5MB. The diff between the old and new versions is about 1.4MB, or about 1600 divergent blocks. Most of these differences are effectively meaningless -- open+close tags vs. empty tags, extraneous whitespace, etc. Filtering out those meaningless differences, I can isolate the ~200 divergent sections that I really need to fix.
I suppose I could read in the whole megabyte diff and then iteratively run a series of regexes over the input to weed out the meaningless diffs. But that just felt a wee bit too hinky. My current solution works something like this:
my $generator = sub {return scalar <>}; my $filter = make_filter($generator, @pattern1); $filter = make_filter($filter, @pattern2); $filter = make_filter($filter, @pattern3); [...] ## Read the entire file and ignore each pattern I don't care about print while (defined($_ = $filter->()));This has been really working wonderfully. Except that I can't seem to find a tremendous amount of literature about iterators (I managed to avoid Common Lisp until now; and it's starting to bite me).
Iterators are mentioned fairly frequently in the pattern literature, but the discussion generally assumes that you know how or have the libraries around to create and use them. They're generally very simple: the Java java.util.Iterator
interface just has the operations hasNext()
(are there any more?), next()
(returning the next item) and the optional remove()
which deletes the last element kicked out of the iterator.
MJD has some excellent articles that center on or around iterators. In fact, after seeing one of his talks at last year's YAPC I sent up to the dorm and created an iterator class for SPOPS so you don't have to deal with an entire query full of records at once.
Re:Iterator info
ziggy on 2002-05-02T17:59:00
I've looked into the GoF discussion about iterators, and it looks like the Iterator pattern is simply an interface that classes can implement, or an adaptor to translate a tree into a linear sequence. The java.util.Iterator interface is a case in point. These also dominate the search results at google.What I'm looking for is deeper than that. For example:
Note that these filters are both statically bound and composable. Once the initial generator is finished, I can't reuse the pipeline I've defined that ignores two classes of input. Also, the behavior of the iterator that's applying 0/1/many filters is identical to the simple generator with no filters tacked on.sub make_iterator($$) {
my $next = shift;
my $pattern = shift;
return sub {
my $line = $next->();
while (defined ($line) && $line =~ m/$pattern/) {
$line = $next();
}
return $line;
};
}
my $generator = sub {return scalar <>};
my $filter = make_filter($generator, qr{^$});
$filter = make_filter($filter, qr{^skip});
print while (defined ($_ = $filter->()));Surely this is possible with Java/C++ STL, but that's not the idiom I've seen used with those languages, where classes that unroll trees (e.g. an XML DOM) into a serial data structure (e.g. SAX events, XMLPull) are much more common.
I've talked to MJD about this, and he says most of the literature I'm looking for is scattered about the Common Lisp universe. He also mentioned a similar pattern called gatherer that is sort of an inverted iterator:
my $gatherer = make_gatherer(...);
while (<>) {
$gatherer->next($_);
}
print $gatherer->results();Re:Iterator info
darobin on 2002-05-02T19:22:25
My memory is quite fuzzy on this, but I seem to recall that DOM2's chapter called DOM Traversals has tree unrollers that are a bit more beefed up than the vanilla ones. Not sure that it fits your problem (or that I remember correctly) but you might want to check them out.
co-routines
jmm on 2002-05-02T19:42:13
If you have a language that supports co-routines nicely, you can easily chain together iterators and gatherers.
It works essentially the same way as STDIN and STDOUT on a pipeline. Each filter does a$val = resume( $predecessor, $val )
to get the next item that it is to filter, and does$val = resume( $successor, $val )
to pass on the next item that it has generated. (Having the $val item be passed to the other both ways allows each co-routine to choose what buffer is used rather than having a predefined mailbox that stuff has to be copied into.)Re:Iterator info
Matts on 2002-05-03T11:37:41
This looks awfully like AxKit's post-processing filter stuff (useful for adding in dynamic adverts into static content). Here's the code (slightly modified for sanity):Of course it's static (using @_) rather than pulling from a file descriptor, but it's interesting to see people using similar patterns.sub get_output_transformer {
my $func = sub { @_ };
my $actually_transform = 0;
foreach my $AxOutputTransformer ( $AxKit::Cfg->OutputTransformers() ) {
$actually_transform = 1;
my $outputfunc = $func;
no strict 'refs';
$func = sub {
map { &{$AxOutputTransformer}( $_ ) } ($outputfunc->(@_));
};
}
return wantarray ? ($func, $actually_transform) : $func;
}