Putting failure to good use

Alias on 2008-01-21T04:14:01

HTML::TrackerLink is one of the first modules I ever wrote.

I created it for my CVS Monitor web app to let people flexible create links to bug tracking systems in their log messages, bugzilla style.

You give H:TL a set of name to uri mappings ( bug => 'http://host/path/bug/%n' ) and it will scan through (HTML-escaped) plain text and find any instances of "bug #1234" or "BuG 1234" or even just "#1234" (if you set a default mapping) and convert them into HTML links.

I've used this a few times, but I always had an issue mixing the "bug #1234" form and the "#1234" form together. You can't use a simple regex and transform one after the other, because the short form will exist in the post-processed version of the long form, and running the short form first catches cases that should match the long form.

This is a big problem when you are mapping to multiple systems.

Enter the utter disaster that was my first attempt at PPI.

When I first sat down to write a Perl parser, the first idea I had to was to use regexes (like everyone else does).

Since I had no idea how to write parsers except for some notional idea you should tokenizer before you build a syntax tree, I decided I'd start by creating a whole collection of regexes for each language element, then just "apply them in parallel" by setting the regex cursor, running every "language element" regex, and seeing which one matched at the earliest character.

Of course, it failed miserably, because to do that you'd need some form of grammar for Perl, and no such grammar exists.

On the weekend, I managed to finally get around to reimplementing this idea to fix HTML::TrackerLink.

Now what I do is build a set of two regex to CODE mappings (one for "with keyword" and one "without") and run them BOTH against the string, run the matched string through the transform code, then move the regex cursor to the end of the successful match. Then lather, rinse, repeat.

I was initially going to just use Regexp::Subst::Parallel but this turned out to be one of those cases where it pays to look at the source code.

Luke's version uses the search regex directly, but anchors them to immediately after the cursor position. If they fail, he steps forward one character and repeats ALL the regexes.

This isn't TOO bad in cases where the the target string primarily contains things that fall inside a match, but is horrendously slow for "sparse matching" like my case where there might be only 2 or 3 matches in several dozen lines of text.

To deal with this, my version prepends a \G(.*?) to capture the entire area before the actual match. When I hit a match, I just dump $1 directly onto the output string before running the transform.

For anyone interested, here's the final code for the (fast) parellising search/replace. (The calling convention is identical to the function form of Regexp::Subst::Parallel).

sub subst {
	my $input = shift;

	# Map the match regex to capture everything BEFORE the match,
	# and the entire pattern provided.
	# (We'll provide them as the first params)
	my @try = map { [ qr/\G(.*?)($_->[0])/s => $_->[1] ] } @_;
	unless ( @try ) {
		# Handle the pathological no-replace case
		return $input;
	}

	# Start the main loop
	my $position = 0;
	my $len      = length $input;
	my $output   = '';
	while ( $position < $len ) {
		my $found = undef;
		my @start = ();
		my @end   = ();
		foreach my $r ( @try ) {
			# Skip if it is not in the string
			pos $input = $position;
			next unless $input =~ $r->[0];

			# Skip if it DOESN'T match earlier
			if ( $found and $start[2] <= $-[2] ) {
				next;
			}

			# This is the best option.
			# Save the matching regex
			$found = $r->[1];
			@start = @-;
			@end   = @+;
		}

		# Break out if no more matches
		last unless $found;

		# Append the pre-match string to the output
		$output .= substr( $input, $start[1], $end[1] - $start[1] );

		# Pass the rest to the transform function
		my $rv = $found->(
			map {
				substr( $input, $start[$_], $end[$_] - $start[$_] )
			} 0 .. $#end
		);
		unless ( defined $rv ) {
			# Transform is signaling an error
			return undef;
		}

		# Transform completed ok
		$output .= $rv;

		# Move the match position for the next iteration
		$position = $end[2];
	}

	# Append the remainder of the string
	return $output . substr( $input, $position );
}