Finally some perl!

Matts on 2006-05-31T22:01:41

Yes it's been yonks since I posted any perl. Well today I learned that read() can take an offset to where to put data in your $buf, so I can implement what should be an efficient grep for multiple strings in binary data (i.e. where I can't do: while (<$fh>) ). So given $fh and @strings and max() from List::Util, I can do this:

    my $max_len = max(map{length} @strings);
    my $regexp = "(" . join("|", map {quotemeta} @strings) . ")";
    my $buf = '';
    while (1) {
        substr($buf, 0, length($buf) - $max_len) = "";
        my $len = read($fh, $buf, 8096, length($buf));
        last unless $len;
        if ($buf =~ /$regexp/o) {
            return $1;
        }
    }
I could probably add code to show where in the file it matched, but I don't need that.


Use array storage instead?

ChrisDolan on 2006-06-01T01:23:48

I've found something more like the following to be useful:

                my $max_len = max(8096,map{length} @strings);
                my $regexp = "(" . join("|", map {quotemeta} @strings) . ")";
                my @buf = ('');
                while (1) {
                        my $len = read($fh, $buf[1], $max_len);
                        last unless $len;
                        if ("@buf" =~ /$regexp/o) {
                                return $1;
                        }
                        shift @buf;
                }

I've not benchmarked which solution is faster, but I'd speculate that the substr is slower than the concat plus shift. Note that this implementation pulls a minimum of 8096 bytes at a time, but it might be longer is the match strings are longer.

Re:Use array storage instead?

Matts on 2006-06-01T01:51:30

Think again:
      s/iter chris  matt
chris   4.61    --  -50%
matt    2.32   99%    --

/o is *evil*

duff on 2006-06-01T14:46:49

I hope you're using this code only as part of the mainline of a script and not as a subroutine or in a module. But even still, it's often the case that scripts turn into modules at some point, so don't use /o if you can help it. It doesn't really buy you anything and could actually cause problems as your program grows. If you think you need /o, you really just need qr//. When someone uses /o it is almost always a case of premature optimization (99.99999999999% of the time :-).

Re:/o is *evil*

Matts on 2006-06-01T14:52:36

True. Laziness was in the way. Changed to qr() now.

Use a meaningful loop termination condition

Aristotle on 2006-06-01T20:28:03

It’s easy to arrange:

my $max_len = max map length, @strings;
my $rx = qr/(@{[ join '|', map quotemeta, @strings ]})/;
my $buf = '';
while ( read $fh, $buf, 8096, length $buf ) {
    return $1 if $buf =~ $rx;
    substr $buf, 0, length $buf - $max_len, '';
}

As a bonus the code is shorter and clearer.

The following is a tweak to avoid unnecessarily shrinking $buf when the space is needed for the read that immediately follows, and it may or may not be faster by a few percent. I didn’t benchmark (or even test) it.

my $max_len = max map length, @strings;
my $rx = qr/(@{[ join '|', map quotemeta, @strings ]})/;
my $buf = '';
my $offs = 0;
while ( my $read = read $fh, $buf, 8096, $offs ) {
    substr( $buf, $offs + $read ) = ''; # throw away residue
    return $1 if $buf =~ $rx;
    substr( $buf, 0, $max_len ) = substr $buf, -$max_len;
    $offs = $max_len;
}