"avoid grep in boolean context" is premature optimization

bart on 2009-02-13T22:23:16

(Note: a first draft of this post has been lying on my shelf for over a year. Now, I finally got around to finishing it.)

Now and then in Perl forums, the discussion comes up about how bad it is to use grep in boolean context. And now it's even been poured into a Perl::Critic rule, based on Damian Conway's PBP book, and the argument they bring up is always the same (see the POD for the module):

Using grep in boolean context is a common idiom for checking if any elements in a list match a condition. This works because boolean context is a subset of scalar context, and grep returns the number of matches in scalar context. A non-zero number of matches means a match.

But consider the case of a long array where the first element is a match. Boolean grep still checks all of the rest of the elements needlessly. Instead, a better solution is to use the any function from List::MoreUtils, which short-circuits after the first successful match to save time.

Now, why would you expect the item you're looking for to be the first one in the array? I see no reason for such an assumption, at all. On average, you still have to look through about half of the array items, if the item is even in the array. If it isn't, you still have to look through all items, anyway.

So, depending on the likelihood that an item is in the array, you might save between 0% and 50% of execution time by leaving the loop early. Personally, I don't find that overly impressive... As O(n) is still O(n).

The implied assumption is that the overhead of grep and any is ignorable, or at least, that it is the same for both, somebody that nobody actually bothered to verify. Well, I bothered. I decided to benchmark grep vs. any.

The kind of code that I benchmarked is the simple common problem of testing for the presence of a string in an array, just like IN in (some dialects of) SQL, and in_array of PHP. Here are the prerequisites and the functions I tested:

my @letters = 'A' .. 'Z';
my %letter; $letter{$_} = 1 foreach @letters;

# List::MoreUtils' any
any => sub { my $x = any { $_ eq $search } @letters },
# grep with expression
grepE => sub { my $x = grep $_ eq $search, @letters },
# grep with block
grepB => sub { my $x = grep { $_ eq $search } @letters },
# explicitly written code with foreach and last
foreach => sub { my $x=0; $_ eq $search and $x=1, last foreach @letters; },
# the expected overall winner: prefilled hash
hash => sub { my $x = exists $letter{$search} },
# hash on the fly, rebuilt on every test
temphash => sub { my %h; @h{@letters}=(); my $x = exists $h{$search} },

I searched for 'Z' (last item), 'M' (center item), 'C' (pretty up front), 'A' (first item), and 'banana' (not in the list). And here are the results (ActivePerl 5.8.8, Windows XP, 2.4GHz):

Search for Z
               Rate     Time   any temphash foreach grepB grepE hash
any         50172/s  19.93µs    --     -60%    -65%  -69%  -71% -98%
temphash   125555/s  7.965µs  150%       --    -12%  -22%  -27% -96%
foreach    142037/s   7.04µs  183%      13%      --  -11%  -18% -95%
grepB      160313/s  6.238µs  220%      28%     13%    --   -7% -94%
grepE      172645/s  5.792µs  244%      38%     22%    8%    -- -94%
hash      2828893/s 0.3535µs 5538%    2153%   1892% 1665% 1539%   --

Search for M
               Rate     Time   any temphash grepB grepE foreach hash
any         65938/s  15.17µs    --     -47%  -59%  -62%    -74% -98%
temphash   124203/s  8.051µs   88%       --  -22%  -29%    -51% -95%
grepB      160049/s  6.248µs  143%      29%    --   -8%    -36% -94%
grepE      174201/s   5.74µs  164%      40%    9%    --    -31% -94%
foreach    251362/s  3.978µs  281%     102%   57%   44%      -- -91%
hash      2733577/s 0.3658µs 4046%    2101% 1608% 1469%    988%   --

Search for C
               Rate     Time   any temphash grepB grepE foreach hash
any         87975/s  11.37µs    --     -29%  -45%  -50%    -85% -97%
temphash   124136/s  8.056µs   41%       --  -22%  -29%    -78% -96%
grepB      158854/s  6.295µs   81%      28%    --   -9%    -72% -95%
grepE      175068/s  5.712µs   99%      41%   10%    --    -69% -94%
foreach    571092/s  1.751µs  549%     360%  260%  226%      -- -80%
hash      2926393/s 0.3417µs 3226%    2257% 1742% 1572%    412%   --

Search for A
               Rate     Time   any temphash grepB grepE foreach hash
any         94486/s  10.58µs    --     -23%  -40%  -47%    -90% -96%
temphash   123096/s  8.124µs   30%       --  -22%  -30%    -87% -95%
grepB      158407/s  6.313µs   68%      29%    --  -10%    -84% -94%
grepE      176637/s  5.661µs   87%      43%   12%    --    -82% -93%
foreach    978762/s  1.022µs  936%     695%  518%  454%      -- -64%
hash      2687559/s 0.3721µs 2744%    2083% 1597% 1422%    175%   --

Search for banana
               Rate     Time   any temphash foreach grepB grepE hash
any         51867/s  19.28µs    --     -58%    -62%  -69%  -72% -98%
temphash   123717/s  8.083µs  139%       --     -9%  -25%  -34% -95%
foreach    136015/s  7.352µs  162%      10%      --  -18%  -28% -95%
grepB      165077/s  6.058µs  218%      33%     21%    --  -12% -94%
grepE      187691/s  5.328µs  262%      52%     38%   14%    -- -93%
hash      2670255/s 0.3745µs 5048%    2058%   1863% 1518% 1323%   --

Well, that is looking bad for any: it is the worst performer in all cases. In the average case ('M'), it is almost 3 times slower than grep (grep with expression is, as I expected, a bit better than grep with a block, as the latter has an overhead of entering/exiting a lexical block). Even in its best case, any is still almost twice as slow as grep. So much for saving. The manually written loop is, in the average case, about a third faster than grep. But, if the item isn't found, it is actually slower.

No surprise that, if you really want a high speed test, and you need to test against the same array often, it is best to prepare a hash and simply test if the item is in the hash.

What I found rather surprising, is that populating a hash and using it once ('temphash'), isn't such a bad performer.

For completeness' sake, here's the same benchmark on Strawberry Perl 5.10.

Search for Z
               Rate     Time   any temphash foreach grepE grepB hash
any         46995/s  21.28µs    --     -60%    -62%  -65%  -67% -98%
temphash   117471/s  8.513µs  150%       --     -6%  -13%  -19% -96%
foreach    125313/s   7.98µs  167%       7%      --   -7%  -13% -95%
grepE      135294/s  7.391µs  188%      15%      8%    --   -6% -95%
grepB      144589/s  6.916µs  208%      23%     15%    7%    -- -95%
hash      2740628/s 0.3649µs 5732%    2233%   2087% 1926% 1795%   --

Search for M
               Rate     Time   any temphash grepE grepB foreach hash
any         63959/s  15.64µs    --     -46%  -54%  -55%    -71% -97%
temphash   118040/s  8.472µs   85%       --  -15%  -18%    -46% -95%
grepE      138646/s  7.213µs  117%      17%    --   -3%    -36% -94%
grepB      143388/s  6.974µs  124%      21%    3%    --    -34% -94%
foreach    218040/s  4.586µs  241%      85%   57%   52%      -- -91%
hash      2516377/s 0.3974µs 3834%    2032% 1715% 1655%   1054%   --

Search for C
               Rate     Time   any temphash grepE grepB foreach hash
any         86710/s  11.53µs    --     -25%  -36%  -39%    -85% -96%
temphash   115756/s  8.639µs   33%       --  -15%  -19%    -79% -95%
grepE      135977/s  7.354µs   57%      17%    --   -5%    -76% -94%
grepB      142657/s   7.01µs   65%      23%    5%    --    -75% -94%
foreach    559639/s  1.787µs  545%     383%  312%  292%      -- -76%
hash      2315599/s 0.4319µs 2571%    1900% 1603% 1523%    314%   --

Search for A
               Rate     Time   any temphash grepE grepB foreach hash
any         85902/s  11.64µs    --     -26%  -36%  -41%    -90% -96%
temphash   115752/s  8.639µs   35%       --  -14%  -21%    -87% -95%
grepE      134479/s  7.436µs   57%      16%    --   -8%    -85% -94%
grepB      146193/s   6.84µs   70%      26%    9%    --    -84% -94%
foreach    902933/s  1.108µs  951%     680%  571%  518%      -- -62%
hash      2405894/s 0.4156µs 2701%    1978% 1689% 1546%    166%   --

Search for banana
               Rate     Time   any temphash foreach grepE grepB hash
any         51088/s  19.57µs    --     -56%    -61%  -69%  -71% -98%
temphash   116528/s  8.582µs  128%       --    -12%  -29%  -33% -96%
foreach    132626/s   7.54µs  160%      14%      --  -19%  -24% -95%
grepE      164236/s  6.089µs  221%      41%     24%    --   -6% -94%
grepB      175040/s  5.713µs  243%      50%     32%    7%    -- -93%
hash      2655952/s 0.3765µs 5099%    2179%   1903% 1517% 1417%   --

As an aside: I've got the distinct impression that overall, Strawberry Perl 5.10 is slower than ActivePerl 5.10, by about 5-10%. The biggest surprise to me, however, is that grep BLOCK is no longer faster than grep EXPRESSION. So weird.

Now, the thing you would test for must not always be that simple. If it is very slow or otherwise expensive, you might still, and rightfully so, want to leave the testing loop early. But, although elegant as a solution, you should not blindly assume that any is the best solution for every such problem... Though I wish that grep would be made smarter, and that it either knows to leave the test loop early, or that you could manually do so, for example, with last.

p.s. The code to convert Rate to Time was kindly supplied by GrandFather on Perlmonks.

p.p.s. I'm sorry about the formatting problem, that "µ" should look like "µ", but even though it is a "µ" character in the text I posted, the journaling system shows it as "µ". It's a bug, I'm sure.


Who cares about the sppeed?

autarch on 2009-02-14T00:10:22

They are all damn fast, and using any reads much better.

Your reasoning is flawed

Arador on 2009-02-14T01:26:42

Your reasoning is flawed. You are assuming that there is one positive match in the set, and that the set is small. On such a small set you're right any isn't any faster, but superior algorithms rarely show up on small datasets.

If you would test this with a larger dataset, any would come out much better. On my computer, using the list 1..1000 and a needle of 500, it is in fact faster. The advantage only gets bigger when the haystack gets bigger.

You're right it's not always better, but it is when it matters the most.

Re:Your reasoning is flawed

Smylers on 2009-02-14T08:51:56

I agree. With such a small set no method is going to take an irritatingly long time. Since you have the program handy could you give it a quick tweak to use a bigger array and re-post the results?

I also agree with autarch that any reads better, as a way of signalling programmer intent. I'd probably use the any from Perl6::Junction rather than List::MoreUtils; any chance you could add that to your testing as well? Thanks!

List::Util::first

srezic on 2009-02-14T20:18:23

Can you test how good would perform List::Util::first()?

Re:List::Util::first

bart on 2009-02-15T14:03:44

Well, you won't like it... Using the code:

first => sub { my $x = defined first { $_ eq $search } @letters },

first is always close to 20% slower than any...

Search for Z
          Rate    Time first  any
first  38653/s 25.87us    -- -21%
any    48955/s 20.43us   27%   --

Search for M
          Rate    Time first  any
first  49484/s 20.21us    -- -22%
any    63620/s 15.72us   29%   --

Search for C
          Rate    Time first  any
first  67151/s 14.89us    -- -18%
any    81812/s 12.22us   22%   --

Search for A
          Rate    Time first  any
first  73125/s 13.68us    -- -21%
any    92772/s 10.78us   27%   --

Search for banana
          Rate    Time first  any
first  40513/s 24.68us    -- -19%
any    50019/s 19.99us   23%   --

Re:List::Util::first

srezic on 2009-02-15T19:44:42

Your observation seems to be only true for small arrays. In this plot you can see how first() gets better with larger arrays.

In this benchmark grep and first were searching for the middle element in the array.