"avoid grep in boolean context" is premature optimization

bart on 2009-02-13T22:23:16

(Note: a first draft of this post has been lying on my shelf for over a year. Now, I finally got around to finishing it.)

Now and then in Perl forums, the discussion comes up about how bad it is to use grep in boolean context. And now it's even been poured into a Perl::Critic rule, based on Damian Conway's PBP book, and the argument they bring up is always the same (see the POD for the module):

Using grep in boolean context is a common idiom for checking if any elements in a list match a condition. This works because boolean context is a subset of scalar context, and grep returns the number of matches in scalar context. A non-zero number of matches means a match.

But consider the case of a long array where the first element is a match. Boolean grep still checks all of the rest of the elements needlessly. Instead, a better solution is to use the any function from List::MoreUtils, which short-circuits after the first successful match to save time.

Now, why would you expect the item you're looking for to be the first one in the array? I see no reason for such an assumption, at all. On average, you still have to look through about half of the array items, if the item is even in the array. If it isn't, you still have to look through all items, anyway.

So, depending on the likelihood that an item is in the array, you might save between 0% and 50% of execution time by leaving the loop early. Personally, I don't find that overly impressive... As O(n) is still O(n).

The implied assumption is that the overhead of grep and any is ignorable, or at least, that it is the same for both, somebody that nobody actually bothered to verify. Well, I bothered. I decided to benchmark grep vs. any.

The kind of code that I benchmarked is the simple common problem of testing for the presence of a string in an array, just like IN in (some dialects of) SQL, and in_array of PHP. Here are the prerequisites and the functions I tested:

my @letters = 'A' .. 'Z';
my %letter; $letter{$_} = 1 foreach @letters;

# List::MoreUtils' any
any => sub { my $x = any { $_ eq $search } @letters },
# grep with expression
grepE => sub { my $x = grep $_ eq $search, @letters },
# grep with block
grepB => sub { my $x = grep { $_ eq $search } @letters },
# explicitly written code with foreach and last
foreach => sub { my $x=0; $_ eq $search and $x=1, last foreach @letters; },
# the expected overall winner: prefilled hash
hash => sub { my $x = exists $letter{$search} },
# hash on the fly, rebuilt on every test
temphash => sub { my %h; @h{@letters}=(); my $x = exists $h{$search} },

I searched for 'Z' (last item), 'M' (center item), 'C' (pretty up front), 'A' (first item), and 'banana' (not in the list). And here are the results (ActivePerl 5.8.8, Windows XP, 2.4GHz):

Search for Z
               Rate     Time   any temphash foreach grepB grepE hash
any         50172/s  19.93ÃÂµs    --     -60%    -65%  -69%  -71% -98%
temphash   125555/s  7.965ÃÂµs  150%       --    -12%  -22%  -27% -96%
foreach    142037/s   7.04ÃÂµs  183%      13%      --  -11%  -18% -95%
grepB      160313/s  6.238ÃÂµs  220%      28%     13%    --   -7% -94%
grepE      172645/s  5.792ÃÂµs  244%      38%     22%    8%    -- -94%
hash      2828893/s 0.3535ÃÂµs 5538%    2153%   1892% 1665% 1539%   --

Search for M
               Rate     Time   any temphash grepB grepE foreach hash
any         65938/s  15.17ÃÂµs    --     -47%  -59%  -62%    -74% -98%
temphash   124203/s  8.051ÃÂµs   88%       --  -22%  -29%    -51% -95%
grepB      160049/s  6.248ÃÂµs  143%      29%    --   -8%    -36% -94%
grepE      174201/s   5.74ÃÂµs  164%      40%    9%    --    -31% -94%
foreach    251362/s  3.978ÃÂµs  281%     102%   57%   44%      -- -91%
hash      2733577/s 0.3658ÃÂµs 4046%    2101% 1608% 1469%    988%   --

Search for C
               Rate     Time   any temphash grepB grepE foreach hash
any         87975/s  11.37ÃÂµs    --     -29%  -45%  -50%    -85% -97%
temphash   124136/s  8.056ÃÂµs   41%       --  -22%  -29%    -78% -96%
grepB      158854/s  6.295ÃÂµs   81%      28%    --   -9%    -72% -95%
grepE      175068/s  5.712ÃÂµs   99%      41%   10%    --    -69% -94%
foreach    571092/s  1.751ÃÂµs  549%     360%  260%  226%      -- -80%
hash      2926393/s 0.3417ÃÂµs 3226%    2257% 1742% 1572%    412%   --

Search for A
               Rate     Time   any temphash grepB grepE foreach hash
any         94486/s  10.58ÃÂµs    --     -23%  -40%  -47%    -90% -96%
temphash   123096/s  8.124ÃÂµs   30%       --  -22%  -30%    -87% -95%
grepB      158407/s  6.313ÃÂµs   68%      29%    --  -10%    -84% -94%
grepE      176637/s  5.661ÃÂµs   87%      43%   12%    --    -82% -93%
foreach    978762/s  1.022ÃÂµs  936%     695%  518%  454%      -- -64%
hash      2687559/s 0.3721ÃÂµs 2744%    2083% 1597% 1422%    175%   --

Search for banana
               Rate     Time   any temphash foreach grepB grepE hash
any         51867/s  19.28ÃÂµs    --     -58%    -62%  -69%  -72% -98%
temphash   123717/s  8.083ÃÂµs  139%       --     -9%  -25%  -34% -95%
foreach    136015/s  7.352ÃÂµs  162%      10%      --  -18%  -28% -95%
grepB      165077/s  6.058ÃÂµs  218%      33%     21%    --  -12% -94%
grepE      187691/s  5.328ÃÂµs  262%      52%     38%   14%    -- -93%
hash      2670255/s 0.3745ÃÂµs 5048%    2058%   1863% 1518% 1323%   --

Well, that is looking bad for any: it is the worst performer in all cases. In the average case ('M'), it is almost 3 times slower than grep (grep with expression is, as I expected, a bit better than grep with a block, as the latter has an overhead of entering/exiting a lexical block). Even in its best case, any is still almost twice as slow as grep. So much for saving. The manually written loop is, in the average case, about a third faster than grep. But, if the item isn't found, it is actually slower.

No surprise that, if you really want a high speed test, and you need to test against the same array often, it is best to prepare a hash and simply test if the item is in the hash.

What I found rather surprising, is that populating a hash and using it once ('temphash'), isn't such a bad performer.

For completeness' sake, here's the same benchmark on Strawberry Perl 5.10.

Search for Z
               Rate     Time   any temphash foreach grepE grepB hash
any         46995/s  21.28ÃÂµs    --     -60%    -62%  -65%  -67% -98%
temphash   117471/s  8.513ÃÂµs  150%       --     -6%  -13%  -19% -96%
foreach    125313/s   7.98ÃÂµs  167%       7%      --   -7%  -13% -95%
grepE      135294/s  7.391ÃÂµs  188%      15%      8%    --   -6% -95%
grepB      144589/s  6.916ÃÂµs  208%      23%     15%    7%    -- -95%
hash      2740628/s 0.3649ÃÂµs 5732%    2233%   2087% 1926% 1795%   --

Search for M
               Rate     Time   any temphash grepE grepB foreach hash
any         63959/s  15.64ÃÂµs    --     -46%  -54%  -55%    -71% -97%
temphash   118040/s  8.472ÃÂµs   85%       --  -15%  -18%    -46% -95%
grepE      138646/s  7.213ÃÂµs  117%      17%    --   -3%    -36% -94%
grepB      143388/s  6.974ÃÂµs  124%      21%    3%    --    -34% -94%
foreach    218040/s  4.586ÃÂµs  241%      85%   57%   52%      -- -91%
hash      2516377/s 0.3974ÃÂµs 3834%    2032% 1715% 1655%   1054%   --

Search for C
               Rate     Time   any temphash grepE grepB foreach hash
any         86710/s  11.53ÃÂµs    --     -25%  -36%  -39%    -85% -96%
temphash   115756/s  8.639ÃÂµs   33%       --  -15%  -19%    -79% -95%
grepE      135977/s  7.354ÃÂµs   57%      17%    --   -5%    -76% -94%
grepB      142657/s   7.01ÃÂµs   65%      23%    5%    --    -75% -94%
foreach    559639/s  1.787ÃÂµs  545%     383%  312%  292%      -- -76%
hash      2315599/s 0.4319ÃÂµs 2571%    1900% 1603% 1523%    314%   --

Search for A
               Rate     Time   any temphash grepE grepB foreach hash
any         85902/s  11.64ÃÂµs    --     -26%  -36%  -41%    -90% -96%
temphash   115752/s  8.639ÃÂµs   35%       --  -14%  -21%    -87% -95%
grepE      134479/s  7.436ÃÂµs   57%      16%    --   -8%    -85% -94%
grepB      146193/s   6.84ÃÂµs   70%      26%    9%    --    -84% -94%
foreach    902933/s  1.108ÃÂµs  951%     680%  571%  518%      -- -62%
hash      2405894/s 0.4156ÃÂµs 2701%    1978% 1689% 1546%    166%   --

Search for banana
               Rate     Time   any temphash foreach grepE grepB hash
any         51088/s  19.57ÃÂµs    --     -56%    -61%  -69%  -71% -98%
temphash   116528/s  8.582ÃÂµs  128%       --    -12%  -29%  -33% -96%
foreach    132626/s   7.54ÃÂµs  160%      14%      --  -19%  -24% -95%
grepE      164236/s  6.089ÃÂµs  221%      41%     24%    --   -6% -94%
grepB      175040/s  5.713ÃÂµs  243%      50%     32%    7%    -- -93%
hash      2655952/s 0.3765ÃÂµs 5099%    2179%   1903% 1517% 1417%   --

As an aside: I've got the distinct impression that overall, Strawberry Perl 5.10 is slower than ActivePerl 5.10, by about 5-10%. The biggest surprise to me, however, is that grep BLOCK is no longer faster than grep EXPRESSION. So weird.

Now, the thing you would test for must not always be that simple. If it is very slow or otherwise expensive, you might still, and rightfully so, want to leave the testing loop early. But, although elegant as a solution, you should not blindly assume that any is the best solution for every such problem... Though I wish that grep would be made smarter, and that it either knows to leave the test loop early, or that you could manually do so, for example, with last.

p.s. The code to convert Rate to Time was kindly supplied by GrandFather on Perlmonks.

p.p.s. I'm sorry about the formatting problem, that "µ" should look like "ÃÂµ", but even though it is a "ÃÂµ" character in the text I posted, the journaling system shows it as "µ". It's a bug, I'm sure.

Who cares about the sppeed?

autarch on 2009-02-14T00:10:22

They are all damn fast, and using any reads much better.

Your reasoning is flawed

Arador on 2009-02-14T01:26:42

Your reasoning is flawed. You are assuming that there is one positive match in the set, and that the set is small. On such a small set you're right any isn't any faster, but superior algorithms rarely show up on small datasets.

If you would test this with a larger dataset, any would come out much better. On my computer, using the list 1..1000 and a needle of 500, it is in fact faster. The advantage only gets bigger when the haystack gets bigger.

You're right it's not always better, but it is when it matters the most.

Re:Your reasoning is flawed

Smylers on 2009-02-14T08:51:56

I agree. With such a small set no method is going to take an irritatingly long time. Since you have the program handy could you give it a quick tweak to use a bigger array and re-post the results?
I also agree with autarch that any reads better, as a way of signalling programmer intent. I'd probably use the any from Perl6::Junction rather than List::MoreUtils; any chance you could add that to your testing as well? Thanks!

List::Util::first

srezic on 2009-02-14T20:18:23

Can you test how good would perform List::Util::first()?

Re:List::Util::first

bart on 2009-02-15T14:03:44
Well, you won't like it... Using the code:
first => sub { my $x = defined first { $_ eq $search } @letters },

first is always close to 20% slower than any...
Search for Z Rate Time first any first 38653/s 25.87us -- -21% any 48955/s 20.43us 27% -- Search for M Rate Time first any first 49484/s 20.21us -- -22% any 63620/s 15.72us 29% -- Search for C Rate Time first any first 67151/s 14.89us -- -18% any 81812/s 12.22us 22% -- Search for A Rate Time first any first 73125/s 13.68us -- -21% any 92772/s 10.78us 27% -- Search for banana Rate Time first any first 40513/s 24.68us -- -19% any 50019/s 19.99us 23% --

Re:List::Util::first

srezic on 2009-02-15T19:44:42
Your observation seems to be only true for small arrays. In this plot you can see how first() gets better with larger arrays.
In this benchmark grep and first were searching for the middle element in the array.