(Note: a first draft of this post has been lying on my shelf for over a year. Now, I finally got around to finishing it.)
Now and then in Perl forums, the discussion comes up about how bad it is to use grep in boolean context. And now it's even been poured into a Perl::Critic rule, based on Damian Conway's PBP book, and the argument they bring up is always the same (see the POD for the module):
Using
grep
in boolean context is a common idiom for checking if any elements in a list match a condition. This works because boolean context is a subset of scalar context, and grep returns the number of matches in scalar context. A non-zero number of matches means a match.But consider the case of a long array where the first element is a match. Boolean
grep
still checks all of the rest of the elements needlessly. Instead, a better solution is to use theany
function from List::MoreUtils, which short-circuits after the first successful match to save time.
Now, why would you expect the item you're looking for to be the first one in the array? I see no reason for such an assumption, at all. On average, you still have to look through about half of the array items, if the item is even in the array. If it isn't, you still have to look through all items, anyway.
So, depending on the likelihood that an item is in the array, you might save between 0% and 50% of execution time by leaving the loop early. Personally, I don't find that overly impressive... As O(n) is still O(n).
The implied assumption is that the overhead of grep
and any
is ignorable, or at least, that it is the same for both, somebody that nobody actually bothered to verify. Well, I bothered. I decided to benchmark grep
vs. any
.
The kind of code that I benchmarked is the simple common problem of testing for the presence of a string in an array, just like IN
in (some dialects of) SQL, and in_array
of PHP.
Here are the prerequisites and the functions I tested:
my @letters = 'A' .. 'Z'; my %letter; $letter{$_} = 1 foreach @letters; # List::MoreUtils' any any => sub { my $x = any { $_ eq $search } @letters }, # grep with expression grepE => sub { my $x = grep $_ eq $search, @letters }, # grep with block grepB => sub { my $x = grep { $_ eq $search } @letters }, # explicitly written code with foreach and last foreach => sub { my $x=0; $_ eq $search and $x=1, last foreach @letters; }, # the expected overall winner: prefilled hash hash => sub { my $x = exists $letter{$search} }, # hash on the fly, rebuilt on every test temphash => sub { my %h; @h{@letters}=(); my $x = exists $h{$search} },
I searched for 'Z' (last item), 'M' (center item), 'C' (pretty up front), 'A' (first item), and 'banana' (not in the list). And here are the results (ActivePerl 5.8.8, Windows XP, 2.4GHz):
Search for Z Rate Time any temphash foreach grepB grepE hash any 50172/s 19.93õs -- -60% -65% -69% -71% -98% temphash 125555/s 7.965õs 150% -- -12% -22% -27% -96% foreach 142037/s 7.04õs 183% 13% -- -11% -18% -95% grepB 160313/s 6.238õs 220% 28% 13% -- -7% -94% grepE 172645/s 5.792õs 244% 38% 22% 8% -- -94% hash 2828893/s 0.3535õs 5538% 2153% 1892% 1665% 1539% -- Search for M Rate Time any temphash grepB grepE foreach hash any 65938/s 15.17õs -- -47% -59% -62% -74% -98% temphash 124203/s 8.051õs 88% -- -22% -29% -51% -95% grepB 160049/s 6.248õs 143% 29% -- -8% -36% -94% grepE 174201/s 5.74õs 164% 40% 9% -- -31% -94% foreach 251362/s 3.978õs 281% 102% 57% 44% -- -91% hash 2733577/s 0.3658õs 4046% 2101% 1608% 1469% 988% -- Search for C Rate Time any temphash grepB grepE foreach hash any 87975/s 11.37õs -- -29% -45% -50% -85% -97% temphash 124136/s 8.056õs 41% -- -22% -29% -78% -96% grepB 158854/s 6.295õs 81% 28% -- -9% -72% -95% grepE 175068/s 5.712õs 99% 41% 10% -- -69% -94% foreach 571092/s 1.751õs 549% 360% 260% 226% -- -80% hash 2926393/s 0.3417õs 3226% 2257% 1742% 1572% 412% -- Search for A Rate Time any temphash grepB grepE foreach hash any 94486/s 10.58õs -- -23% -40% -47% -90% -96% temphash 123096/s 8.124õs 30% -- -22% -30% -87% -95% grepB 158407/s 6.313õs 68% 29% -- -10% -84% -94% grepE 176637/s 5.661õs 87% 43% 12% -- -82% -93% foreach 978762/s 1.022õs 936% 695% 518% 454% -- -64% hash 2687559/s 0.3721õs 2744% 2083% 1597% 1422% 175% -- Search for banana Rate Time any temphash foreach grepB grepE hash any 51867/s 19.28õs -- -58% -62% -69% -72% -98% temphash 123717/s 8.083õs 139% -- -9% -25% -34% -95% foreach 136015/s 7.352õs 162% 10% -- -18% -28% -95% grepB 165077/s 6.058õs 218% 33% 21% -- -12% -94% grepE 187691/s 5.328õs 262% 52% 38% 14% -- -93% hash 2670255/s 0.3745õs 5048% 2058% 1863% 1518% 1323% --
Well, that is looking bad for any
: it is the worst performer in all cases. In the average case ('M'), it is almost 3 times slower than grep
(grep with expression is, as I expected, a bit better than grep with a block, as the latter has an overhead of entering/exiting a lexical block). Even in its best case, any
is still almost twice as slow as grep
. So much for saving. The manually written loop is, in the average case, about a third faster than grep
. But, if the item isn't found, it is actually slower.
No surprise that, if you really want a high speed test, and you need to test against the same array often, it is best to prepare a hash and simply test if the item is in the hash.
What I found rather surprising, is that populating a hash and using it once ('temphash'), isn't such a bad performer.
For completeness' sake, here's the same benchmark on Strawberry Perl 5.10.
Search for Z Rate Time any temphash foreach grepE grepB hash any 46995/s 21.28õs -- -60% -62% -65% -67% -98% temphash 117471/s 8.513õs 150% -- -6% -13% -19% -96% foreach 125313/s 7.98õs 167% 7% -- -7% -13% -95% grepE 135294/s 7.391õs 188% 15% 8% -- -6% -95% grepB 144589/s 6.916õs 208% 23% 15% 7% -- -95% hash 2740628/s 0.3649õs 5732% 2233% 2087% 1926% 1795% -- Search for M Rate Time any temphash grepE grepB foreach hash any 63959/s 15.64õs -- -46% -54% -55% -71% -97% temphash 118040/s 8.472õs 85% -- -15% -18% -46% -95% grepE 138646/s 7.213õs 117% 17% -- -3% -36% -94% grepB 143388/s 6.974õs 124% 21% 3% -- -34% -94% foreach 218040/s 4.586õs 241% 85% 57% 52% -- -91% hash 2516377/s 0.3974õs 3834% 2032% 1715% 1655% 1054% -- Search for C Rate Time any temphash grepE grepB foreach hash any 86710/s 11.53õs -- -25% -36% -39% -85% -96% temphash 115756/s 8.639õs 33% -- -15% -19% -79% -95% grepE 135977/s 7.354õs 57% 17% -- -5% -76% -94% grepB 142657/s 7.01õs 65% 23% 5% -- -75% -94% foreach 559639/s 1.787õs 545% 383% 312% 292% -- -76% hash 2315599/s 0.4319õs 2571% 1900% 1603% 1523% 314% -- Search for A Rate Time any temphash grepE grepB foreach hash any 85902/s 11.64õs -- -26% -36% -41% -90% -96% temphash 115752/s 8.639õs 35% -- -14% -21% -87% -95% grepE 134479/s 7.436õs 57% 16% -- -8% -85% -94% grepB 146193/s 6.84õs 70% 26% 9% -- -84% -94% foreach 902933/s 1.108õs 951% 680% 571% 518% -- -62% hash 2405894/s 0.4156õs 2701% 1978% 1689% 1546% 166% -- Search for banana Rate Time any temphash foreach grepE grepB hash any 51088/s 19.57õs -- -56% -61% -69% -71% -98% temphash 116528/s 8.582õs 128% -- -12% -29% -33% -96% foreach 132626/s 7.54õs 160% 14% -- -19% -24% -95% grepE 164236/s 6.089õs 221% 41% 24% -- -6% -94% grepB 175040/s 5.713õs 243% 50% 32% 7% -- -93% hash 2655952/s 0.3765õs 5099% 2179% 1903% 1517% 1417% --
As an aside: I've got the distinct impression that overall, Strawberry Perl 5.10 is slower than ActivePerl 5.10, by about 5-10%. The biggest surprise to me, however, is that grep BLOCK
is no longer faster than grep EXPRESSION
. So weird.
Now, the thing you would test for must not always be that simple. If it is very slow or otherwise expensive, you might still, and rightfully so, want to leave the testing loop early. But, although elegant as a solution, you should not blindly assume that any
is the best solution for every such problem... Though I wish that grep
would be made smarter, and that it either knows to leave the test loop early, or that you could manually do so, for example, with last
.
p.s. The code to convert Rate to Time was kindly supplied by GrandFather on Perlmonks.
p.p.s. I'm sorry about the formatting problem, that "µ" should look like "õ", but even though it is a "õ" character in the text I posted, the journaling system shows it as "µ". It's a bug, I'm sure.
They are all damn fast, and using any
reads much better.
Your reasoning is flawed. You are assuming that there is one positive match in the set, and that the set is small. On such a small set you're right any isn't any faster, but superior algorithms rarely show up on small datasets.
If you would test this with a larger dataset, any would come out much better. On my computer, using the list 1..1000
and a needle of 500, it is in fact faster. The advantage only gets bigger when the haystack gets bigger.
You're right it's not always better, but it is when it matters the most.
Re:Your reasoning is flawed
Smylers on 2009-02-14T08:51:56
I agree. With such a small set no method is going to take an irritatingly long time. Since you have the program handy could you give it a quick tweak to use a bigger array and re-post the results?
I also agree with autarch that
any
reads better, as a way of signalling programmer intent. I'd probably use theany
fromPerl6::Junction
rather thanList::MoreUtils
; any chance you could add that to your testing as well? Thanks!
Can you test how good would perform List::Util::first()?
Re:List::Util::first
bart on 2009-02-15T14:03:44
Well, you won't like it... Using the code:first => sub { my $x = defined first { $_ eq $search } @letters },
first
is always close to 20% slower thanany
...Search for Z
Rate Time first any
first 38653/s 25.87us -- -21%
any 48955/s 20.43us 27% --
Search for M
Rate Time first any
first 49484/s 20.21us -- -22%
any 63620/s 15.72us 29% --
Search for C
Rate Time first any
first 67151/s 14.89us -- -18%
any 81812/s 12.22us 22% --
Search for A
Rate Time first any
first 73125/s 13.68us -- -21%
any 92772/s 10.78us 27% --
Search for banana
Rate Time first any
first 40513/s 24.68us -- -19%
any 50019/s 19.99us 23% --Re:List::Util::first
srezic on 2009-02-15T19:44:42
Your observation seems to be only true for small arrays. In this plot you can see how first() gets better with larger arrays.In this benchmark grep and first were searching for the middle element in the array.