When declarative isn't really declarative

Alias on 2008-05-14T11:09:18

I'm a huge fan of File::Find::Rule, of the elegant compactness with which you can declaratively set up a file search, and of the ease with which you can extend it.

I even wrote three plugins for it: File::Find::Rule::VCS for skipping CVS/.svn/.bzr directories, File::Find::Rule::Perl for searching for Perl files/scripts/modules/tests/etc, and File::Find::Rule::PPI for integrating PPI-based content searching.

And yet in the 5-10 years I've been using F:F:R, I've been completely seduced by the elegant layout into not just using it with declarative syntax, but THINKING about it declaratively.

And the thing is, it's NOT declarative at all.

This mistake is so common it has even been made by the author himself, and is encoded directly in the canonical SYNOPSIS examples. Like so many others, I probably cut and pasted the SYNOPSIS the first time I used it, and I've been clueless ever since.

Now that I've told you a problem exists, have a look at this snippet from the docs, and the problem should be obvious.

    # Find all the .pm files in @INC
    my @files = File::Find::Rule->file
                                ->name( '*.pm' )
                                ->in( @INC );

This is the obvious way to write that sort of search, because it reads extremely clearly. "Find the files named .pm in somewhere". And because it reads in human terms, we tend to think of it in human terms.

But just look at the order of the rules there.

What this search REALLY says is "Find every single file in all these trees, then do a slow I/O stat call to the operating system on every single one to work out which ones are files, and only then do a quick regex match on the file names to keep the 5% that have the ending we want and throw away the 95% that don't".

The result is that even in fairly tame cases we're probably doing 10 or 20 or 100 or even 1000 times as many file system calls to the operating system as we need to.
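To make the effect concrete, here is a minimal sketch using only core modules rather than File::Find::Rule itself. The helpers `is_file` and `search` are names I've made up for the illustration, and the counter only tracks the explicit `-f` tests the rule makes, not whatever File::Find does internally while walking the tree.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp ();
use File::Spec ();

# Build a small throwaway tree: 2 .pm files and 8 .txt files.
my $dir = File::Temp::tempdir( CLEANUP => 1 );
for my $name ( ( map { "mod$_.pm" } 1 .. 2 ), ( map { "note$_.txt" } 1 .. 8 ) ) {
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

# Hypothetical helper: counts how often we ask "is this a plain file?"
my $file_tests = 0;
sub is_file { $file_tests++; return -f $_[0] }

# Hypothetical helper: runs one search, returns (match count, file tests made).
sub search {
    my ($wanted) = @_;
    $file_tests = 0;
    my @found;
    File::Find::find( sub { push @found, $File::Find::name if $wanted->($_) }, $dir );
    return ( scalar @found, $file_tests );
}

# Same two conditions, opposite order.
my ( $n1, $stat_first ) = search( sub { is_file( $_[0] ) && $_[0] =~ /\.pm\z/ } );
my ( $n2, $name_first ) = search( sub { $_[0] =~ /\.pm\z/ && is_file( $_[0] ) } );

print "stat first: $n1 matches, $stat_first file tests\n";
print "name first: $n2 matches, $name_first file tests\n";
```

Both orderings find the same two files, but the name-first version only runs the file test on the entries whose names already matched.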

And I'd be willing to bet that this mistake is endemic in code that uses File::Find::Rule.

There is an IMMENSE performance difference between ->file->name and ->name->file.
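For the SYNOPSIS example, the reordered search would look something like the sketch below. The `@dirs` filtering is my own addition (it just drops @INC hooks and missing directories), and how much faster this actually runs will depend on your file system and the size of the trees, so it's worth timing both versions yourself.

```perl
use strict;
use warnings;
use File::Find::Rule;

# Skip @INC entries that are hooks (code refs) or directories that do not exist.
my @dirs = grep { !ref($_) && -d $_ } @INC;

# Cheap name match first; the stat-based ->file test then only runs
# on entries whose name already ends in .pm.
my @files = File::Find::Rule->name( '*.pm' )
                            ->file
                            ->in( @dirs );

printf "%d .pm files found\n", scalar @files;
```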

Looking through my own repository, I've already found 5-10 places where I've made this mistake, including in two of the three plugins I wrote.

File::Find::Rule::VCS has the appalling behaviour of checking every one of your files to see if it is a directory, before checking whether it has the relatively rare name ".svn" or ".bzr".

And those 5-10 are just the obvious ones I could find with a simple string search. I'm sure there are more of them if I look a bit more carefully.

I'm actually wondering if the situation is bad enough to warrant putting an optimization hack directly into File::Find::Rule itself, so that it will bubble up a name rule above any stat-based rules immediately above it, but that makes me a little nervous since this code is so universally used.
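As a rough sketch of what I mean by "bubble up", assuming a purely hypothetical rule list where each rule is a `[ kind, label ]` pair (this is NOT how File::Find::Rule stores its rules internally, and `bubble_names` is an invented name):

```perl
use strict;
use warnings;

# Hypothetical rule list: each rule is [ kind, label ], where kind is
# 'name', 'stat', or 'other'. Illustration only, not F:F:R internals.
sub bubble_names {
    my @rules = @_;
    for my $i ( 1 .. $#rules ) {
        next unless $rules[$i][0] eq 'name';

        # Walk the name rule upward past each stat rule directly above it,
        # stopping at anything that is not a simple stat rule.
        my $j = $i;
        while ( $j > 0 && $rules[ $j - 1 ][0] eq 'stat' ) {
            @rules[ $j - 1, $j ] = @rules[ $j, $j - 1 ];
            $j--;
        }
    }
    return @rules;
}

my @before = (
    [ stat  => 'file' ],
    [ name  => '*.pm' ],
    [ other => 'prune' ],
    [ stat  => 'readable' ],
);
my @after = bubble_names(@before);

print join( ' -> ', map { $_->[1] } @after ), "\n";
```

The name rule only ever moves past stat rules that sit immediately above it, so anything else in the chain keeps its position relative to the rules around it.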

Are there any cases where moving a name match before a stat match could go horribly horribly wrong?


Thanks for posting

petdance on 2008-05-14T14:19:01

Excellent points, Adam. I'm looking at making sure that File::Next is doing regexes before stat.

short circuit ops

mpeters on 2008-05-14T14:29:11

Basically the rules are a bunch of "ands" or short-circuit operators, right? So the order doesn't matter for anything but execution speed. So if you have

    common_but_lengthy() && rare_but_short()

It should be re-written as

    rare_but_short() && common_but_lengthy()

Not because it's any more correct, but because, as you point out, it saves time. So switching the order of how F::F::R does things shouldn't affect the correctness, right?

Re:short circuit ops

Alias on 2008-05-15T02:47:15

There are some subtleties involved for pruning, purging, boolean combinations etc.

But I _THINK_ that if you have a simple stat rule, immediately followed by a simple name rule, that case can always be swapped.