The curious case of File::Find

richardc on 2002-07-19T05:19:06

So there was a London.pm technical meeting last night, earlier today, something like that. Leon gave a short talk about some modules he particularly liked and some he didn't. On his naughty list was File::Find.

Personally, I don't have a problem with File::Find, but I know from my time on irc that people whine about it a lot. My hunch as to why people don't like it is that it makes you use callbacks, which can confuse people. To add insult to confusion, once you're in the callback you have to gyrate oddly with package variables.
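For anyone who hasn't met the callback-plus-package-variables dance, here is a minimal sketch of it. The temporary directory and file names are made up for the demo; find(), $File::Find::name and $File::Find::dir are the stock File::Find interface.

```perl
use strict;
use warnings;
use File::Find;                 # core module
use File::Temp qw(tempdir);     # core module, used only to build a demo tree

# Build a throwaway directory tree so the example is self-contained.
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/music" or die "mkdir: $!";
for my $path ( "$root/music/a.mp3", "$root/music/b.ogg", "$root/notes.txt" ) {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh;
}

my @mp3s;
find(
    sub {
        # File::Find has chdir()ed into the directory being scanned, so:
        #   $_                 is the bare file name
        #   $File::Find::dir   is the directory currently being scanned
        #   $File::Find::name  is "$File::Find::dir/$_"
        return unless -f $_ && /\.mp3\z/;
        push @mp3s, $File::Find::name;
    },
    $root,
);

print "$_\n" for @mp3s;
```

The results have to be smuggled out through a variable the callback closes over, which is a fair part of what people trip on.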

This started me wondering: if so many people don't like the thing, why isn't there an alternative? Of course this took me nowhere, so I just wrote it off as insufficient JFDI in the world.

So now, until I either crack it or get bored, I'm trying to come up with a better way to do what File::Find gives you, and it's not proving terribly easy.

What I'm currently thinking of is something like:
my @found = NewFind->name( '*.mp3' )->type( 'file' )->size( '<1000K' )->find( '.' ); # finds all smallish mp3s

Each method returns an object apart from the find method, which returns a list of things that matched your specification. An or-like operation could look like this:
my $f = NewFind->type( 'file' )->or( NewFind->name( '*.mp3' ), NewFind->name( '*.ogg' ) ); # search for oggs and mp3s
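To make the idea concrete, here is one rough way such a chainable object could be wired up on top of File::Find. Only the NewFind name and the name/type/or/find method names come from the post; the internals (a list of closures, the glob-to-regex translation, the matches helper) are assumptions for illustration, and size is left out to keep it short.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp qw(tempdir);

package NewFind;    # hypothetical implementation, not a released module

sub new {
    my $class = shift;
    return bless { rules => [] }, ref($class) || $class;
}

# Rule-adding methods work as class or object methods and return the object,
# which is what makes the chaining possible.
sub _add {
    my ( $self, $rule ) = @_;
    $self = $self->new unless ref $self;
    push @{ $self->{rules} }, $rule;
    return $self;
}

# Each rule is a closure called with ( $basename, $full_path ).
sub name {
    my ( $self, @globs ) = @_;
    my $re = join '|', map {
        my $g = quotemeta($_);
        $g =~ s/\\\*/.*/g;      # glob * becomes regex .*
        $g =~ s/\\\?/./g;       # glob ? becomes regex .
        $g;
    } @globs;
    return $self->_add( sub { $_[0] =~ /\A(?:$re)\z/ } );
}

sub type {
    my ( $self, $type ) = @_;
    my %test = (
        file      => sub { -f $_[1] },
        directory => sub { -d $_[1] },
    );
    my $check = $test{$type} or die "unknown type '$type'\n";
    return $self->_add($check);
}

# Matches when at least one of the alternative rule objects matches.
sub any {
    my ( $self, @alternatives ) = @_;
    return $self->_add( sub {
        my @args = @_;
        for my $alt (@alternatives) { return 1 if $alt->matches(@args) }
        return 0;
    } );
}
{ no warnings 'once'; *or = \&any; }    # expose it under the name used in the post

# AND is the default: every rule has to accept the file.
sub matches {
    my ( $self, @args ) = @_;
    for my $rule ( @{ $self->{rules} } ) { return 0 unless $rule->(@args) }
    return 1;
}

sub find {
    my ( $self, @dirs ) = @_;
    my @found;
    File::Find::find(
        {   no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                my ($base) = $path =~ m{([^/]+)\z};
                push @found, $path if $self->matches( $base, $path );
            },
        },
        @dirs,
    );
    return @found;
}

package main;

my $root = tempdir( CLEANUP => 1 );
for my $f (qw(a.mp3 b.ogg c.txt)) {
    open my $fh, '>', "$root/$f" or die "open: $!";
    close $fh;
}

my @found = NewFind->type('file')
                   ->or( NewFind->name('*.mp3'), NewFind->name('*.ogg') )
                   ->find($root);
@found = sort @found;
print "$_\n" for @found;
```

Keeping the rules in an array means they run in the order they were chained, so cheap tests can be put in front of expensive ones.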

How does that hit people - rampant wheel reinvention, or a good start? Comments and alternative api suggestions welcome, hopefully we can get to something that will suit how people want to find files.


Yes!

kasei on 2002-07-19T07:06:30

Looks good. Grouping in the search clause might be awkward syntactically, but for fairly simple stuff, I like it. You might also provide a way to pass in a code block that returns a boolean for custom stuff. So, adding to your example, try looking for smallish mp3s that have some specific ID3 tag:

my @found = NewFind->name( '*.mp3' )->type( 'file' )->size( '<1000K' )->custom( sub { ...search for id3 tags here... } )->find( '.' );

"custom" is probably the wrong name for it, but (hopefully) you understand what I'm getting at.

Come to think of it, the search for mp3s OR oggs would be easy with Q::S:

my $f = NewFind->type( 'file' )->name( any('*.mp3', '*.ogg') );

:)

Syntactic sugar for less verbosity

rafael on 2002-07-19T07:12:52

Random thoughts :

NewFind->file() instead of ->type('file')
NewFind->name( '*.mp3', '*.ogg' ) to get a shorter 'or' condition (in this case, can be written as NewFind->name( qr/\.(mp3|ogg)$/ ))
Provide an ->exec( \&command ) hook, similar to the -exec option to find(1) : i.e., gets the pathname as its only parameter, returns true or false.
Think about -prune and finddepth.

Re:Syntactic sugar for less verbosity

richardc on 2002-07-19T09:01:43

NewFind->file() instead of ->type('file')

Yes, I was being very literal in a transliteration of a find(1) example, apart from making it longer of course.

NewFind->name( '*.mp3', '*.ogg' ) to get a shorter 'or' condition (in this case, can be written as NewFind->name( qr/\.(mp3|ogg)$/ ))

I like both of those.

I still think there's need for a form of or. I just can't think of a good example right now.

Good example for 'or'

rafael on 2002-07-19T09:16:54

my $finder = NewFind->or(
  NewFind->name( '*.pl' ),
  NewFind->exec( sub {
    my $file = shift; my $fh;
    if (open $fh, $file) {
      my $shebang = <$fh>;
      close $fh;
      return $shebang =~ /^#!.*\bperl/;
    }
    return 0;
  } ),
);

Re:Syntactic sugar for less verbosity

jdavidb on 2002-07-19T17:01:52

You still need an explicit or when your conditions are not on the same variable. The form shown here works for var == val1 or val2 but not var1 == val1 or var2 == val2. For example, file is greater than 500M or older than 3 days.

Re:Syntactic sugar for less verbosity

richardc on 2002-07-21T11:04:59

I'm sorry, I don't follow. var == val1 or val2 seems to be like name('*.mp3', '*.ogg') and var1 == val1 or var2 == val2 is:
# files greater than 500M or older than 3 days
F->or( F->size( '>500M' ),
       F->age ( '> 3 days' )
     );

As in rafael's "Good example for 'or'" post.

Re:Syntactic sugar for less verbosity

jdavidb on 2002-07-29T13:54:01

Yes, you followed what I was saying. I was giving an example where you have to have an explicit F->or method.

Re:Syntactic sugar for less verbosity

richardc on 2002-07-22T19:17:25

My thoughts about finddepth are to ignore it, as it confuses me.

Can I get a quick show of hands as to whether people will really miss this? If so then I'll find a way to make it work.

Interface

PerfDave on 2002-07-19T10:24:52

Yeah, I thought about doing a "nice" version of File::Find myself. I wanted to tie in a few things as well - the ability to get structured data as well as lists from it, and to cache data.

I'm not sure I like the stream-y interface you've got here - powerful, but violates the KISS principle which makes File::Find such a pain in the arse to use at the moment. I'd pictured more of a hash-based interface, but I hadn't thought of options so much (but I guess they could be done either by regexps or arrays).

my $fh = File::Find::Foo->new ( { start => '.', name => ['*.ogg', '*.mp3'], maxdepth => 2 } );

Note the use of an array for alternation, and the way the parameters are the same as you pass to find(1). I suspect '.' would be a sensible default for the start attribute. The new() command should do the file search and keep it in an internal data structure.

$fh->gettree(); should return a based data structure, while $fh->getlist(); will get a list of filenames relative to start. There should be some kind of refresh() method to run the search again if files have changed.

Re:Interface

richardc on 2002-07-19T11:15:24

I'm not sure I like the stream-y interface you've got here - powerful,

I wasn't either, hence the posting, but then I saw Rafael's beautiful "Good example for 'or'" for which powerful is certainly one of the words I'd use.

By stream-y I assume you mean the chaining of method calls? If you don't like it you can always do it in longhand:

my $f = NewFind->new();
$f->name( '*.mp3', '*.ogg' );
$f->size( '<10000' );
my @potayto = $f->find('.');

my @potahto = NewFind->name( '*.mp3', '*.ogg' )
                     ->size( '<10000' )
                     ->find('.');

but violates the KISS principle

I'm with Jarkko on this one

More seriously, I don't see what's not simple in calling methods on an object.

$fh->gettree(); should return a based data structure, while $fh->getlist(); will get a list of filenames relative to start. There should be some kind of refresh() method to run the search again if files have changed.

Your getlist seems to be my find, your refresh seems to be my find called a second time. gettree is interesting, could you provide a simulated dump of what it would hold?

Re:Interface

djberg96 on 2002-07-19T12:12:53

Whenever I see method chaining, I take the opportunity to point out Robin Houston's Want module. Take a look. Maybe you could use it.

Re:Interface

richardc on 2002-07-19T12:52:39

It's an interesting module, sure enough. I don't really imagine needing to bring the big guns in.

My current plan is to call the module File::Find::Ruleset. This seems to do some of the hard explaining for me, there are two types of methods, those that add rules to the ruleset, and those that ask questions of it.

The methods that add new rules (name, size, exec, or) return the ruleset object, which makes chaining easier. Those that ask questions (find, find_as_superbly_complex_hash) will return what is expected of them. I don't imagine someone would really want to write:

my @files = File::Find::Ruleset->name( '*.mp3', '*.ogg' )
                               ->exec( sub { artist( $_ ) eq 'Black Lace' } )
                               ->find( $HOME )
                               ->find( '/mnt/mp3' );

And after their first run-time error they shouldn't really do it again. Actually, having said that, Want is a fine way to stop that exploding at the user, or at least to throw a more reasonable error message.

btw. I'm really liking the ->exec rule now.

Re:Interface

PerfDave on 2002-07-19T12:13:52

The notation of chaining method calls like that is an unfamiliar one to me at least (see, told you I was crap), and probably to the target audience of the module (viz. those who find all those icky callbacks in File::Find too tricky).

As for the tree, I envisaged some kind of structure depending on what parameters you pass to the find involving some quantities of stat()ing etc., thusly:

$VAR1 = {
      name => '.',
      path => '.',
      abspath => '/home/richardc',
      type => 'directory',
      children => [
            {
              name => 'music',
              path => './music',
              abspath => '/home/richardc/music',
              type => 'directory',
              children => [
                    {
                      name => 'S Club 7 - Party.mp3',
                      path => './music/S Club 7 - Party.mp3',
                      abspath => '/home/richardc/music/S Club 7 - Party.mp3',
                      type => 'file'
                    }
                          ]
            },
            {
              name => 'Procul Harum - A Whiter Shade of Pale.ogg',
              path => './Procul Harum - A Whiter Shade of Pale.ogg',
              abspath => '/home/richardc/Procul Harum - A Whiter Shade of Pale.ogg',
              type => 'file'
            }
                  ]
    }

Basically enough information to reconstruct the file list but letting you walk the tree structure.
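As a quick check that the flat list really can be recovered from such a tree, here is a small sketch. The list_files helper and the shortened file names are made up; the node keys (name, path, type, children) follow the dump above.

```perl
use strict;
use warnings;

# A cut-down tree in the same shape as the dump above.
my $tree = {
    name     => '.',
    path     => '.',
    type     => 'directory',
    children => [
        {   name     => 'music',
            path     => './music',
            type     => 'directory',
            children => [
                { name => 'a.mp3', path => './music/a.mp3', type => 'file' },
            ],
        },
        { name => 'b.ogg', path => './b.ogg', type => 'file' },
    ],
};

# Depth-first walk: a file node yields its path, a directory node yields
# whatever its children yield.
sub list_files {
    my ($node) = @_;
    return $node->{path} if $node->{type} eq 'file';
    return map { list_files($_) } @{ $node->{children} || [] };
}

my @files = list_files($tree);
print "$_\n" for @files;
```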

Re:Interface

Thomas on 2002-07-19T15:10:12

It was almost the same thing I came up with when thinking about it.

I'd like to see something like this:

my @files = find('/tmp', { maxdepth => 10, mindepth => 5, name => qr/\.*\.c$/ });

(Which I btw already have working in a small example I hacked together.) I really like your idea of using arrays for alternation, but how do you decide whether to AND or OR? (AND doesn't make much sense in your example though).

Instead of returning the files I'd also consider an 'exec' like option that took a sub ref or function and just passed the filename to it as the first argument.

That is about all I'd need from a Perl find.

Re:Interface

Thomas on 2002-07-19T16:42:14

I've made a quick implementation of what I'd like to see File::Find offer instead. I think it might be a good idea to do a real print function and instead call the one I've used 'return'.

It is of course not nearly done, but if anyone feels they like the concept and wants to use it, please feel free to do so.

package Find;

use strict;
use vars qw($VERSION @EXPORT @EXPORT_OK %EXPORT_TAGS @ISA);
require Exporter;

@ISA = qw(Exporter);
@EXPORT = qw(find);
@EXPORT_OK = qw(find);

$VERSION = '0.1';

sub find {
        my $dir = shift || die "Need a directory to start at\n";
        die "Not a directory\n" unless -d $dir || ref($dir) eq 'ARRAY';
        my $opts = shift || {};
        $opts->{print} = defined($opts->{print}) ? $opts->{print} : 1;
        my $depth = 1;
        my @files = ();
        if (ref($dir) eq 'ARRAY') {
                traverse_dir($_, $depth, $opts, \@files) for @{$dir};
        } else {
                traverse_dir($dir, $depth, $opts, \@files);
        }
        return @files;
}

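sub traverse_dir {
        my $dir_name = shift;
        my $depth = shift;
        my $opts = shift;
        my $files_ref = shift;
        # Stop descending once we are deeper than maxdepth.
        if (defined($opts->{maxdepth}) && $opts->{maxdepth} < $depth) {
                return;
        }
        # A lexical handle keeps recursive calls from clobbering each other;
        # unreadable directories are skipped.
        opendir(my $dh, $dir_name) || return;
        my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
        closedir($dh);
        for my $file (@entries) {
                $opts->{_ok} = 1;
                if (defined($opts->{name})) {
                        if ($file !~ /$opts->{name}/) {
                                $opts->{_ok} = 0;
                        }
                }
                if (defined($opts->{mindepth})) {
                        if ($depth < $opts->{mindepth}) {
                                $opts->{_ok} = 0;
                        }
                }
                if ($opts->{_ok}) {
                        if ($opts->{print}) {
                                push @{$files_ref}, $dir_name . "/" . $file;
                        }
                        if (defined($opts->{'exec'})) {
                                # 'exec' holds a reference to a code reference (\sub { ... }).
                                &${$opts->{exec}}($dir_name . "/" . $file);
                        }
                }
                traverse_dir($dir_name . "/" . $file, $depth + 1, $opts, $files_ref) if -d $dir_name . "/" . $file;
        }
}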

1;

__END__

=head1 NAME

Find - find files in filesystem

=head1 SYNOPSIS

        use Find;

        find('/tmp', { 'exec' => \sub { print $_[0] . "\n"; }});

=head1 OPTIONS

        find('directory', { .. options .. });
        find(['directory1','directory2'], { .. options .. });

        It currently has the following options

=head2 print

        print in this context refers to whether or not to return an array of
        the files found.
       
        for (find('/tmp', { 'print' => 1 })) {
                print $_ . "\n";
        }

        Would print a list of files found in the '/tmp' directory.
        print is set to return a list of files by default.

=head2 exec

        exec is an anonymous reference to a function to call which gets passed
        as the only argument the filename matching any other conditions.

        find('/tmp', { 'exec' => \sub { print $_[0] . "\n"; } });

=head2 name

        name is a regular expression against which the file name is matched. If
        the file name matches the expression, the other tests on the file are
        performed.

        find('/tmp', { name => qr/\.c$/, 'exec' => \sub { print $_[0] . "\n";} });

=head2 maxdepth

        maxdepth is the maximum number of directories to descend into.
 
=head2 mindepth

        mindepth is the minimum number of directories where matching starts.

=cut

Re:Interface

richardc on 2002-07-21T10:55:29

my @files = find('/tmp', { maxdepth => 10, mindepth => 5, name => qr/\.*\.c$/ });

Warning: hashes don't have a predictable order, and evaluating the rules in a known order can have a huge effect on efficiency. Please look at rafael's "Good example for 'or'" comment: if the exec happened first you'd really feel the hit, because you'd lose the short-circuit that happens at name('*.pl').
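A toy illustration of the ordering point, under the assumption that an 'or' tries its alternatives in sequence and stops at the first hit. The matches_any helper, the file names and the counter are all invented for the demo; the costly test just stands in for opening the file to read its shebang line.

```perl
use strict;
use warnings;

# Return true as soon as one rule accepts the file (short-circuit OR).
sub matches_any {
    my ( $file, @rules ) = @_;
    for my $rule (@rules) { return 1 if $rule->($file) }
    return 0;
}

my $opens  = 0;
my $cheap  = sub { $_[0] =~ /\.pl\z/ };              # looks at the name only
my $costly = sub { $opens++; $_[0] eq 'script' };    # pretend this opens the file

my @files = qw(a.pl b.pl c.pl script README);

# Cheap test first: the costly one only runs for names that aren't *.pl.
matches_any( $_, $cheap, $costly ) for @files;
my $opens_cheap_first = $opens;

# Costly test first: it runs for every single file.
$opens = 0;
matches_any( $_, $costly, $cheap ) for @files;
my $opens_costly_first = $opens;

print "cheap first: $opens_cheap_first, costly first: $opens_costly_first\n";
```

With rules held in a hash there is no way to say which of the two orders you get.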

(Which I btw already have working in a small example I hacked together.)

Please, don't let me stop you releasing your code to CPAN. Some choice in this area is good, certainly considering the current state of affairs, where people just complain without writing an alternative.

For fun, mine's here, note that it still needs a buttload of work on the documentation. Later today I was planning on adding some code to make this work:

my @mp3s = find_luddite( name => '*.mp3', file )->('.');

But maybe find_luddite isn't the best name :)

I really like your idea of using arrays for alternation, but how do you decide whether to AND or OR? (AND doesn't make much sense in your example though).

AND is the default.

Finder->files
      ->name( '*foo*' );

Means they have to be files, AND they have to have names that match *foo*.

Instead of returning the files I'd also consider an 'exec' like option that took a sub ref or function and just passed the filename to it as the first argument.

Check. Well actually I pass the filename, the directory, and the full path, but that's because you never know what'll really be useful.

That is about all I'd need from a Perl find.

Bzzt! About all I need from a find interface is File::Find. This is a deeply bogus argument, and it's left us where we are since the dawn of time.

In writing this code I'm trying to anticipate what other users will want. You're one of them, but you're not the only one; otherwise I'd not even have thought about making the code usable as an iterator, so you'll be able to write:

$finder->find_iteratively('.');
while (my $file = $finder->iterate) {
     ...
}

Warning: routines not really called find_iteratively and iterate. They don't have names right now.

Currently the code is a wrapper around File::Find, so the iterator behavior is faked, but if I ever get brave enough to implement all the cross-platform stuff for myself then that'll be a proper iterator. And then there will be rejoicing in the streets. Oh yes.
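For comparison, a non-faked iterator doesn't need much if you ignore the cross-platform worries. This is only a sketch under that assumption: make_file_iterator is a made-up name, it uses plain "/" joins, and it doesn't descend into symlinked directories.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Returns a closure that yields one path per call and undef when the tree
# is exhausted. Directories are expanded lazily, only when they are reached.
sub make_file_iterator {
    my @queue = @_;
    return sub {
        my $path = shift @queue;
        return unless defined $path;
        if ( -d $path && !-l $path ) {
            # Unreadable directory: still report it, just don't descend.
            opendir my $dh, $path or return $path;
            my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
            closedir $dh;
            push @queue, map { $path . '/' . $_ } sort @entries;
        }
        return $path;
    };
}

# Demo tree: root/sub/one.txt and root/two.txt
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir: $!";
for my $f ( "$root/sub/one.txt", "$root/two.txt" ) {
    open my $fh, '>', $f or die "open: $!";
    close $fh;
}

my $next = make_file_iterator($root);
my @seen;
# defined() guards against a file that happens to be called "0".
while ( defined( my $path = $next->() ) ) {
    push @seen, $path;
}
print "$_\n" for @seen;
```

Because the queue is only filled as directories are reached, the caller can stop early without the whole tree having been read.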

Re:Interface

richardc on 2002-07-21T16:41:18

I thunk it out a bit more, and now it's possible to do this:
# extract from the test suite
# procedural form of the prune CVS demo
use File::Find::Ruleset 'find_ruleset';
$f = find_ruleset(or => [ find_ruleset( directory =>
                                        name      => 'CVS',
                                        prune     =>
                                        discard   => ),
                          find_ruleset() ]);
my @things = $f->find('.');

Yes, it's still calling a method, but if you don't ever want to see an arrow you can do that too, at a slight performance cost:

my @mp3s = find_ruleset( file => name => '*.mp3', find => '.');

Iterators?

geoff on 2002-07-19T12:02:26

MJD presents a File::Find replacement in his Programming with Iterators and Generators tutorial (at least he did when he gave his practice session at phl.pm). Perhaps his approach is worthy of investigating as well.

and and and not

wickline on 2002-07-19T18:25:40

You need both OR and AND for sufficiently complex queries...

    (A or B) and ( (C and D) OR E )

    You could use chaining, but you'd need
    two objects in the above, and the logic
    might be clearer if you just had an AND

you may want to include both OR and AND and NOT as well
because some folks will find it easier to compose

    (A or B) and NOT( C or D or E)
than
    (A or B) and ( NOT C and NOT D and NOT E)

...especially when C, D, E may be moderately complex.

Maybe that's overkill though.

To ensure overkill, also include methods for those who
think in terms of set operations (Union, Intersection,
and Difference). These could just be different names
for the same code.

    Union(A,B)      =>  A OR B
    Intersect(A,B)  =>  A AND B
    Difference(A)   =>  NOT A

going off the deep end, you could overload some operators

    Union(A,B)      =>  A OR B   =>  $queryA + $queryB
    Intersect(A,B)  =>  A AND B  =>  $queryA * $queryB
    Difference(A)   =>  NOT A    =>  !$queryA

The result of each of the above would be a new query object.
Heck, you could throw in subtraction while you're at it

    $qc = $qA - $qB  #   C = A AND NOT B
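A sketch of how the overloading could look using the core overload pragma. The Query class, its test method and the sample predicates are invented for illustration; the operator mapping (+ for OR, * for AND, ! for NOT, - for AND NOT) is the one listed above.

```perl
use strict;
use warnings;

package Query;    # hypothetical query object wrapping a predicate

use overload
    '+' => sub { my ( $l, $r ) = @_; Query->new( sub { $l->test(@_) || $r->test(@_) } ) },
    '*' => sub { my ( $l, $r ) = @_; Query->new( sub { $l->test(@_) && $r->test(@_) } ) },
    '!' => sub { my ($q) = @_; Query->new( sub { !$q->test(@_) } ) },
    '-' => sub {
        my ( $l, $r, $swapped ) = @_;
        ( $l, $r ) = ( $r, $l ) if $swapped;
        Query->new( sub { $l->test(@_) && !$r->test(@_) } );
    };

sub new {
    my ( $class, $code ) = @_;
    return bless { code => $code }, $class;
}

# Run the wrapped predicate and normalise the answer to 1 or 0.
sub test {
    my $self = shift;
    return $self->{code}->(@_) ? 1 : 0;
}

package main;

my $mp3  = Query->new( sub { $_[0] =~ /\.mp3\z/ } );
my $ogg  = Query->new( sub { $_[0] =~ /\.ogg\z/ } );
my $live = Query->new( sub { $_[0] =~ /live/ } );

my $audio  = $mp3 + $ogg;       # mp3 OR ogg
my $studio = $audio - $live;    # audio AND NOT live
my $other  = !$audio;           # NOT audio

for my $file (qw(a.mp3 live.mp3 b.ogg notes.txt)) {
    printf "%-10s audio=%d studio=%d other=%d\n",
        $file, $audio->test($file), $studio->test($file), $other->test($file);
}
```

Every operator returns a fresh Query, so the pieces can be combined again without changing the originals.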

time to stop rambling :)

-matt

How to find real Link names with File::Find::Rule

venki on 2007-04-09T12:54:36

If there is a link to a directory, File::Find::Rule gives us the actual names and not the link names. Is there any way to get the link names as well as the real names for directories? We do get the links for file names, of course.
Ex: abc -> cde in directory /test/, and there is a file /test/cde/2222.
When we do File::Find::Rule->new()->name(qr/\d\d\d\d/)->in("/test/"), we get only /test/cde/2222.
Is there any way we can get /test/abc/2222 also?

Re:How to find real Link names with File::Find::Ru

runrig on 2007-04-09T21:13:02

You need to include the File::Find 'follow' option, which you can do in FFR with the extras() method.
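Here is a sketch of what the 'follow' option does, using plain File::Find so it only needs core modules. The directory layout is made up and deliberately simpler than the one in the question: the link points outside the searched tree, so File::Find's handling of files reached twice doesn't come into play. With File::Find::Rule the same option would, as far as I recall from its documentation, be passed as ->extras({ follow => 1 }).

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Layout (hypothetical):  base/real/2222   and   base/test/abc -> base/real
my $base = tempdir( CLEANUP => 1 );
mkdir "$base/real" or die "mkdir: $!";
mkdir "$base/test" or die "mkdir: $!";
open my $fh, '>', "$base/real/2222" or die "open: $!";
close $fh;
symlink "$base/real", "$base/test/abc" or die "symlink: $!";

my @found;
find(
    {   follow => 1,    # descend into directories reached through symlinks
        wanted => sub {
            # $File::Find::name keeps the path as reached through the link.
            push @found, $File::Find::name if /\A\d{4}\z/;
        },
    },
    "$base/test",
);

print "$_\n" for @found;
```

Without follow => 1 the link abc is reported as an entry but never descended into, so the file behind it is not found at all.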