We need a *really* fast find library.

schwern on 2003-10-21T01:33:37

On OS X there's a little package management tool called fink. It's no Debian, but it's OK. Listing the available packages takes a few seconds longer than I'd like on my iBook, so I decided to do a little profiling to figure out what was being slow.

Turns out it was two things. One was their version comparison function: over 90% of its calls were repeats, so I memoized it. That knocked out about 25% of the runtime.
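Memoizing a pure function like a version comparison is nearly a one-liner with the core Memoize module. A minimal sketch, assuming a stand-in comparison function (this is not fink's actual code, just an illustration of the technique):

```perl
use strict;
use warnings;
use Memoize;

# Stand-in for fink's version comparison: pure and deterministic,
# so repeated calls with the same arguments can be cached safely.
my $calls = 0;
sub version_cmp {
    my ($x, $y) = @_;
    $calls++;
    # Naive dotted-version comparison, for illustration only.
    my @x = split /\./, $x;
    my @y = split /\./, $y;
    while (@x or @y) {
        my $cmp = (shift(@x) // 0) <=> (shift(@y) // 0);
        return $cmp if $cmp;
    }
    return 0;
}

memoize('version_cmp');   # replace version_cmp with a caching wrapper

version_cmp("1.2.3", "1.10") for 1 .. 1000;   # 1000 calls...
print "underlying function ran $calls time(s)\n";  # ...but the body ran once
```

If most calls are repeats, as they were here, the cached wrapper turns those calls into hash lookups.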

The other was a function that simply looks for new .info files in a big directory tree to see whether the package index cache has to be updated. With the version comparison optimized, this was sucking down about 35% of the runtime. It was just a simple recursive directory crawl: read a directory, recurse into subdirectories, look for a new .info file.

    sub search_comparedb {
        my $path = shift;
        my (@files, $file, $fullpath, @stats);

        # FIXME: should probably just check dirs of $config->get_treelist()
        opendir(DIR, $path) || die "can't opendir $path: $!";
        @files = grep { !/^[\.#]/ } readdir(DIR);
        closedir DIR;

        foreach $file (@files) {
            $fullpath = "$path/$file";

            if (-d $fullpath) {
                next if $file eq "binary-$debarch";
                next if $file eq "CVS";
                return 1 if (&search_comparedb($fullpath));
            }
            else {
                next if !(substr($file, length($file)-5) eq ".info");
                @stats = stat($fullpath);
                return 1 if ($stats[9] > $db_mtime);
            }
        }
        return 0;
    }

Pretty straightforward. A few small touch-ups could be done, but it's basically OK code. And it was sucking down a third of the runtime. This is over directory trees containing 2700 files and about 50 non-ignored directories. Just 50 calls to this function ate nearly a second of runtime on my iBook.

I could have tried unrolling the recursion, but I never really learned how to do that (I'd be interested to be shown how it's done). Anyhow, I did the simple thing:

    sub search_comparedb {
        my $path = shift;
        $path .= "/";  # forces find to follow the symlink

        # Using find is much faster than doing it in Perl
        return (grep !m{/(CVS|binary-$debarch)/},
                `/usr/bin/find $path -type f -name '*.info' -newer $basepath/var/db/fink.db`)
            ? 1 : 0;
    }

find. We can get away with this because the code only runs on OS X, where /usr/bin/find always exists. How much faster was this? So much faster that the subroutine completely fell off the profiling radar.
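As an aside on the recursion-unrolling question above: the usual trick is to manage the to-do list yourself with an explicit stack, pushing subdirectories instead of making recursive calls. A rough sketch of the same crawl in that style (illustrative, not fink's code; the binary-$debarch skip is omitted for brevity):

```perl
use strict;
use warnings;

sub search_comparedb_iterative {
    my ($root, $db_mtime) = @_;
    my @stack = ($root);               # directories still to visit

    while (my $path = pop @stack) {
        opendir(my $dh, $path) or die "can't opendir $path: $!";
        my @files = grep { !/^[\.#]/ } readdir($dh);
        closedir $dh;

        for my $file (@files) {
            my $fullpath = "$path/$file";
            if (-d $fullpath) {
                next if $file eq "CVS";
                push @stack, $fullpath;  # "recurse" later, no call stack used
            }
            elsif ($file =~ /\.info$/) {
                return 1 if (stat($fullpath))[9] > $db_mtime;
            }
        }
    }
    return 0;
}
```

Same semantics, but the depth of the tree no longer grows the Perl call stack; whether it's actually faster would need benchmarking.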

    Benchmark: timing 100 iterations of find, orig...
          find:  9 wallclock secs ( 0.19 usr  0.00 sys +  1.34 cusr  4.58 csys =  6.11 CPU) @ 526.32/s (n=100)
          orig: 49 wallclock secs (21.28 usr +  0.00 sys = 21.28 CPU) @  4.70/s (n=100)

God damn. Somebody should write some sort of Perl module wrapper around a find C library.

What was the next bottleneck? Oddly enough, Storable::retrieve. It's the top time killer now, gulping down 40% of the runtime loading up that index file we just checked to see if it needed updating. But I'd already nearly doubled the performance, so I stopped there.
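For reference, the pattern in play here is the classic Storable disk cache: freeze a data structure once, then deserialize the whole thing in one call on later runs. A minimal sketch (the cache path and index contents are made up, not fink's real layout):

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

my $dir   = tempdir(CLEANUP => 1);
my $cache = "$dir/fink.db";          # stand-in path for the package index

# Build the index once and freeze it to disk in portable byte order...
my %index = (perl => "5.8.1", wget => "1.8.2");
nstore(\%index, $cache);

# ...then later runs thaw the entire structure in a single call.
# This is the Storable::retrieve that showed up in the profile.
my $thawed = retrieve($cache);
print "perl => $thawed->{perl}\n";
```

Fast compared to reparsing text, but for a big index it still has to rebuild the whole structure in memory, which is presumably where that 40% goes.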


Shameless plug

ethan on 2003-10-21T07:24:12

Not sure whether MacOSX has a locate utility. If so, maybe this could be useful (it can't be installed via the CPAN module, however, since an ancient 0.50 version by another author existed at some time).

It wraps the C code from the GNU findutils into some XS to natively query an existing locatedb.

Re:Shameless plug

schwern on 2003-10-21T13:24:07

Not sure whether MacOSX has a locate utility.


It does.

It wraps the C code from the GNU findutils into some XS to natively query an existing locatedb.


Nice, but fraught with all the problems of locate, i.e. it'll be out of date. Though maybe I can crib some code out of it.

faster...

trachtenberga on 2003-10-21T22:51:30

I bet you could do some "! -path '*/CVS/*' -and ! -path '*/binary-$debarch/*' -and ..." kung-fu to eliminate the Perl grep and make it even faster.
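An untested sketch of what that might look like, using a throwaway tree and a literal binary-powerpc directory standing in for binary-$debarch (both GNU and BSD find support -path; -and is the default between predicates, so it can be omitted):

```shell
# Build a tiny tree to demo filtering inside find itself (illustrative paths).
mkdir -p /tmp/finddemo/pkg /tmp/finddemo/CVS /tmp/finddemo/binary-powerpc
touch /tmp/finddemo/pkg/foo.info /tmp/finddemo/CVS/old.info \
      /tmp/finddemo/binary-powerpc/arch.info

# The -path predicates exclude matches inside find, so no Perl grep is needed.
find /tmp/finddemo -type f -name '*.info' \
    ! -path '*/CVS/*' ! -path '*/binary-powerpc/*'

rm -rf /tmp/finddemo
```

This prints only /tmp/finddemo/pkg/foo.info. For large excluded trees, -prune on the directories would also stop find from descending into them at all.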

When fast is fast enough.

schwern on 2003-11-06T14:04:55

Perl's grep isn't slow. Anyway, the subroutine doesn't even show up in profiling anymore, so I'm not going to invest any more time in optimizing it.