Finding Unused Files

Ovid on 2007-09-10T08:38:54

As part of a code cleanup, we are going to eliminate unused files in a standard Perl/CGI app which has been around for 7 to 10 years. With thousands of files, my first thought was something like this hack:

#!/usr/bin/perl 

use strict;
use warnings;

my @files =  map  { s{^\./}{}; $_ }
    `find . -type f | grep -v CVS`;

chomp(@files);

my $count = @files;
my $curr  = 1;

foreach my $file (@files) {
    print "Processing $file.  $curr out of $count.\n";
    $curr++;
    my $no_ext = $file;
    $no_ext =~ s{^.*?([^./]+)(?:\.\w+)?$}{$1};

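    # Find all files | exclude this file from the list | find files which
    # reference it. Fixed-string matching (-F) keeps '.' and other regex
    # metacharacters in file names from matching more than intended.
    my $command = "find . -type f | grep -vF -e '$file' -e 'text_r5_c2' | xargs grep -lF -e '$no_ext'";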
    unless ( `$command` ) {
        warn $file,$/;
    }
}

It's pretty ugly and *nix-specific, but the basic idea is this:

Find all files
For each file, find all other files (those not matching that file's name)
If none of those other files mentions the bare filename (without path or extension), then we have an orphan file
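
The same three steps can also be sketched in pure Perl, reading each file once instead of shelling out to find and grep for every candidate. This is only a rough illustration, not a drop-in replacement: the sub names bare_name and find_orphans are made up for this example, it strips only the final extension (so a.tar.gz becomes a.tar), and it does a plain substring match on file contents.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Bare name of a path: no directories, no final extension.
# (Hypothetical helper; dotfiles such as .htaccess are left intact.)
sub bare_name {
    my ($path) = @_;
    ( my $name = $path ) =~ s{^.*/}{};
    $name =~ s{(?<=.)\.\w+$}{};
    return $name;
}

# Return the files under $root whose bare name is mentioned in no other file.
sub find_orphans {
    my ($root) = @_;
    my @files;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                # Skip CVS directories entirely, like "grep -v CVS".
                if ( -d $_ && m{(?:^|/)CVS$} ) {
                    $File::Find::prune = 1;
                    return;
                }
                push @files, $_ if -f $_;
            },
        },
        $root
    );

    my %name_of = map { $_ => bare_name($_) } @files;
    my %referenced;    # file path => 1 once some *other* file mentions it

    for my $reader (@files) {
        open my $fh, '<', $reader or next;
        binmode $fh;
        my $text = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        next unless defined $text;

        for my $file (@files) {
            next if $file eq $reader || $referenced{$file};
            $referenced{$file} = 1
                if index( $text, $name_of{$file} ) >= 0;
        }
    }

    my @orphans = sort grep { !$referenced{$_} } @files;
    return @orphans;
}

# Usage: perl orphans.pl /path/to/app
if (@ARGV) {
    print "$_\n" for find_orphans( $ARGV[0] );
}
```

This still compares every file against every other file, so it is quadratic in the number of files, but the disk is only read once per file.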

It seemed reasonable, but setting aside the fact that it's very slow, the obvious problem kicked in: if an entire section of code is no longer being used, its files can still reference one another, so none of them is likely to show up on this list. This app doesn't have a robust enough test suite to figure this out. Time for another strategy.

A coworker suggested grepping the access logs. Now I feel really stupid since this is so obvious. If a file shows up in there, we know we probably want to keep it. If it doesn't, it merits further investigation.
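
A minimal sketch of the access log idea, under some loud assumptions: the logs are in Apache common or combined format, URL paths map directly onto file paths relative to the document root, and no URL decoding is needed. The sub names requested_paths and never_requested are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect every URL path requested in an access log read from $fh.
# Assumes Apache common/combined format, where the request is the first
# quoted field: "METHOD /path?query PROTOCOL".
sub requested_paths {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        next unless $line =~ m{"[A-Z]+ (/[^\s?"]*)};
        my $path = $1;          # query string is already excluded
        $path =~ s{^/+}{};      # make it relative, like the file list
        $seen{$path}++;
    }
    return \%seen;
}

# Files (relative to the document root) that never appear in the log.
sub never_requested {
    my ( $seen, @files ) = @_;
    return grep { !$seen->{$_} } @files;
}

# Usage: find . -type f | sed 's{^\./{{' | perl unrequested.pl access.log
if (@ARGV) {
    open my $log, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $seen = requested_paths($log);
    close $log;
    chomp( my @files = <STDIN> );
    print "$_\n" for never_requested( $seen, @files );
}
```

Files that are never requested directly (libraries, templates, includes) will always land on this list, so it is a list of candidates to investigate rather than a list of files to delete.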

Linux?

Alias on 2007-09-10T14:38:25

If it's a Linux box, there's a chance you can put the atime values to use for once.
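
A rough sketch of what checking atime could look like in Perl. It assumes the filesystem actually records access times (mounts using noatime or relatime will not give reliable values), and the sub name stale_files and the 30-day threshold in the usage line are arbitrary choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Files under $root whose last access time is more than $days days old.
sub stale_files {
    my ( $root, $days ) = @_;
    my $cutoff = time - $days * 24 * 60 * 60;
    my @stale;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $_;
                my $atime = ( stat _ )[8];    # field 8 of stat is atime
                push @stale, $_ if defined $atime && $atime < $cutoff;
            },
        },
        $root
    );
    my @sorted = sort @stale;
    return @sorted;
}

# Usage: perl stale.pl /path/to/app 30
if ( @ARGV == 2 ) {
    print "$_\n" for stale_files(@ARGV);
}
```

Anything that reads a file updates its atime, so the result only means "nothing has read this recently", not "nothing uses this".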

Re:Linux?

Ovid on 2007-09-10T14:48:51

Thought about that, but there are problems there. First, atime values completely fail if anyone's been just perusing the code in vim, yes? (I've been doing a lot of that while learning the code base.) Also, robots such as Googlebot might access old files which haven't been linked to in a while. Of course, I'm not much of an administrator, so if I'm wrong, let me know!