samefile

cog on 2004-12-25T22:09:17

I'm cleaning up my hard disk.

Here's one of the tasks I came across: a directory with a bunch of files (almost 200), some of which I know are identical; I just don't know which ones.

Here's samefile, the script I created to find those files:

#!/usr/bin/perl
use strict;
use warnings;
use File::Compare;
use Getopt::Std;

our %opts = get_options();

show_help()             if $opts{h};
show_version()          if $opts{V};

get_sizes();
find_copies();

# subroutines

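sub find_copies {
  our %sizes;
  # only files of the same size can be identical, so compare within each size group
  for (values %sizes) {
    my @files = @{$_};

    # defined() keeps a file literally named "0" from ending the loop early
    while (defined(my $f = shift @files)) {
      @files || next;
      # compare() returns 0 when both files have the same contents
      my @copies = grep {! compare($f, $_)} @files or next;
      print "\"$f\"", (map {" \"$_\""} @copies), "\n";
      # drop the copies just reported so they are not listed again
      for my $s (@copies) {@files = grep {$_ ne $s} @files}
    }
  }
}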

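sub get_sizes {
  our %sizes;
  # @ARGV grows while we loop: in recursive mode each directory
  # appends its entries to the end of the list
  for (@ARGV) {
    if (-f) {
      # element 7 of stat is the file size in bytes
      push @{$sizes{(stat)[7]}}, $_;
    }
    elsif ($opts{r} && -d) {
      push @ARGV, <"$_/*">;
    }
  }
}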

sub get_options {
  my %opts;
  getopts('rhV', \%opts );

  for my $key ( keys %opts ) {
    $opts{$key} = 1 unless defined $opts{$key}
  }

  %opts;
}

sub show_help {
  die "Usage: samefile file1 file2
 or:   samefile -r *
samefile: identifies equal files

Options:
  -h         display this message and exit
  -r         recursive mode
  -V         show version and exit
"
}

sub show_version {
  die "samefile version 0.01\n";
}

It currently prints something like this:


$ samefile *
"file1" "file3" "file6"
"file4" "file5"

This means that file1, file3 and file6 are identical, and so are file4 and file5.

Comments on the output or anything else are welcome...


Re:

Aristotle on 2004-12-26T08:12:50

dupmerge?

Re:

Aristotle on 2004-12-26T08:34:21

Thoughts on your code:

File::Compare already compares file sizes for you, so there's no need for you to collect them yourself.

You never check for symbolic links and you are missing the opportunity to compare inodes. If two filenames are links (hard or symbolic) to the same file, there's no need to compare the file to itself.
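
A minimal sketch of the inode check, assuming a Unix-like filesystem where stat reports meaningful device and inode numbers. The name same_inode is made up for illustration and is not part of samefile:

```perl
use strict;
use warnings;

# Returns true when both paths refer to the same underlying file,
# i.e. they share a device number and an inode number.
# stat() follows symbolic links, so a symlink and its target match too.
sub same_inode {
    my ($path_a, $path_b) = @_;
    my @a = stat($path_a) or return 0;   # empty list: path could not be stat'ed
    my @b = stat($path_b) or return 0;
    # element 0 is the device number, element 1 the inode number
    return $a[0] == $b[0] && $a[1] == $b[1];
}
```

Inside find_copies, something like grep { same_inode($f, $_) || !compare($f, $_) } @files would then skip the byte-by-byte comparison for names that are links to the same file.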

Another reason checking for symlinks is important is that if your code encounters a symlink to .. while recursing, it will just sit there twiddling thumbs. It's probably easier to leave recursion to File::Find.
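
Roughly what handing the recursion to File::Find could look like. This is a sketch rather than a drop-in replacement for get_sizes; sizes_by_find is a hypothetical name, and it returns a hash reference instead of filling the global %sizes:

```perl
use strict;
use warnings;
use File::Find;

# Collects regular files under the given paths, grouped by size in bytes.
# File::Find does not follow symbolic links unless asked to, and the -l
# test skips symlinked files, so a link to ".." cannot cause a loop.
sub sizes_by_find {
    my @roots = @_;
    my %sizes;
    find({
        no_chdir => 1,                  # $_ holds the full path of each entry
        wanted   => sub {
            return if -l $_;            # ignore symbolic links
            return unless -f $_;        # keep regular files only
            push @{ $sizes{ -s $_ } }, $_;
        },
    }, @roots);
    return \%sizes;
}
```

Calling sizes_by_find(@ARGV) in recursive mode would give the same kind of size-to-filenames mapping that find_copies walks over.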

Invoking compare() with the same file as source over and over is a horrible waste of work. You should hash your files first (Digest::SHA1 or Digest::MD5 are handy here), and then compare their hash values. That way you can compare a file to ten other files with hardly any work — could speed things up by orders of magnitude. (If you are paranoid, you can hash the files twice with different algorithms, and then the probability that different files will hash to identical values goes from “very close to zero” to “vanishingly close to zero for this universe”.)
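
To make the hashing idea concrete, here is a sketch using Digest::MD5. The subroutine names file_digest and duplicate_groups are invented for this example; each file is read once, and files that share a digest are treated as copies:

```perl
use strict;
use warnings;
use Digest::MD5;

# Returns the hex MD5 digest of a file's contents, or undef if unreadable.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Groups paths by digest and returns only the groups with two or more
# members, each group as an array reference of file names.
sub duplicate_groups {
    my @paths = @_;
    my %by_digest;
    for my $path (@paths) {
        my $digest = file_digest($path);
        next unless defined $digest;    # skip files that could not be read
        push @{ $by_digest{$digest} }, $path;
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

Each array reference returned by duplicate_groups(@files) could be printed as one line of output. Hashing only the files that already share a size would avoid reading files that cannot have a copy.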

Re:

cog on 2004-12-26T20:35:10

Everything you say makes sense :-)

I had no links, so that was not a problem. I managed to shrink 3.7G of a Portuguese TV show down to 2.8G with this... awesome :-)

I'll take a look at dupmerge, as you suggest (it might end up in my ~/bin)

Thanks :-)