Finding Duplicate Files

Ovid on 2009-10-06T08:52:03

I'm trying to find which files in one directory also exist in another directory. The following works, but I assume there's an easier way?

for file in `ls aggtests/pips/api/v1/xml/*.t | cut -d/ -f6`; do find aggtests/pips/api/builder/ -name $file; done

Update: Rewrote to fix a tiny grep bug.
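For comparison, a sketch of the same idea that loops over the glob directly instead of parsing ls, letting basename do the work of the cut (same paths as above):

for file in aggtests/pips/api/v1/xml/*.t; do
    find aggtests/pips/api/builder/ -name "$(basename "$file")"
done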


fdupes, File::Find::Duplicate

brunov on 2009-10-06T10:46:15

Unless your definition of duplicate files includes files with the same name but possibly different contents, fdupes (available in Ubuntu, for instance) could be what you need. It uses md5sum to identify duplicate files, so it'll even work for identical files with different names.
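For example, a sketch run over the two directories from the original post (-r makes fdupes recurse; drop it if the subdirectories don't matter):

fdupes -r aggtests/pips/api/v1/xml aggtests/pips/api/builder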

A homegrown version can be found here (http://www.perlmonks.org/?node_id=703798); it uses File::Find::Duplicate.

Lots of hash based solutions

ajt on 2009-10-06T10:51:28

There are lots of hash-based solutions that can do this; some of them are easily written in Perl...

http://en.wikipedia.org/wiki/Fdupes also lists alternatives (including one of mine!).
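The core idea fits in a shell pipeline too (a sketch assuming GNU md5sum and uniq; -w32 tells uniq to compare only the 32-character checksum column, and -D prints every member of each duplicate group):

find aggtests/pips/api/v1/xml aggtests/pips/api/builder -type f -name '*.t' -exec md5sum {} + | sort | uniq -w32 -D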

uniq

mauzo on 2009-10-06T15:19:24

(ls aggtests/pips/api/v1/xml/*.t; ls aggtests/pips/api/builder/*.t) | sort | uniq -d

Re:uniq

merlyn on 2009-10-09T00:00:11

"ls aggtests/pips/api/v1/xml/*.t"

"ls... glob" FAIL. Please don't do that.

Re:uniq

merlyn on 2009-10-09T00:03:27

I just realized that message is probably insufficient. Here's the "dangerous use of ls" message, spelled out a bit better: http://groups.google.com/group/comp.unix.shell/msg/5d19dadaf9329f87
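The gist of it: ls output is just a newline-separated stream, so a filename containing whitespace or a newline is mangled before cut or the for loop ever sees it. A contrived bash illustration with a hypothetical filename:

touch $'odd\nname.t'                            # a file whose name contains a newline
for file in `ls *.t`; do echo "[$file]"; done   # prints "[odd]" then "[name.t]", neither a real file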

Re:uniq

mauzo on 2009-10-09T00:33:08

OK, so had I thought a bit more I might have written

echo .../*.t .../*.t | sort | uniq -d

I was aware of the other point, that filenames can contain special characters, but I tend to assume that 'my' files won't (unless I know that they do). If I were working on some arbitrary set of files I would have done the job in Perl. (I was going to say find -print0 and {sort,uniq} -z would work, but apparently (my) uniq doesn't have a -z option. Weird.) Thanks for the correction, though, since it's important to be aware of in general.

A more important bug I was also ignoring is that the length of the list of files may exceed ARG_MAX. Since this is one of Ovid's test directories, I presume that's not actually that unlikely :).
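A sketch of that route while staying in the shell: letting find emit the names keeps the list off the command line entirely, so ARG_MAX never enters into it (still not newline-safe, per the caveat above; the awk just strips the leading path):

find aggtests/pips/api/v1/xml aggtests/pips/api/builder -name '*.t' | awk -F/ '{print $NF}' | sort | uniq -d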

Re:uniq

Aristotle on 2009-10-09T14:07:28

echo .../*.t .../*.t | sort | uniq -d

That won’t do what you wanted because echo will output the whole shebang on a single line. What you want instead is

printf '%s\n' .../*.t .../*.t | sort | uniq -d

But then that still won’t do what you wanted, because you aren’t chopping the base path off the file names, so no two lines will have the same content anyway. You need to do something like this:

printf '%s\n' .../*.t .../*.t | cut -d/ -f2- | sort | uniq -d

Of course, as mentioned, that doesn’t account for the possibility of newlines in file names. And trying to do so is awkward since not all Unix utilities have switches to enable null- rather than newline-terminated records, uniq and cut among them.
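For what it's worth, GNU find and sort already speak NUL, and newer GNU coreutils releases do give uniq a -z/--zero-terminated switch (worth checking on any given system, so treat this as a sketch); find's -printf '%f\0' also emits just the basename, which removes the need for cut:

find aggtests/pips/api/v1/xml aggtests/pips/api/builder -name '*.t' -printf '%f\0' | sort -z | uniq -zd | tr '\0' '\n'   # tr only to make the result readable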

I'd just...

AndyArmstrong on 2009-10-06T18:01:24

...do a find in each directory and then diff the results.

( cd firstdir && find . -type f | sort > ~/first )
( cd seconddir && find . -type f | sort > ~/second )
diff ~/first ~/second

You can filter the output of the diff of course :)
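If the filtering you want is specifically "names present in both trees", comm does it directly on the sorted listings (a sketch; -12 suppresses the lines unique to either file):

comm -12 ~/first ~/second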