samefile - find identical files
samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]
samefile reads a list of filenames (one filename per line) from stdin. For each pair of files with identical contents, it outputs a line consisting of six fields: the size in bytes, the two filenames, the character "=" if the two files are on the same device or "X" otherwise, and the link counts of the two files. The output is sorted in reverse order by size as the primary key and by filename as the secondary key.
samefile uses two stages to give optimum performance.
In the first stage, all non-plain files are skipped (directories, devices, FIFOs, sockets, symbolic links), as are files for which stat(2) fails and files whose size is less than or equal to the size given with -g. The output of the first stage (the filenames) is written into a binary tree with one node for every file size. Checks for hard links are also done at this early stage. If hard links are found and -r is requested, the name pairs are output immediately. The whole list of hard-linked name pairs will therefore appear before any output of the second stage.
For any i-node, only one filename will be added to the binary tree (unless -i was requested).
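The first stage described above can be sketched as follows. This is an illustrative Python sketch, not samefile's actual source: the function name first_stage, the parameters min_size and keep_all_links (standing in for -g and -i), the default threshold of 0, and the use of a dictionary in place of the binary tree are all assumptions made for the example. Hard-linked pairs are collected in a list here rather than printed immediately.

```python
import os
import stat
from collections import defaultdict


def first_stage(filenames, min_size=0, keep_all_links=False):
    """Group plain files by size and collect hard-linked name pairs.

    Hypothetical helper: min_size plays the role of -g, keep_all_links
    the role of -i. Returns (by_size, hard_links).
    """
    by_size = defaultdict(list)   # size -> candidate filenames
    first_name = {}               # (device, i-node) -> first name seen
    hard_links = []               # (first name, later name) pairs
    for name in filenames:
        try:
            st = os.lstat(name)   # lstat: do not follow symbolic links
        except OSError:
            continue              # stat failed, skip the file
        if not stat.S_ISREG(st.st_mode):
            continue              # not a plain file
        if st.st_size <= min_size:
            continue              # size is not above the threshold
        key = (st.st_dev, st.st_ino)
        if key in first_name:
            hard_links.append((first_name[key], name))
            if not keep_all_links:
                continue          # only one name per i-node goes on
        else:
            first_name[key] = name
        by_size[st.st_size].append(name)
    return by_size, hard_links
```

Only sizes that end up with two or more names need any work in the second stage.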
In the second stage, all files having the same size are compared against each other. The rules of mathematical logic are applied to reduce work and output noise (unless -x is requested): if files a, b, and c have the same size and samefile finds that a = b and a = c, then it will not compare b against c (and will not output a line for b and c) but will only report a = b and a = c. Note, however, that because only the first filename per i-node gets into the second stage, the output for a group of identical files with different i-node numbers is also minimized. Suppose you have six identical files of size 100 in an i-node group consisting of the three i-nodes with numbers 10, 20 and 30 (the term "i-node group" has nothing to do with the i-node group notion of some file systems; it merely refers to a set of i-nodes addressing files with identical contents):
$ ls -i
10 file1    20 file4    30 file6
10 file2    20 file5
10 file3
$ ls | samefile
100 file1 file4 = 3 2
100 file1 file6 = 3 1
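The second-stage rule (once a = b and a = c are known, b is not compared against c) can be sketched like this. It is a minimal Python illustration under assumptions, not the program's real algorithm: second_stage and its exhaustive parameter (standing in for -x) are hypothetical names, and the byte-by-byte comparison is delegated to Python's filecmp module. In exhaustive mode this sketch simply compares every pair.

```python
import filecmp


def second_stage(names, exhaustive=False):
    """Return (a, b) pairs of identical files among same-sized names.

    Hypothetical helper: exhaustive=True mimics -x by reporting every
    identical pair instead of only pairs anchored at the first match.
    """
    pairs = []
    matched = set()   # names already found equal to an earlier file
    for i, a in enumerate(names):
        if a in matched and not exhaustive:
            continue  # a equals an earlier file; its pairs are implied
        for b in names[i + 1:]:
            if b in matched and not exhaustive:
                continue  # b already belongs to an earlier group
            if filecmp.cmp(a, b, shallow=False):  # compare contents
                pairs.append((a, b))
                matched.add(b)
    return pairs
```

For the listing above, the names reaching this stage would be file1, file4 and file6 (one per i-node), which yields the two pairs shown.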
The sum of the sizes in the first column is the amount of disk space you could gain by making all 6 files links to only one file or by removing all but one of the files. To be precise, disk space is allocated in blocks, so you will probably gain two blocks here rather than 200 bytes. Note that it is not enough to just remove file4 and file6 (you would gain only 100 bytes because file5 still exists). The proper way is to use the -i option. The output will look like:
100 file1 file2 = 3 3
100 file1 file3 = 3 3
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file1 file6 = 3 1
Removing all files listed in the third field will leave only file1. Making all files hard links to file1 is easy: if the fourth field is a "=", do a forced hard link. If you need to know about all combinations of identical files, use both the -i and -x options. This produces:
$ ls | samefile -ix
100 file1 file2 = 3 3
100 file1 file3 = 3 3
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file1 file6 = 3 1
100 file2 file3 = 3 3
100 file2 file4 = 3 2
100 file2 file5 = 3 2
100 file2 file6 = 3 1
100 file3 file4 = 3 2
100 file3 file5 = 3 2
100 file3 file6 = 3 1
100 file4 file5 = 2 2
100 file4 file6 = 2 1
100 file5 file6 = 2 1
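The bookkeeping described above (summing the first column of the default output, and deciding which names to hard link from the -i output) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, fields are assumed to be separated by whitespace with no whitespace inside filenames (the separator actually used is governed by -s), and the link plan is only returned, never executed.

```python
def parse_line(line, sep=None):
    """Split one output line into its six fields, or return None.

    Assumption: sep=None means whitespace-separated fields.
    """
    fields = line.strip().split(sep)
    if len(fields) != 6:
        return None
    size, first, second, same_dev, links1, links2 = fields
    return int(size), first, second, same_dev, int(links1), int(links2)


def reclaimable_bytes(lines, sep=None):
    """Sum the size column of samefile's default (non -i) output."""
    records = (parse_line(line, sep) for line in lines)
    return sum(rec[0] for rec in records if rec is not None)


def link_plan(lines, sep=None):
    """List (target, name) pairs where name could become a hard link.

    Only lines whose fourth field is "=" (same device) qualify.
    """
    plan = []
    for line in lines:
        rec = parse_line(line, sep)
        if rec is not None and rec[3] == "=":
            plan.append((rec[1], rec[2]))
    return plan
```

Applied to the two lines of default output shown earlier, reclaimable_bytes gives 200; applied to the five lines of -i output, link_plan pairs file1 with each of file2 through file6.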
Find all identical files in the current working directory:
$ ls | samefile
Find all identical files in my HOME directory and subdirectories and also tell me if there are hard links:
$ find $HOME -type f | samefile -r
Find all identical files in the /usr directory tree that are bigger than 10000 bytes and write the result to usr.dups (this one is for the sysadmin folks; you may want to put this command in the background with the ampersand, &, because it takes a few minutes):
$ find /usr -type f | samefile -g 10000 >usr.dups
You will see a short usage message if you use an invalid option.
ln(1), find(1), rm(1), df(1)
There are no known bugs. The source has been lint(1)ed and all possible care has been taken while coding. If you find a bug (or miss a feature), please contact the author.
The official samefile home page www.schweikhardt.net/samefile/ is maintained by the author Jens Schweikhardt - schweikh at schweikhardt dot net