Dear Judges, in the remote event that this entry makes it to the final round: DO NOT award it the 'Best Utility'. How about 'Most Likely to Ruin Disk Manufacturers'? Be creative !-) He who says disk space is free thinks money grows on directory tree. The source is expected to conform to IEEE Std 1003.1-1990 ("POSIX"). Thank God the IEEE does not standardize a coding style... What this program does ---------------------- This is best explained in the man page: SAMEFILE(1) User Manuals SAMEFILE(1) NAME samefile - find identical files SYNOPSIS samefile DESCRIPTION samefile reads a list of file names (one file name per line) from stdin. For each file name pair with identical con- tents, a line consisting of six tab separated fields is out- put: The size in bytes, two file names, the character ``='' if the two files are on the same device, ``X'' otherwise, and the link counts of the two files. The output is sorted in reverse order by size samefile uses two stages to give optimum performance. In the first stage, all non-plain files are silently ignored (directories, devices, FIFOs, sockets, symbolic links) as well as files for which stat(2) or open(2) fails. The result of the first stage (the file names) is written into a binary tree with one node for every file size. Each node is in turn a linked list of file names along with inode and device information. It is also at this first stage where checks for hard links are done. For any inode only one filename will be added to the binary tree. In the second stage all files having the same size are com- pared against each other. The rules of mathematical logic are applied to reduce work and output noise: if files a, b, and c have the same size and samefile finds that a = b and a = c then it will not compare b against c (and will not out- put a line for b and c) but only for a = b and a = c. The algorithm used will detect equality across arbitrarily long chains. Note however, that because only the first filename per inode gets into the second stage, the output for a group of identical files with different inode numbers is also minimized. Suppose you have six identical files of size 100 in an inode group consisting of the three inodes with numbers 10, 20 and 30: $ ls -i 10 file1 20 file4 30 file6 10 file2 20 file5 10 file3 $ ls | samefile 100 file1 file4 = 3 2 100 file1 file6 = 3 1 The sum of the sizes in the first column is the amount of disk space you could gain by making all 6 files links to only one file or remove all but one of the files. To be precise, disk space is allocated in blocks - you will prob- ably gain two blocks here, rather than 200 bytes. Note that it is not enough to just remove file4 and file6 (you would gain only 100 bytes because file5 still exists.) LIMITATIONS Samefile was written with no limits in mind. The number of input lines is unlimited. The size of the actual files is only limited by available virtual memory needed to compare one pair of files. The only hard limit is the constraint that there should not be more than about 8192 files having the same size. Experience has shown that there are rarely more than a couple dozen files of the same size. EXAMPLES For everybody: What are the duplicates under my home directory? $ find $HOME | samefile For the sysadmin folks: Report all duplicate files under /usr larger than 16k: $ find /usr -size +16384c -a -type f | samefile For the ftp and WWW admins: How much space is wasted below our site's /pub directory? $ find /pub -type f | samefile | awk '{sum += $1} END {print sum}' EXIT STATUS If samefile runs out of memory the exit status is 1, zero otherwise. SEE ALSO find(1), ln(1), rm(1) IOCCC Last change: OCTOBER 1997 2 Why I think it is obfuscated ---------------------------- - one character names make the source incomprehensible - the obligatory multilevel macro madness with unbalanced parentheses and operators without operands to confuse indent - A few macros come in from the command line - use of a few trigraphs. You *do* know your trigraphs, don't you? - use of a few digraphs. You *do* know your Amendment 1, don't you? - the code is commented but this gives only an *extremely* faint clue - the layout does not reflect the program structure, or does it? - The data structure the program builds is an ordered binary tree where each node consists of a linked list. This is probably too heavy to comprehend for N percent of all programmers ;-) - The tree is built and traversed by recursive function calls. - Why does this program look at dev_t and ino_t? Heavy wizardry. - Can you find out how the algorithm works that detects identity of files across arbitrarily long chains? In my private unobfuscated source there are 17 lines describing just that. A classical case for the well known /* You're not supposed to understand this */ comment. - Without the digraphs the source lints clean under Solaris' lint. - Dijkstra would say "goto considered harmful" - The only string literal is split in more than a dozen substrings - For the task at hand the code is pretty terse. I have tried to make the preprocessor directives as short as possible. This means no spaces after #include and no spaces after some of the macro names.