Rdfind – redundant data find

Introduction

Rdfind is a program that finds duplicate files. It is useful for compressing backup directories or just finding duplicate files. It compares files based on their content, NOT on their file names.

When I want to change some file, I am often nervous about breaking something and therefore copy all the old files to some directory named app_2006xxxx or whatever. The same happens when I switch computer systems and am afraid of losing my old stuff. This makes all my files exist in numerous places, and I never feel like cleaning up. This is where rdfind comes in handy. It will find those files and report them to you. Optionally, it can erase them or replace them with links (hard or symbolic). Rdfind is a command line tool – that means there is no GUI.

Install

There are precompiled packages for newer versions of Debian and Ubuntu. Installation is as easy as

$apt-get install rdfind

for those distributions (details are a bit further down this page). If you are on Mac, you can install through MacPorts. If you want to compile the source yourself, that is fine. Rdfind is written in C++ and should compile under any *nix. Rdfind is currently running under Linux, Mac OS X, Solaris and Windows (using Cygwin).

Getting the source code

The packages are signed with the keys indicated in the Key column of the table below.

Old key, ID 0xAB0234EB: rdfind0xAB0234EB.asc

Older (expired) key, ID 0x509CCB46: rdfind0x509CCB46.asc

Version  File                 Signature                Key         Checksum (SHA1)
1.3.4    rdfind-1.3.4.tar.gz  rdfind-1.3.4.tar.gz.asc  0x533B6030  c01bd2910cdec885b6c24164a389457e4f01ef61
1.3.3    rdfind-1.3.3.tar.gz  rdfind-1.3.3.tar.gz.asc  0x533B6030  70ce33c6c393ba309dc4791c73489a73652a0be6
1.3.2    rdfind-1.3.2.tar.gz  rdfind-1.3.2.tar.gz.asc  0x533B6030  4893904f895400faa9ca0ea042a97eb4536a820e
1.3.1    rdfind-1.3.1.tar.gz  rdfind-1.3.1.tar.gz.asc              c596e9e0d059e37135c9db62904426e37c879885
1.3.0    rdfind-1.3.0.tar.gz  rdfind-1.3.0.tar.gz.asc              18a0fab3bd6951aa342d9385c3bc13bf615e1253

Note to self: export pkg=rdfind-x.x.x.tar.gz; sha1sum $pkg; gpg -u 0x533B6030 -a -b $pkg
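The checksum of a downloaded tarball can also be compared with the published value without sha1sum. The following is a minimal sketch in Python, not part of rdfind; the function names are made up for illustration, and the tarball is assumed to be in the current directory. It only checks integrity. Checking the signature still requires gpg.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare the computed digest with a published hex string."""
    return sha1_of_file(path) == published_hex.strip().lower()


if __name__ == "__main__":
    # Assumption: the tarball has already been downloaded to this directory.
    tarball = Path("rdfind-1.3.4.tar.gz")
    # Checksum for 1.3.4 as listed on this page.
    published = "c01bd2910cdec885b6c24164a389457e4f01ef61"
    if tarball.exists():
        ok = matches_published_checksum(tarball, published)
        print("checksum OK" if ok else "checksum MISMATCH")
    else:
        print("download", tarball, "first")
```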

Installing via debian package repository

Rdfind is available as an official package in Debian and Ubuntu. This applies to the latest releases (as of 20110724: Debian Wheezy and Ubuntu Oneiric). Just execute (sudo) apt-get install rdfind to install it. For older versions of Debian and Ubuntu, one can use the repository maintained by Salvatore Ansani, who has been kind enough to generate packages for Debian. Thanks! See http://ansani.it/my-debian-repository/ for details.

Installing from source (generic)

Installing from source requires the nettle library. Note that the nettle library is available as a precompiled package in Ubuntu, Debian and Mac OS X (via MacPorts). It might be easier to install it using one of those systems than installing it from source.

Here is how to get and install nettle from source. Please check for the current version before copying the instructions below:

wget ftp://ftp.lysator.liu.se/pub/security/lsh/nettle-1.14.tar.gz -nc
wget ftp://ftp.lysator.liu.se/pub/security/lsh/nettle-1.14.tar.gz.asc -nc
wget ftp://ftp.lysator.liu.se/pub/security/lsh/distribution-key.gpg -nc
gpg --fast-import distribution-key.gpg                   # omit if you do not want to verify
gpg --verify nettle-1.14.tar.gz.asc nettle-1.14.tar.gz   # omit if you do not want to verify
tar -xzvf nettle-1.14.tar.gz
cd nettle-1.14                                           # build inside the unpacked directory
./configure
make
su # Only if you have root privileges. See note below.
make install
exit
cd ..                                                    # back to the parent directory

If you install nettle as non-root, you must create a link in the rdfind directory so that rdfind later can do #include "nettle/nettle_header_files.h" correctly. Use for instance the commands

cd nettle-1.14
cd ..
ln -s nettle-1.14 nettle

The next step is to build rdfind.

Download the source code using one of the links in the table above under ”Getting the source code”.

Build rdfind:

[pauls@localhost tmp]$ tar xvf rdfind-1.3.1.tar.gz
[pauls@localhost tmp]$ cd rdfind-1.3.1/
[pauls@localhost rdfind-1.3.1]$ ./configure    # see note below
[pauls@localhost rdfind-1.3.1]$ make
[pauls@localhost rdfind-1.3.1]$ su             # if you have root privileges
[pauls@localhost rdfind-1.3.1]$ make install

Note that if nettle is not installed in a standard place, you might need to pass LDFLAGS=-L../path/to/nettle/library CPPFLAGS=-I../path/to/nettle_headerfiles to configure.

A Solaris 10 user reports that rdfind compiles fine with the configure line ./configure CPPFLAGS=-I/usr/local/include LDFLAGS="-L/usr/local/lib -lrt"
The same user reports that the -lrt flag is probably necessary for compiling on SPARC as well.

Usage

The syntax is

rdfind [options] directory_or_file_1 [directory_or_file_2] [directory_or_file_3] ...

Without options, a results file will be created in the current directory. For full options, see the man page.

Examples

Basic example, taken from a *nix environment:
Look for duplicate files in directory /home/pauls/bilder:

[pauls@localhost ~]$ rdfind /home/pauls/bilder/
Now scanning "/home/pauls/bilder", found 3301 files.
Now have 3301 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size...removed 3 files
Total size is 2861229059 bytes or 3 Gib
Now sorting on size:removed 3176 files due to unique sizes.122 files left.
Now eliminating candidates based on first bytes:removed 8 files.114 files left.
Now eliminating candidates based on last bytes:removed 12 files.102 files left.
Now eliminating candidates based on md5 checksum:removed 2 files.100 files left.
It seems like you have 100 files that are not unique
Totally, 24 Mib can be reduced.
Now making results file results.txt
[pauls@localhost ~]$                  

From the last row, it is seen that there are 100 files that are not unique. Let us examine them by looking at the newly created results.txt:

[pauls@localhost ~]$ cat results.txt
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURENCE 960 3 4872 2056 5948858 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg.gtmp.jpg
DUPTYPE_WITHIN_SAME_TREE -960 3 4872 2056 5932098 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg
.
(intermediate rows removed)
.
DUPTYPE_FIRST_OCCURENCE 1042 2 7904558 2056 6209685 1 /home/pauls/bilder/digitalkamera/skridskotur040103/skridskotur040103 014.avi
DUPTYPE_WITHIN_SAME_TREE -1042 3 7904558 2056 327923 1 /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/skridskotur040103 014.avi
# end of file

Consider the last two rows. They say that the file skridskotur040103 014.avi exists both in /home/pauls/bilder/digitalkamera/skridskotur040103/ and /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/. I can now remove the one I consider a duplicate by hand if I want to.
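For larger result files it can be handy to group the rows by script. Here is a minimal sketch in Python, not part of rdfind. It assumes the column layout from the header comment in the listing, and it assumes (judging from the sample rows only) that duplicates carry the negated id of their first occurrence. The function name parse_results is made up for illustration. File names containing newlines would break this simple line-based parsing.

```python
from collections import defaultdict

# Two rows copied from the listing above, plus the comment lines.
SAMPLE = """\
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURENCE 1042 2 7904558 2056 6209685 1 /home/pauls/bilder/digitalkamera/skridskotur040103/skridskotur040103 014.avi
DUPTYPE_WITHIN_SAME_TREE -1042 3 7904558 2056 327923 1 /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/skridskotur040103 014.avi
# end of file
"""


def parse_results(lines):
    """Group result rows by the absolute value of the id column."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The name is the last column and may contain spaces,
        # so split at most seven times.
        duptype, ident, _depth, size, _dev, _inode, _prio, name = line.split(" ", 7)
        groups[abs(int(ident))].append((duptype, int(size), name))
    return dict(groups)


if __name__ == "__main__":
    for ident, rows in parse_results(SAMPLE.splitlines()).items():
        print("group", ident)
        for duptype, size, name in rows:
            print("  ", duptype, size, name)
```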

Algorithm

Rdfind uses the following algorithm. If N is the number of files to search through, the effort required is in the worst case O(N log(N)). Because it sorts files on inodes prior to disk reading, it is quite fast. It also only reads from disk when it is needed.

  1. Loop over each argument on the command line. Assign each argument a priority number, in increasing order.

  2. For each argument, list the directory contents recursively and assign it to the file list. Assign a directory depth number, starting at 0 for every argument.

  3. If the input argument is a file, add it to the file list.

  4. Loop over the list, and find out the sizes of all files.

  5. If flag -removeidentinode true: Remove items from the list which already are added, based on the combination of inode and device number. A group of files that are hardlinked to the same file is collapsed to one entry. Also see the comment on hardlinks under ”Caveats/features” below!

  6. Sort files on size. Remove files from the list, which have unique sizes.

  7. Sort on device and inode (speeds up file reading). Read a few bytes from the beginning of each file (first bytes).

  8. Remove files from list that have the same size but different first bytes.

  9. Sort on device and inode (speeds up file reading). Read a few bytes from the end of each file (last bytes).

  10. Remove files from list that have the same size but different last bytes.

  11. Sort on device and inode (speeds up file reading). Perform a checksum calculation for each file.

  12. Only keep files on the list with the same size and checksum. These are duplicates.

  13. Sort list on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.

  14. If flag ”-makeresultsfile true”, then print results file (default). Exit.(?)

  15. If flag ”-deleteduplicates true”, then delete (unlink) duplicate files. Exit.

  16. If flag ”-makesymlinks true”, then replace duplicates with a symbolic link to the original. Exit.

  17. If flag ”-makehardlinks true”, then replace duplicates with a hard link to the original. Exit.
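The elimination steps above can be illustrated with a short Python sketch. This is not rdfind's actual code (rdfind is written in C++). It leaves out the sorting on device and inode, the hardlink handling and all actions on the duplicates. The helper names, the 64 byte window and the use of SHA-1 as checksum are assumptions made for the illustration.

```python
import hashlib
import os
from collections import defaultdict


def split_by(files, key):
    """Group files by key(path) and drop groups with a single member."""
    buckets = defaultdict(list)
    for path in files:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def first_bytes(path, count=64):
    with open(path, "rb") as handle:
        return handle.read(count)


def last_bytes(path, count=64):
    with open(path, "rb") as handle:
        handle.seek(max(0, os.path.getsize(path) - count))
        return handle.read(count)


def checksum(path):
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(arguments):
    """Return groups of identical files, the presumed original first."""
    rank = {}  # path -> (priority of the argument, directory depth)
    for priority, argument in enumerate(arguments):
        if os.path.isfile(argument):
            rank.setdefault(argument, (priority, 0))
            continue
        for dirpath, _dirnames, filenames in os.walk(argument):
            relative = os.path.relpath(dirpath, argument)
            depth = 0 if relative == "." else relative.count(os.sep) + 1
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    rank.setdefault(path, (priority, depth))

    # Start with all non-empty files in one group, then refine the groups
    # with increasingly expensive comparisons.
    groups = [[path for path in rank if os.path.getsize(path) > 0]]
    for key in (os.path.getsize, first_bytes, last_bytes, checksum):
        groups = [sub for group in groups for sub in split_by(group, key)]
    return [sorted(group, key=rank.__getitem__) for group in groups]


if __name__ == "__main__":
    for group in find_duplicates(["."]):
        print(group[0], "has duplicates:", group[1:])
```

The cheap tests (size, first bytes, last bytes) run first, so the checksum is only computed for files that are still candidates.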

Alternatives and benchmark

There are some interesting alternatives.

Duff: http://duff.sourceforge.net/ by Camilla Berglund.

Fslint: http://www.pixelbeat.org/fslint/ by Pádraig Brady

A search on ”finding duplicate files” will give you lots of matches.

Here is a small benchmark. Times are obtained from ”elapsed time” in the time command. The command has been repeated several times in a row, where the result from each run is shown in the table below. As the operating system has a cache for data written/read to the disk, the consecutive calls are faster than the first call. The test computer is a 3 GHz PIV with 1 GB RAM and a Maxtor SATA disk with 8 MB cache, running Mandriva 2006.

Program        Command line
duff 0.4       time ./duff -rP dir >slask.txt
Fslint 2.14    time ./findup dir >slask.txt
Rdfind 1.1.2   time rdfind dir

Test case 1: Directory with 3301 files (2782 MB of JPEGs) in a directory structure, from which 100 files (24 MB) are redundant.

Run   duff 0.4   Fslint 2.14   Rdfind 1.1.2
1     0:01.55    0:02.59       0:00.49
2     0:01.61    0:02.66       0:00.50
3     0:01.58    0:02.58       0:00.49

Test case 2: Directory with 35871 files (5325 MB) in a directory structure, from which 10889 files (233 MB) are redundant.

Run   duff 0.4   Fslint 2.14   Rdfind 1.1.2
1     3:24.90    1:26.37       0:29.37
2     0:46.48    1:16.36       0:07.81
3     0:46.20    1:15.38       0:06.24
4     0:45.31    0:53.20       0:06.17

Note: units are minutes:seconds

Caveats/features

A group of files hardlinked to a single inode is collapsed to a single entry if -removeidentinode true. If you have two equal files (inodes) and two or more hardlinks for one or more of the files, the behaviour might not be what you expect. Each group is collapsed to a single entry, and that single entry will be hardlinked, symlinked or deleted depending on the options you pass to rdfind. This means that rdfind detects and corrects one file at a time. Running it multiple times resolves the situation.

This was discovered by a user with a ”hardlinks and rsync” type of backup system. Many backup scripts use that technique, and Apple Time Machine also uses hardlinks. If a file is moved within the backed-up tree, one gets one group of hardlinked files from before the move and another from after the move. Running rdfind on the entire tree then has to be done multiple times if -removeidentinode true. Here is an example demonstrating the behaviour:

$echo abc>a
$ln a a1
$ln a a2
$cp a b
$ln b b1
$ln b b2
$stat --format="name=%n inode=%i nhardlinks=%h" a* b*
name=a inode=18 nhardlinks=3
name=a1 inode=18 nhardlinks=3
name=a2 inode=18 nhardlinks=3
name=b inode=19 nhardlinks=3
name=b1 inode=19 nhardlinks=3
name=b2 inode=19 nhardlinks=3
#everything is as expected.

$rdfind -removeidentinode true -makehardlinks true ./a* ./b*
$stat --format="name=%n inode=%i nhardlinks=%h" a* b*
name=a inode=58930 nhardlinks=4
name=a1 inode=58930 nhardlinks=4
name=a2 inode=58930 nhardlinks=4
name=b inode=58930 nhardlinks=4
name=b1 inode=58931 nhardlinks=2
name=b2 inode=58931 nhardlinks=2

a, a1 and a2 got collapsed into a single entry. b, b1 and b2 got
collapsed into a single entry. So rdfind is left with a and b
(depending on which of them is received first by the * expansion).
It replaces b with a hardlink to a. b1 and b2 are untouched.

If one runs rdfind repeatedly, the issue is resolved, one file being
corrected every run.
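The collapsing step itself can be reproduced with a few lines of Python. This is a minimal sketch, not rdfind's code. It recreates the six files from the example in a temporary directory and keeps one path per (device, inode) pair. The function names are made up for illustration, and the filesystem is assumed to support hard links.

```python
import os
import tempfile


def collapse_hardlinks(paths):
    """Keep the first path seen for every (device, inode) pair."""
    seen = {}
    for path in paths:
        info = os.stat(path)
        seen.setdefault((info.st_dev, info.st_ino), path)
    return list(seen.values())


def demo():
    """Recreate a, a1, a2, b, b1, b2 and return the names that survive."""
    with tempfile.TemporaryDirectory() as workdir:
        def make(name):
            return os.path.join(workdir, name)

        for original, links in (("a", ("a1", "a2")), ("b", ("b1", "b2"))):
            with open(make(original), "w") as handle:
                handle.write("abc\n")
            for link in links:
                os.link(make(original), make(link))  # same inode, new name

        names = ["a", "a1", "a2", "b", "b1", "b2"]
        survivors = collapse_hardlinks([make(name) for name in names])
        return [os.path.basename(path) for path in survivors]


if __name__ == "__main__":
    print(demo())  # ['a', 'b'] on a filesystem that supports hard links
```

Only a and b remain after the collapse, so a single run can only relink one of the b names. That is why repeated runs are needed.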



Feature requests

From time to time, I get suggestions and get to know about interesting use cases. I will start to collect them here, which might lead to even better suggestions.

Handle massive amounts of hardlinks correctly

Users with files having lots of hardlinks (approximately more than 65000 on ext4) will get into trouble when more files are hard linked to such files. Handling this would let rdfind survive the situation by creating a new group of hard linked files. Suggested by J 20121008.

Use database instead of ram for file list

Having lots of files may exhaust system memory. Letting rdfind use a database for the file list instead of memory reduces the load. This will of course be slower, but could be made optional. An additional benefit is that the results can be put in the database. Suggested by Andy Smith 20121010.

Optionally require user and permissions to match

If two different users have equal files, hard linking causes the files to have the same user and permissions afterwards. Adding options -matchuser and -matchperms allows these files to be removed from the deduplication process. Suggested by Andy Smith 20121010.
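As an illustration of what such a check could look like, here is a minimal Python sketch. It is not an existing rdfind feature. The function names are hypothetical, and the sketch simply splits a group of already identified duplicates so that only files with the same owner and permission bits stay together.

```python
import os
import stat


def ownership_key(path):
    """Return (owner uid, permission bits) for a path."""
    info = os.stat(path)
    return info.st_uid, stat.S_IMODE(info.st_mode)


def split_by_owner_and_mode(duplicates):
    """Split one group of identical files into subgroups that share
    owner and permissions. Files left alone are dropped, since there
    is nothing to link them to."""
    buckets = {}
    for path in duplicates:
        buckets.setdefault(ownership_key(path), []).append(path)
    return [group for group in buckets.values() if len(group) > 1]
```

With three identical files where two have mode 644 and one has mode 600, only the two files with mode 644 would remain candidates for hard linking.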

Cooperate with btrfs deduplication

btrfs has a tool (under development) called bedup for internal deduplication on btrfs filesystems. Letting it operate on groups of files found by rdfind would make it possible for btrfs to save some space. Suggested by XX 20130206.

Control minimum size of files

Sometimes it is of interest to ignore small files. rdfind already ignores empty files by default. The suggestion is to replace the -ignoreempty flag (or complement it) with a -minsize N flag where N is the minimum size in bytes. Using N=0 would mean empty files are included. This feature is planned to be included in a future release. Suggested by Andrew Buehler 20131130.

Two step action

Instead of running with dry-run first and a second invocation without the dry-run, the following was suggested:

This would be useful, but it introduces other problems. The format of the results file must be able to handle file names with newlines etc. A parser must be written which handles syntax errors, missing files etc. This makes me reluctant to implement such a feature. If so, it should probably be coordinated with the database suggestion. Suggested by VB 20140123.

Author

Rdfind is written by Paul Dreik (previously Sundvall). If you find this software useful, please drop me an email! The address is x@y.z where x=rdfind, y=pauldreik, z=se.

Suggestions and comments are very welcome.