A utility that analyzes the output of the fdupes Linux utility to find the level of overlap between directories. Written in R. https://github.com/codecliff/FdupesAnalyzer
fdupes by Adrián López gives you a file-by-file list of duplicates. It works very well with renamed copies, files exported by image editors, and the like. However, to clean up a large dump of files accumulated over years by multiple users, I needed to see things like "70% of files in dir A are also in dir B", "dir A has copies of all the files in dir B", etc. This utility script creates a CSV file with all this information.
- Run fdupes and redirect the results to a file:
  fdupes -Sr rootpath >> fdupes_output.txt
- Edit the R script FDupesParser.R and update the paths for the output file and rootpath.
- Run the R script (preferably in interactive mode, preferably in RStudio).
- Go over the CSV file generated by the script.
- (Optional) Generate fdupes commands for each directory pair and run as a batch
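The parsing step in this workflow can be sketched roughly as follows. This is not the code in FDupesParser.R; it is a minimal base-R illustration. It assumes that `fdupes -S` output consists of a size header line (such as "1024 bytes each:"), then one path per line, then a blank line closing each duplicate set. The sample lines, directory names, and variable names (`dupes`, `pair_counts`) are made up, and a "match" is counted here once per duplicate set shared by two directories, which may differ from how the actual script counts.

```r
# Hypothetical sample of `fdupes -Sr` output (assumed format, not real results).
lines <- c(
  "1024 bytes each:",
  "./A/x.jpg",
  "./B/x_copy.jpg",
  "",
  "2048 bytes each:",
  "./A/y.jpg",
  "./B/y.jpg",
  "./C/y.jpg",
  ""
)

# Give every line a set id: the id increases at each size header.
is_header <- grepl("^[0-9]+ bytes? each:$", lines)
set_id <- cumsum(is_header)

# Keep only file paths (drop headers and blank separators).
keep <- !is_header & nzchar(lines)
dupes <- data.frame(set = set_id[keep], path = lines[keep],
                    stringsAsFactors = FALSE)
dupes$dir <- dirname(dupes$path)

# For each duplicate set, list every unordered pair of distinct directories.
pairs_per_set <- lapply(split(dupes$dir, dupes$set), function(dirs) {
  d <- sort(unique(dirs))
  if (length(d) < 2) return(NULL)  # all copies sit in one directory
  p <- t(combn(d, 2))
  data.frame(dir1 = p[, 1], dir2 = p[, 2], stringsAsFactors = FALSE)
})
pairs <- do.call(rbind, pairs_per_set)

# Count how many duplicate sets each directory pair shares.
pairs$matchcnt <- 1L
pair_counts <- aggregate(matchcnt ~ dir1 + dir2, data = pairs, FUN = sum)
print(pair_counts)
```

To run the same sketch on real data, `lines` could be read with `readLines("fdupes_output.txt")` instead of being typed in.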
The CSV file has the following columns:
- "dir1" : directory 1
- "dir2" : directory 2
- "matchcnt" : number of files matching between dir1 and dir2
- "acnt" : file count in dir1
- "bcnt" : file count in dir2
- "aprct" : percent of files in dir1 that have a copy in dir2
- "bprct" : percent of files in dir2 that have a copy in dir1
- "maxprct" : the larger of aprct and bprct
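As a worked illustration of how the percentage columns above relate to the counts, here is a small base-R sketch. The numbers are invented, and it assumes percentages on a 0 to 100 scale computed as matchcnt divided by each directory's file count; the actual script may scale or round differently.

```r
# Hypothetical counts for one directory pair (not real results).
row <- data.frame(dir1 = "./A", dir2 = "./B",
                  matchcnt = 7L, acnt = 10L, bcnt = 28L,
                  stringsAsFactors = FALSE)

# Share of dir1's files that also exist in dir2, and vice versa.
row$aprct <- 100 * row$matchcnt / row$acnt
row$bprct <- 100 * row$matchcnt / row$bcnt

# The larger of the two shares; useful for sorting the most redundant pairs first.
row$maxprct <- pmax(row$aprct, row$bprct)
print(row)
```

With these numbers, 70% of the files in dir1 have a copy in dir2, while only 25% of dir2 is duplicated in dir1, so maxprct is 70.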
Examples of generated fdupes commands:
sudo fdupes -dN "./imgs/music" "./imgs/2018-03-oldccombk/stuff/"
sudo fdupes -dN "./ntfs/2017-backup/weds" "./IMAGES/Pictures_2017/.mail_downloads"
sudo fdupes -dN "./IMAGES/Picture/weds" "./IMAGES/Pictures_2017/oldlaptop_hdd"
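Commands like the ones above could be generated from the CSV rows along these lines. This is a hedged sketch rather than the script's own logic: the data frame `res`, the `threshold` cutoff, and the second directory pair are hypothetical, and only the `-d` (delete) and `-N` (no prompt, keep the first file of each set) flags come from fdupes itself. Because `-dN` deletes files without asking, the generated lines are best reviewed before being run.

```r
# Hypothetical rows, shaped like the CSV described earlier (not real results).
res <- data.frame(
  dir1 = c("./imgs/music", "./docs/a"),
  dir2 = c("./imgs/2018-03-oldccombk/stuff/", "./docs/b"),
  maxprct = c(100, 40),
  stringsAsFactors = FALSE
)

# Assumed cutoff: only act on pairs where one side is almost fully duplicated.
threshold <- 90
sel <- res[res$maxprct >= threshold, ]

# Quote each path for the shell so spaces and special characters survive.
cmds <- sprintf("fdupes -dN %s %s",
                shQuote(sel$dir1, type = "sh"),
                shQuote(sel$dir2, type = "sh"))
writeLines(cmds)
```

The resulting lines can be written to a file with `writeLines(cmds, "dedupe.sh")` (a hypothetical file name) and inspected before being executed as a batch.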
Requirements:
- R
- R packages: data.table, tools
- fdupes
Versions used:
- Ubuntu 18.04
- R 3.6.2
- RStudio 1.1.463