Having been so meticulous about taking backups, I’ve perhaps not been as careful about where I stored them, so I now have loads of duplicate files in various places. I’ve tried various tools (fdupes, czkawka, etc.), but none seems to do what I want… I need a tool that I can tell which folder (and subfolders) is the source of truth, that looks for duplicates of those files anywhere else, and that gives me the option to move or delete them. Seems simple enough, but I have found nothing that allows me to do that… Does anyone know of anything?
Write a simple script which iterates over the files and generates a hash list, with the hash in the first column.
find . -type f -exec md5sum {} \; >> /tmp/foo
Repeat for the backup files.
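Then make a third file by concatenating the two, sort that file, and run “uniq -w 32 -d” on it (GNU uniq), so that only the 32-character hash column is compared; a plain “uniq -d” only matches lines whose paths are identical too. The output will tell you which files are duplicated.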
You can take the output of uniq and de-duplicate.
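For what it's worth, the steps above could be sketched roughly like this. It is only a sketch under some assumptions: hash_tree and list_dupes are made-up helper names, it needs GNU md5sum and GNU uniq (for -w and -D), and it will not cope with file names that contain newlines or backslashes.

```shell
# Hypothetical helper: print "hash  path" for every file under directory $1.
hash_tree() {
    find "$1" -type f -exec md5sum {} +
}

# Hypothetical helper: list every file whose content appears more than once
# across the two trees $1 and $2.
# -w 32 makes uniq compare only the 32-character MD5 column;
# -D prints every member of each duplicate group, not just the first.
list_dupes() {
    { hash_tree "$1"; hash_tree "$2"; } | sort | uniq -w 32 -D
}
```

Calling list_dupes with the original folder and the backup folder prints the hash and path of every duplicated file, grouped by hash. Nothing is changed on disk.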
Thanks @speculatrix - I wish I had your confidence in scripting - hence I’m hoping to find something that does all that clever stuff for me… The key thing for me is to be able to say something like: multimedia/photos/ is the source of truth, and anything found elsewhere is a duplicate…
I wish I had your confidence in scripting
You know how you get it? By fucking around and finding out! I’d say give it a go!
Do a dry run of the de-dup to make sure you don’t delete anything you care about.
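A dry run with a source of truth might look something like the sketch below. It only prints the files outside the source of truth whose content also exists inside it, and deletes nothing. The name dupes_outside_truth is made up for illustration, and the sketch assumes GNU md5sum, two folders that do not overlap, and file names without newlines or backslashes.

```shell
# Hypothetical helper: print every file under $2 (the "everything else" tree)
# whose content also exists somewhere under $1 (the source of truth).
# This is a dry run: it lists candidates and never deletes or moves anything.
dupes_outside_truth() {
    truth=$1
    other=$2
    truth_hashes=$(mktemp)

    # Collect the set of MD5 hashes present in the source of truth.
    find "$truth" -type f -exec md5sum {} + | cut -c1-32 | sort -u > "$truth_hashes"

    # Hash the other tree and print the path (column 35 onwards in md5sum
    # output) of each file whose hash is in that set.
    find "$other" -type f -exec md5sum {} + |
        awk -v list="$truth_hashes" '
            BEGIN { while ((getline h < list) > 0) seen[h] = 1 }
            substr($0, 1, 32) in seen { print substr($0, 35) }'

    rm -f "$truth_hashes"
}
```

Something like dupes_outside_truth multimedia/photos /some/other/folder would then give a list to read through before anything is removed. Moving the listed files to a holding folder first, rather than deleting them outright, keeps the step reversible.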
Give me a few years and maybe :P - but for now I’d rather not risk important data with my own limited skills, especially if there is a product out there that’s tried and tested and hopefully recommended by someone in this sub… I didn’t expect my ask to be quite so unique…
I’ve used dupeGuru on Windows for cleaning up my photos, worked great for that. Has a GUI and also works on Linux!
https://dupeguru.voltaicideas.net/
Thanks - I think I tried that - but at the time it had no concept of a source (location) of truth to preserve and find duplicates against - has that changed? They don’t seem to mention that specific capability on that link.
I think you’re asking for a duplicate finder that can tell where a file came from (the source of truth?).
Most duplicate finders work by hashing the files and looking for matches. If the file indicated where it came from it would have a different hash and not be found to be a duplicate.
So I don’t think what you’re asking for can be done. But I’m not sure I understand what you’re asking.
If you’re 100% sure that the dupes are only between your source of truth and “everything else”, you can run fdupes and then grep -v /path/to/source/of/truth/root the output. All the file paths that remain are duplicate files outside your source of truth, which can be deleted.
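As a rough sketch of that filtering step: the helper below reads fdupes-style output (one path per line, groups separated by blank lines) and keeps only the paths outside the source of truth. The name outside_truth is made up, and it swaps plain grep -v for an awk prefix match so that only paths that start with the folder are dropped, not paths that merely contain it somewhere. The same caveat applies: a file that is duplicated only outside the source of truth will have all its copies listed.

```shell
# Hypothetical helper: filter fdupes-style output on stdin, dropping blank
# separator lines and every path that starts with the folder given as $1.
outside_truth() {
    awk -v prefix="$1/" 'length($0) > 0 && index($0, prefix) != 1'
}

# Example usage (placeholder paths; -r makes fdupes recurse into subfolders):
#   fdupes -r /path/to/source/of/truth/root /path/to/everything/else \
#       | outside_truth /path/to/source/of/truth/root
```

The output is just a list of paths, so it can be saved to a file and checked by eye before anything gets deleted.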
Only runs on Windows, but I’ve been using Double Killer for years. Simple and does the trick.
Thanks @CrappyTan69 - I ideally need this to run on my NAS, and if possible be opensource/free - looks like for what I’d need Double Killer for, it’s £15/$20 - maybe an option as a last resort…