Find duplicate files? Available tools and/or scripts?

I'm currently faced with the task of combing through a 15-year, 4 TB archive of files on a disk that sits comfortably in an OpenWrt-powered NAS. It's a veritable matryoshka doll of backups, backups within backups of backups, and loose files. I reckon about a third of the files on the disk are duplicates, but going through the forest of directories and finding those duplicates by hand is a rather tedious task.

A quick search turns up a few tools that should help find duplicate files, but so far I can't see any of them available as OpenWrt packages: fdupes may have been available a few years ago, but isn't anymore; fslint never was available; and neither was rdfind.

Has anyone else ever tackled a similar task, and if so, how? My "last resort" would be hooking the drive up to a "proper" Linux distribution and working it over from there, but I would rather do it on the NAS itself.

TIA!

I'd probably attack the problem with something like

find /mnt-point -type f -exec md5sum {} \; > ~/all-the-md5sums.txt

then

sort ~/all-the-md5sums.txt > ~/all-the-md5sums-sorted.txt

followed by a (Python) script that prints groups of file names with the same hash.
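Something along these lines could serve as that script -- a minimal, untested sketch that assumes the usual md5sum output of "<hash>  <path>" per line:

#!/usr/bin/env python3
# Minimal, untested sketch: read md5sum output and print only the hashes
# that occur more than once, together with all file names carrying them.
# usage: python3 group_dups.py ~/all-the-md5sums-sorted.txt
import sys
from collections import defaultdict

groups = defaultdict(list)

with open(sys.argv[1]) as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        md5, path = line.split(maxsplit=1)
        groups[md5].append(path)

for md5, paths in groups.items():
    if len(paths) > 1:
        print(md5)
        for path in paths:
            print("  " + path)
        print()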


That was my initial instinct, probably with a first step going by file sizes alone (md5 is costly and unnecessary if the file sizes don't even match), and then going over the filtered/uniquified result in a second step to get md5 sums. I was hoping, though, that there is something prefabricobbled ... but if there isn't, this seems to be a somewhat sensible way, if only to draw a map of where the duplicates are.
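Just to make the idea concrete, the size-only first pass would be something like this (an untested sketch; /mnt/archive is only a stand-in for wherever the disk is mounted):

#!/usr/bin/env python3
# Untested sketch of the "size first" idea: only files that share their
# exact size with at least one other file are candidates for md5-summing
# in a second pass.
import os
import sys
from collections import defaultdict

root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/archive"  # stand-in path

by_size = defaultdict(list)
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # broken symlink, permission problem, ...
        by_size[size].append(path)

# Print only the candidates: size, tab, path.
for size, paths in by_size.items():
    if len(paths) > 1:
        for path in paths:
            print(f"{size}\t{path}")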

Yeah, I did that some time ago. I decided that I was only doing it once in a long while, so I could fire up the find process and come back a day or two later rather than trying to "get fancy" (and then ending up needing md5sum anyway).

Exactly. I plan to do all of that once, in "The Great Data Cleanup of 2019". I guess the drive will have to sweat for a day or two then. :slight_smile:

(If it's all the same to you, I would like to keep the question open for a little bit longer before marking it solved, in case someone else has an even better idea.)



jdupes and rdfind are both available in Alpine Linux and compile without any patches, so both should be easy to package. dupd seems to need a minor one: https://aur.archlinux.org/cgit/aur.git/tree/test.patch?h=dupd-git


Thank you for the suggestions. However, in the time it takes me to set up a build system to build one of those, and quite possibly fiddle around getting them compiled, I'm pretty sure I'll be faster just md5'ing the whole drive (unattended) and writing a small script to analyze the result.

I was hoping that I overlooked one of those tools being available as a maintained package, but it seems that I didn't. And that's completely okay.

Depending on how you do it, you could do it (or some of it) from a PC with the NAS mapped. Sure, there will be overhead, but you will also have more out-of-the-box options there without affecting the physical setup.


No need for sort; awk should be all you need:

awk '{if (x[$1]) { x_count[$1]++; print $0; if (x_count[$1] == 1) { print x[$1] } } x[$1] = $0}' ~/all-the-md5sums.txt
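In case the one-liner reads like line noise: it remembers the first line seen for each hash and, as soon as a hash shows up again, prints the newcomer plus (the first time only) the remembered original. A rough, untested Python rendering of the same idea, fed the same file on stdin:

#!/usr/bin/env python3
# Rough, untested Python rendering of the awk one-liner above: print every
# line whose hash was already seen, and when a hash becomes a duplicate for
# the first time, also print the line that introduced it. No sorting needed.
# usage: python3 dups.py < ~/all-the-md5sums.txt
import sys

first_seen = {}   # hash -> first line carrying that hash
dup_count = {}    # hash -> number of duplicates printed so far

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    md5 = line.split(maxsplit=1)[0]
    if md5 in first_seen:
        dup_count[md5] = dup_count.get(md5, 0) + 1
        print(line)
        if dup_count[md5] == 1:
            print(first_seen[md5])
    else:
        first_seen[md5] = line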

Wow, that is quite a mouthful of an awk statement, I will have to try to decode that one. :slight_smile:

In the meantime I did two rather trivial things: I installed a proper find (findutils-find; busybox-find doesn't support -printf) and went over the files twice, once just for file size and once for md5 (to save time I skipped the large video and audio file formats; those I reckon I can judge by their filename and size alone). That should give me a good dataset to find duplicates.
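Roughly, the md5 pass boils down to this (a sketch only -- the real run used find plus md5sum, and the extension list here is just an example of the kind of formats I skip):

#!/usr/bin/env python3
# Sketch of the second pass: md5 everything except the big media formats,
# which get judged by name and size alone. The extension list is only an
# example, not exactly what I used.
import hashlib
import os
import sys

SKIP_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".mp3", ".flac", ".cr2"}

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/archive"  # stand-in path
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        if os.path.splitext(name)[1].lower() in SKIP_EXTENSIONS:
            continue
        path = os.path.join(dirpath, name)
        try:
            # Same output format as md5sum: "<hash>  <path>"
            print(f"{md5_of(path)}  {path}")
        except OSError:
            pass  # unreadable file, vanished file, ...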

Thank you everyone for your input. I truly appreciate it.

Sort lets you easily output the complete list of duplicates for each file, so you can decide which of the two, three, ... you wish to retain.

JFTR: I will have to do more than sort or awk over the list. From a cursory analysis I have something like 200'000 duplicate files (there are a lot of duplicates of my work, complete WordPress installs and such).

The disk space the duplicates take up is actually not the problem -- that would be "cheap" and could be solved by a bigger disk. I need some sort of "roadmap" to sort out the mess of duplicate directories and duplicate backups. With the md5/filesize dataset and a bit of programming I should be able to get that now.

JFTR, 2: I will mark the topic as solved and I will mark Jeff's initial post as the solution, even though all of your comments were helpful and on point. I can only mark one. :slight_smile:

If the duplicate files tend to be in folders of similar structure, then something like Meld (for PC) and the like can make it easier. You could probably use a mix of this and that.

A bit late to the party, but it was pretty much straightforward :slight_smile:



Not at all. Thank you for your work on this. Do you intend to have them added to the official repository, where they would be built for everyone? Again, not trying to be lazy here, but buildroot and all, cough cough -- I just lack anything close to a routine in compiling stuff for myself.

In the meantime I came up with a kludge for myself. I used vanilla find (the full one, not the stripped-down busybox version) to get a list of all files with their sizes, and in a second pass to get md5s for all but a few select file extensions (I figured the huge files -- videos, audio and camera raws -- I can compare by name and size alone). I spent maybe an hour writing a set of scripts to parse the resulting lists and find/group duplicates, and with a little bit of recursive magic also duplicate directories containing the same files. That was already a huge help. But given that I will probably have to do all of this a few more times in a few passes, a dedicated tool is certainly the smarter way to go about it, if only because its comparison process is far more efficient.
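For the curious, the "recursive magic" is nothing fancy. The sketch below shows the idea (not my actual script; the function and path names are made up): every directory gets a signature derived from the names and hashes of its contents, children before parents, so two directory trees holding the same files end up with the same signature.

#!/usr/bin/env python3
# Sketch of the duplicate-directory idea (not the actual script). A
# directory's signature is the md5 of the sorted (name, hash) pairs of
# everything directly inside it; subdirectories contribute their own
# signatures, so identical trees collapse to identical signatures.
import hashlib
import os
from collections import defaultdict

def directory_signatures(file_md5s, root):
    """file_md5s maps absolute, normalized file paths to their md5."""
    children = defaultdict(list)  # directory -> list of (name, hash)

    for path, md5 in file_md5s.items():
        children[os.path.dirname(path)].append((os.path.basename(path), md5))
        # Make sure every ancestor directory up to root is known, even
        # if it holds no files of its own, only subdirectories.
        d = os.path.dirname(path)
        while d != root and d.startswith(root):
            children.setdefault(os.path.dirname(d), [])
            d = os.path.dirname(d)

    signatures = {}
    # Deepest directories first, so parents see their children's signatures.
    for directory in sorted(children, key=lambda d: d.count(os.sep), reverse=True):
        digest = hashlib.md5()
        for name, item_hash in sorted(children[directory]):
            # Including names means a renamed copy won't match; drop `name`
            # from the digest if content alone should count.
            digest.update(f"{name}:{item_hash}\n".encode())
        signatures[directory] = digest.hexdigest()
        parent = os.path.dirname(directory)
        if parent != directory and parent in children:
            children[parent].append((os.path.basename(directory), signatures[directory]))
    return signatures

# Directories sharing a signature are duplicate candidates, e.g.:
# by_sig = defaultdict(list)
# for directory, sig in directory_signatures(file_md5s, "/mnt/archive").items():
#     by_sig[sig].append(directory)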

The issue is the "maintainership", which indirectly requires end-user support that I can't provide in a timely manner and, to be honest, have little interest in -- which in turn makes me reluctant to submit packages (I have quite a few in my repo by now). On top of that, my OpenWrt device fleet is slowly shrinking, and with it the time I actually spend "using" it, which makes it even more time-consuming to debug non-obvious bugs. To be clear, I'm not saying that I don't care about the end user; it's just that I don't have time to do the handholding stuff.

Right now I have one PR open, being very clear about this intention, but we'll see how it pans out in the end.
PR 9475 in the Packages repo, if you want to have a look.

Also, you're welcome :slight_smile:

If you want, I can do a build for mvebu or ipq40xx (snapshots).
