# dupfilefinder

Python 3 duplicate file finder

#### Usage

    dedup.py [-v] path1 path2 ...

Scans the given paths for duplicate files.
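A minimal sketch of how this command line could be parsed with `argparse`; the argument names are illustrative, and `-v` is assumed here to mean verbose (the README does not say):

```python
import argparse

parser = argparse.ArgumentParser(
    prog="dedup.py",
    description="Scan paths for duplicate files.")
parser.add_argument("-v", action="store_true",
                    help="verbose output (assumed meaning)")
parser.add_argument("paths", nargs="+",
                    help="one or more directories to scan")
args = parser.parse_args()
```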

The process is as follows (a code sketch of these steps appears after the list):

1. Gather all files by size (`map[size] = [file list]`).
2. Read the first 1 KB of each same-size file and hash it with SHA-256.
3. Group files that share both the same size and the same 1 KB hash.
4. If the files in a group are less than 1 KB in size, the probe hash already covers the whole file, so mark them as duplicates.
5. If the files in a group are more than 1 KB in size, enqueue them for further testing.
6. For each file in a group formed by size and 1 KB hash that is larger than 1 KB, compute a new hash over the entire file.
7. Form new groups based on the full-file hashes.
8. If the full-file hashes match, mark the files as duplicates.
9. Order by largest file first; mark the earliest file as the "original" and all others as duplicates.
10. Report the results.
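Below is a minimal sketch of steps 1-8 using only the standard library. The function names (`find_duplicates`, `hash_first_1k`, `hash_full`) are illustrative and not taken from the repository's `dedup.py`:

```python
import hashlib
import os
from collections import defaultdict

PROBE = 1024  # size of the initial probe read, per step 2


def hash_first_1k(path):
    """SHA-256 of the first 1 KB of a file (the cheap probe hash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(PROBE))
    return h.hexdigest()


def hash_full(path):
    """SHA-256 of the entire file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of duplicate files found under the given paths."""
    # Step 1: gather all files by size (map[size] = [file list]).
    by_size = defaultdict(list)
    for root in paths:
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for size, files in by_size.items():
        if len(files) < 2:
            continue  # a unique size cannot have duplicates
        # Steps 2-3: probe-hash the first 1 KB and regroup.
        by_probe = defaultdict(list)
        for p in files:
            by_probe[hash_first_1k(p)].append(p)
        for group in by_probe.values():
            if len(group) < 2:
                continue
            if size <= PROBE:
                # Step 4: the probe covered the whole file.
                duplicates.append(group)
            else:
                # Steps 5-8: larger files get a full-file hash.
                by_full = defaultdict(list)
                for p in group:
                    by_full[hash_full(p)].append(p)
                duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates
```

Each returned group would then be ordered and reported as in steps 9-10.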

#### Skip patterns

- skip files < 100 bytes
- skip thumbnails (a filter sketch follows the list)
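A minimal sketch of such a filter; the README does not define which patterns count as thumbnails, so `THUMBNAIL_NAMES` below is an assumed example set:

```python
import os

# Assumed example set; the actual thumbnail patterns are not
# specified in the README.
THUMBNAIL_NAMES = {"Thumbs.db", ".DS_Store"}


def should_skip(path):
    """Return True for files the scan ignores: tiny files and thumbnails."""
    if os.path.getsize(path) < 100:  # skip files < 100 bytes
        return True
    return os.path.basename(path) in THUMBNAIL_NAMES
```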

#### Discussion

This process identifies duplicates efficiently: the cheap size comparison and the 1 KB probe hash eliminate most candidates before any file has to be read in full. A sample run:

    Scanned 54917 files, found 32420 duplicate files, time: 20 seconds