# dupfilefinder

Python 3 duplicate file finder

#### Usage

    dedup.py [-v] path1 path2 ...

Scans the given paths for duplicate files.
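A minimal sketch of how this command line could be parsed with `argparse`; the argument names are illustrative, and `-v` is assumed here to mean verbose (the README does not say):

```python
import argparse

parser = argparse.ArgumentParser(
    prog="dedup.py",
    description="Scan paths for duplicate files.")
parser.add_argument("-v", action="store_true",
                    help="verbose output (assumed meaning)")
parser.add_argument("paths", nargs="+",
                    help="one or more directories to scan")
args = parser.parse_args()
```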

The process is as follows (a code sketch of these steps appears after the list):

1. Gather all files by size (`map[size] = [file list]`).
2. Read the first 1 KB of each same-size file and hash it with SHA-256.
3. Group files that share both the same size and the same 1 KB hash.
4. If the files in a group are less than 1 KB in size, the probe hash already covers the whole file, so mark them as duplicates.
5. If the files in a group are more than 1 KB in size, enqueue them for further testing.
6. For each file in a group formed by size and 1 KB hash that is larger than 1 KB, compute a new hash over the entire file.
7. Form new groups based on the full-file hashes.
8. If the full-file hashes match, mark the files as duplicates.
9. Order by largest file first; mark the earliest file as the "original" and all others as duplicates.
10. Report the results.
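Below is a minimal sketch of steps 1-8 using only the standard library. The function names (`find_duplicates`, `hash_first_1k`, `hash_full`) are illustrative and not taken from the repository's `dedup.py`:

```python
import hashlib
import os
from collections import defaultdict

PROBE = 1024  # size of the initial probe read, per step 2


def hash_first_1k(path):
    """SHA-256 of the first 1 KB of a file (the cheap probe hash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(PROBE))
    return h.hexdigest()


def hash_full(path):
    """SHA-256 of the entire file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of duplicate files found under the given paths."""
    # Step 1: gather all files by size (map[size] = [file list]).
    by_size = defaultdict(list)
    for root in paths:
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for size, files in by_size.items():
        if len(files) < 2:
            continue  # a unique size cannot have duplicates
        # Steps 2-3: probe-hash the first 1 KB and regroup.
        by_probe = defaultdict(list)
        for p in files:
            by_probe[hash_first_1k(p)].append(p)
        for group in by_probe.values():
            if len(group) < 2:
                continue
            if size <= PROBE:
                # Step 4: the probe covered the whole file.
                duplicates.append(group)
            else:
                # Steps 5-8: larger files get a full-file hash.
                by_full = defaultdict(list)
                for p in group:
                    by_full[hash_full(p)].append(p)
                duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates
```

Each returned group would then be ordered and reported as in steps 9-10.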

#### Skip patterns

- skip files < 100 bytes
- skip thumbnails (a filter sketch follows the list)
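A minimal sketch of such a filter; the README does not define which patterns count as thumbnails, so `THUMBNAIL_NAMES` below is an assumed example set:

```python
import os

# Assumed example set; the actual thumbnail patterns are not
# specified in the README.
THUMBNAIL_NAMES = {"Thumbs.db", ".DS_Store"}


def should_skip(path):
    """Return True for files the scan ignores: tiny files and thumbnails."""
    if os.path.getsize(path) < 100:  # skip files < 100 bytes
        return True
    return os.path.basename(path) in THUMBNAIL_NAMES
```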

#### Discussion

This process identifies duplicates efficiently: the cheap size comparison and the 1 KB probe hash eliminate most candidates before any file has to be read in full. A sample run:

    Scanned 54917 files, found 32420 duplicate files, time: 20 seconds