example: "simple" dedupe usage #2
Automatic deletion of potentially mismatched results (there can always be some, even at low thresholds) hasn't been a consideration yet. When I have a lot of deletions to make, I turn on the difference image (Z) and zip through them.

As for your use case: my first thought was that you could sort the result groups, look through them to make sure your idea is sane, and then apply the deletion from there. However, sorting for result groups is not implemented; they are always sorted by score. This is simple to add, but for now it means you can't try this.

There is a problem with this idea (besides the potential to delete false matches), which is the metric used to select the "best" duplicate. For example, because of up-scaling, a higher-resolution file might look worse. Or because of a sharpen filter, a lower compression might be worse. Or maybe you don't care and either is fine for the application (e.g. ML training).

I have an experimental "quality score" metric to try to solve this; press "Q" in the browser to compute it, and it shows in the lower right of the info box. If I could prove this is reasonable on a large set, maybe we can add it as a property to do this as you have suggested.
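For illustration only, here is a minimal sketch of the selection step being discussed, assuming the duplicate groups have already been obtained as lists of file paths. The grouping itself, the experimental quality score, and any tool-specific commands are outside this sketch; Pillow is assumed for reading image dimensions:

```python
# Illustrative only (not the tool's built-in behavior): pick one "keeper"
# per duplicate group by pixel count, then list the rest for deletion.
# Resolution is a crude proxy for quality; as noted above, an upscaled
# file can have more pixels yet look worse.
from PIL import Image  # Pillow

def pixel_count(path: str) -> int:
    with Image.open(path) as im:
        return im.width * im.height

def split_group(group: list[str]) -> tuple[str, list[str]]:
    """Return (keeper, losers) for one group of duplicate file paths."""
    keeper = max(group, key=pixel_count)
    losers = [p for p in group if p != keeper]
    return keeper, losers

# Example (hypothetical paths):
#   keeper, losers = split_group(["a_1920x1080.jpg", "a_640x360.jpg"])
#   print("keep:", keeper, "delete:", losers)
```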
Okay so to remove exact duplicates this seemed to work:
And for similar images this seemed reasonable:
Thanks! By the way, it would be nice if something like
would be possible? I'm assuming it is a limitation of the algorithm, but it would be nice to be a bit more granular.
Hey, I'm glad you found a solution, thanks for following up. I think you got lucky on the quality with the options you used.

As for the DCT hash, the distance function is integer, so that isn't an option. Granularity could be immediately improved by using a wider hash (currently 64 bits).
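A small sketch of why the threshold is coarse: comparing DCT-style perceptual hashes is typically a Hamming distance between two 64-bit values, so the distance can only take the integer values 0 through 64, giving 65 possible threshold steps. The hash values below are made up:

```python
# Hamming distance between two 64-bit hashes: the number of differing bits.
# Always an integer in [0, 64], so thresholds can only move in whole steps;
# a wider hash would add more steps (finer granularity).
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

h1 = 0xF0E1D2C3B4A59687  # made-up 64-bit hash values
h2 = 0xF0E1D2C3B4A59680
print(hamming(h1, h2))  # -> 3 (three differing bits)
```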
After indexing 10 million images, there are so many options I'd just like to pick something relatively sane/conservative and apply that selection rather than manually go through each duplicate.
Is there an easy way to select/nuke all duplicates after indexing with `-i.algos 1 -update`, and, of the copies in each group, preserve only the one with the highest `resolution` or `compressionRatio`?
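For reference, one rough way to compute the second property mentioned in the question is to treat compression ratio as stored bytes per pixel. This is only an assumption for illustration, not necessarily what the tool's `compressionRatio` property computes:

```python
# Rough bytes-per-pixel estimate; lower values mean heavier compression.
# As noted in the discussion above, lighter compression does not
# guarantee the better-looking copy.
import os
from PIL import Image  # Pillow

def bytes_per_pixel(path: str) -> float:
    with Image.open(path) as im:
        return os.path.getsize(path) / (im.width * im.height)

# Example (hypothetical path):
#   print(bytes_per_pixel("a_1920x1080.jpg"))
```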