example: "simple" dedupe usage #2

Open
chapmanjacobd opened this issue Jul 19, 2023 · 3 comments
Open

example: "simple" dedupe usage #2

chapmanjacobd opened this issue Jul 19, 2023 · 3 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@chapmanjacobd

After indexing 10 million images, there are so many options that I'd just like to pick something relatively sane/conservative and apply that selection, rather than manually go through each duplicate.

Is there an easy way to select/nuke all duplicates after indexing with -i.algos 1 -update, preserving only the copy in each group with the highest resolution or compressionRatio?
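
(For context, the index was built with roughly the command below, run from the image directory; as far as I can tell, cbird indexes the current directory by default.)

cbird -i.algos 1 -update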

@scrubbbbs added the question and enhancement labels on Jul 19, 2023
@scrubbbbs
Owner

Automatic deletion for potentially mismatched results (there can always be some, even at low thresholds) hasn't been a consideration yet. When I have a lot of deletions to make, I turn on the difference image (Z) and zip through them.

But as for your use case: my first thought was that you could sort the result groups, look through them to make sure your idea is sane, then use -first -nuke to take out the worst/lowest one, and repeat until none remain.

However, sorting of result groups is not implemented; they are always sorted by score. This is simple to add, but for now it means you can't try this.
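
If group sorting does get added, the loop would look something like this (hypothetical sketch; it assumes an ascending -sort counterpart to -sort-rev that applies to result groups):

cbird -similar -sort resolution -first -nuke

Each pass would delete the lowest-resolution copy from every group, and you would rerun it until -similar finds no more matches.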

There is a problem with this idea (besides the potential to delete false matches): the metric used to select the "best" duplicate. For example, because of up-scaling, a higher-resolution file might actually look worse. Or because of a sharpening filter, a less-compressed file might be the worse one. Or maybe you don't care and either is fine for the application (e.g. ML training).

I have an experimental "quality score" metric to try to solve this: press "Q" in the browser to compute it, and it shows in the lower right of the info box. If I could prove this is reasonable on a large set, maybe we could add it as a property to do what you have suggested.

@chapmanjacobd
Author

chapmanjacobd commented Jul 22, 2023

Okay so to remove exact duplicates this seemed to work:

cbird -dups -select-result -sort-rev resolution -chop -nuke

And for similar images this seemed reasonable:

cbird -p.dht 1 -similar -select-result -sort-rev resolution -chop -nuke
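
For anyone else trying this: the destructive part is just the trailing -nuke, so I assume the same selection can be dry-run first by leaving it off (I haven't verified exactly what gets printed; checking the matches in the GUI browser is probably the safest way):

cbird -p.dht 1 -similar -select-result -sort-rev resolution -chop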

The default sort, score, is good enough for my use. Not sure how much better "quality score" would be... In the images that I looked at, the quality score was always higher on the left-most copy.

Thanks!

Btw, it would be nice if something like

cbird -p.dht 0.5

were possible. I'm assuming it's a limitation of the algorithm, but it would be nice to be able to be a bit more granular.

@scrubbbbs
Owner

Hey, I'm glad you found a solution, thanks for following up.

I think you got lucky on the quality. When using -similar, the first item in each group is the needle/query image. The needle selection is uncontrolled; it's just the first one that appeared when scanning for matches, so at best there is a weak ordering from when it was indexed.

As for the DCT hash, the distance function is integer-valued, so that isn't an option. Granularity could be improved immediately by using a wider hash (currently 64 bits).
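
To make the granularity concrete, here's a rough sketch (not cbird's actual code) of what an integer hash distance looks like, assuming the usual Hamming-distance comparison of 64-bit perceptual hashes:

#include <cstdint>
#include <bit>  // std::popcount, C++20

// Distance between two 64-bit DCT hashes: the number of differing bits.
// A threshold like -p.dht 1 means "at most 1 bit differs", so only
// whole-number thresholds 0..64 are meaningful.
int dctHashDistance(uint64_t a, uint64_t b) {
    return std::popcount(a ^ b);
}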
