example: "simple" dedupe usage #2

Open
chapmanjacobd opened this issue Jul 19, 2023 · 3 comments
Open

example: "simple" dedupe usage #2

chapmanjacobd opened this issue Jul 19, 2023 · 3 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@chapmanjacobd

After indexing 10 million images, there are so many options that I'd just like to pick something relatively sane/conservative and apply that selection, rather than manually go through each duplicate.

Is there an easy way to select/nuke all duplicates after indexing with -i.algos 1 -update, preserving only the copy in each group with the highest resolution or compressionRatio?
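
(For context, the index was built with roughly the command below, run from the image directory; as far as I can tell, cbird indexes the current directory by default.)

cbird -i.algos 1 -update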

@scrubbbbs added the question and enhancement labels on Jul 19, 2023
@scrubbbbs
Owner

Automatic deletion for potentially mismatched results (there can always be some, even at low thresholds) hasn't been a consideration yet. When I have a lot of deletions to make, I turn on the difference image (Z) and zip through them.

But as for your use case: my first thought was that you could sort the result groups, look through them to make sure your idea is sane, then use -first -nuke to take out the worst/lowest one, and repeat until none remain.

However, sorting of result groups is not implemented; they are always sorted by score. This is simple to add, but for now it means you can't try this.
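
If group sorting does get added, the loop would look something like this (hypothetical sketch; it assumes an ascending -sort counterpart to -sort-rev that applies to result groups):

cbird -similar -sort resolution -first -nuke

Each pass would delete the lowest-resolution copy from every group, and you would rerun it until -similar finds no more matches.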

There is a problem with this idea (besides the potential to delete false matches): the metric used to select the "best" duplicate. For example, because of up-scaling, a higher-resolution file might actually look worse. Or because of a sharpening filter, a less-compressed file might be the worse one. Or maybe you don't care and either is fine for the application (e.g. ML training).

I have an experimental "quality score" metric to try to solve this: press "Q" in the browser to compute it, and it shows in the lower right of the info box. If I could prove this is reasonable on a large set, maybe we could add it as a property to do what you have suggested.

@chapmanjacobd
Author

chapmanjacobd commented Jul 22, 2023

Okay so to remove exact duplicates this seemed to work:

cbird -dups -select-result -sort-rev resolution -chop -nuke

And for similar images this seemed reasonable:

cbird -p.dht 1 -similar -select-result -sort-rev resolution -chop -nuke
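
For anyone else trying this: the destructive part is just the trailing -nuke, so I assume the same selection can be dry-run first by leaving it off (I haven't verified exactly what gets printed; checking the matches in the GUI browser is probably the safest way):

cbird -p.dht 1 -similar -select-result -sort-rev resolution -chop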

The default sort, score, is good enough for my use. Not sure how much better "quality score" would be... In the images that I looked at, the quality score was always higher on the left-most copy.

Thanks!

Btw, it would be nice if something like

cbird -p.dht 0.5

were possible. I'm assuming it's a limitation of the algorithm, but it would be nice to be able to be a bit more granular.

@scrubbbbs
Owner

Hey, I'm glad you found a solution, thanks for following up.

I think you got lucky on the quality. When using -similar, the first item in each group is the needle/query image. The needle selection is uncontrolled; it's just the first one that appeared when scanning for matches, so at best there is a weak ordering from when it was indexed.

As for the DCT hash, the distance function is integer-valued, so that isn't an option. Granularity could be improved immediately by using a wider hash (currently 64 bits).
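
To make the granularity concrete, here's a rough sketch (not cbird's actual code) of what an integer hash distance looks like, assuming the usual Hamming-distance comparison of 64-bit perceptual hashes:

#include <cstdint>
#include <bit>  // std::popcount, C++20

// Distance between two 64-bit DCT hashes: the number of differing bits.
// A threshold like -p.dht 1 means "at most 1 bit differs", so only
// whole-number thresholds 0..64 are meaningful.
int dctHashDistance(uint64_t a, uint64_t b) {
    return std::popcount(a ^ b);
}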
