Add new Acorn-esque filtered HNSW search heuristic #14160
Conversation
@msokolov I'd like your opinion here. Do you think the behavior/result change is worth waiting for a major release? I do think folks should be able to use this now, but also be able to opt out. Another option I could think of is injecting a parameter directly into SPI loading for the HNSW vector readers, but I am not 100% sure how to do that. It does seem like it should be a "global" configuration for a given Lucene instance rather than one provided at query time.
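One way to read the "global configuration" idea above: a toggle resolved once per JVM rather than per query. This is purely an illustrative sketch, not Lucene's actual mechanism; the property name is made up.

```java
// Illustrative only: a "global" opt-out for a Lucene instance, read once
// from a (hypothetical) system property at class-load time. This is NOT
// how Lucene ended up exposing the feature.
public class FilteredHnswToggle {
    static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("lucene.hnsw.filteredSearch", "true"));

    public static boolean enabled() {
        return ENABLED;
    }
}
```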
EDIT: @benchaplin fixed a bug. Now the candidate looks way better. We should test on some more datasets, but it seems to me the new filtering approach may hold up all the way to 90% of vectors filtered out. I did some more testing, this time on a single segment of our nightly runs. The recall & latency pattern is much healthier with this change, though the recall is lower. The only reason the recall is so high for the restrictive filters is that the baseline over-eagerly drops to brute force because it spends way too much time doing vector comparisons.

Baseline:

Candidate:
Yeah, with the bugfix in place, this patch is looking like we don't even need it to be configurable. Let's confirm with more datasets @benchaplin
Pretty cool! Those numbers are looking worlds better. I have been running some tests with this notion of 'filter correlation to the query' (mikemccand/luceneutil#330). I'll run a benchmark for this new version and post back the results. I think this 'correlation' is important to test as I imagine many real-world filters involve some correlation, rather than the random filters we get in luceneutil benchmarks.
I agree, however, random is also generally useful for:
But I eagerly await your results. I am going to refactor this assuming we always have it on at a given threshold (I am leaning towards 60% allowed vectors or lower as the threshold).
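The threshold decision being discussed can be boiled down to a tiny gate. A minimal sketch, with illustrative names and the ~60% figure from this comment (not Lucene's actual code):

```java
// Hypothetical sketch: enable the ACORN-style filtered traversal only when
// the filter is selective enough. Names and the constant are illustrative.
public class FilteredSearchGate {
    // Fraction of allowed vectors at or below which the filtered
    // traversal is expected to pay off (per the discussion above).
    static final float FILTERED_SEARCH_THRESHOLD = 0.6f;

    /** True if the filtered heuristic should be used for this query's filter. */
    public static boolean useFilteredSearch(int acceptedCount, int totalVectors) {
        float selectivity = (float) acceptedCount / totalVectors;
        return selectivity <= FILTERED_SEARCH_THRESHOLD;
    }
}
```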
Baseline:
Candidate:
For me, the main story here is that the candidate's advantage weakens as the query becomes more positively correlated with the filter (towards "1.00" correlation), but it never gets worse than the baseline. I think this makes sense: in that case, once we're in the right small world, almost every neighbor passes the filter, so "predicate subgraph traversal" equals "normal total traversal" and the theoretical advantage disappears. Recall is bad for -1 correlation, but (recall / visited) is the same as the baseline. Also, I'm fairly sure the way I've set up -1 correlation (the filter is exactly the vectors with the worst scores with respect to the query) is not at all realistic, so maybe we can think of those tests as extreme edge-case stress testing. I agree ~0.5 selectivity seems to be a good cutoff for the new algorithm.
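For readers who want to reproduce this kind of experiment without the luceneutil code: a toy correlated-filter builder might look like the following. This is an assumed reconstruction of the idea ("+1 correlation" keeps the best-scoring docs, "-1" keeps the worst), not the actual benchmark code from mikemccand/luceneutil#330.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

// Toy version of a query-correlated filter for benchmarking: sort docs by
// their similarity to the query, then accept the top (or bottom) fraction.
public class CorrelatedFilter {
    /**
     * @param scores      per-doc similarity to the query
     * @param selectivity fraction of docs the filter should accept
     * @param positive    true = filter favors high scores (+1 correlation),
     *                    false = filter favors low scores (-1 correlation)
     */
    public static BitSet build(float[] scores, double selectivity, boolean positive) {
        int n = scores.length;
        int keep = (int) Math.round(n * selectivity);
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort doc ids best-first for +1 correlation, worst-first for -1.
        Arrays.sort(order, Comparator.comparingDouble(i -> positive ? -scores[i] : scores[i]));
        BitSet accepted = new BitSet(n);
        for (int i = 0; i < keep; i++) accepted.set(order[i]);
        return accepted;
    }
}
```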
@benchaplin I found another bug. The recall numbers were indeed way too good to be true. I was returning duplicate documents 🤦. So recall was great because we returned the same valid document many times. I have refactored and fixed multiple things, and am rerunning locally. I will replicate your findings for correlation. Is there anything else needed to replicate them other than your code in the luceneutil PR?
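The class of bug described here (the same doc collected more than once, inflating recall) is typically guarded against with a "collected" set checked before adding to the result heap. A generic illustration, not the actual fix in this PR:

```java
import java.util.BitSet;
import java.util.PriorityQueue;

// Illustrative guard against duplicate results: each doc id may enter the
// result heap at most once, so recall cannot be inflated by duplicates.
public class DedupCollector {
    private final BitSet collected = new BitSet();
    // Min-heap of [docId, score] pairs, ordered by score.
    private final PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));

    /** Returns true if the doc was newly collected, false if it was a duplicate. */
    public boolean collect(int docId, int score) {
        if (collected.get(docId)) {
            return false; // already in the results; skip
        }
        collected.set(docId);
        heap.add(new int[] {docId, score});
        return true;
    }

    public int size() {
        return heap.size();
    }
}
```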
@benwtrent Yep, everything's in the PR. I ran on 1M docs, 100 queries to keep the benchmark under an hour.
OK, the current implementation is about as good as I can figure it.
However, one thing that bothers me is that increasing
OK, I checked the current search, and it seems to have the same issue (increasing
This is a continuation and completion of the work started by @benchaplin in #14085
The algorithm is fairly simple:
Some of the changes to the baseline Acorn algorithm are:
Here are some numbers for 1M vectors, float32 and then int4 quantized.
https://docs.google.com/spreadsheets/d/1GqD7Jw42IIqimr2nB78fzEfOohrcBlJzOlpt0NuUVDQ
Here is the "nightly" dataset (but I merged to a single segment)
https://docs.google.com/spreadsheets/d/1gk1uybtqleVtDUfhWXActyhW8q_lgG1mlMrOohnJRJA
Since this changes the behavior significantly, and there are still some weird edge cases, I am exposing it as a parameter within a new abstraction called KnnSearchStrategy that collectors can provide. This strategy object can be provided to the queries.

closes: #13940
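To make the shape of that idea concrete: a strategy object flows from the query/collector down to the graph searcher, which picks a traversal based on it. The names and fields below are a hypothetical sketch and are not claimed to match the final Lucene KnnSearchStrategy API:

```java
// Hypothetical illustration of the query-time strategy idea. The record
// name mirrors the PR's abstraction, but fields and methods are made up.
public class StrategySketch {
    /** Strategy handed from the collector down to the graph searcher. */
    public record KnnSearchStrategy(boolean useFilteredHeuristic, float filteredThreshold) {
        public static final KnnSearchStrategy DEFAULT = new KnnSearchStrategy(true, 0.6f);
    }

    /** The searcher picks a traversal based on strategy + filter selectivity. */
    public static String chooseSearcher(KnnSearchStrategy strategy, float selectivity) {
        if (strategy.useFilteredHeuristic() && selectivity <= strategy.filteredThreshold()) {
            return "FilteredHnswGraphSearcher";
        }
        return "HnswGraphSearcher";
    }
}
```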