Add new Acorn-esque filtered HNSW search heuristic #14160

Open
wants to merge 37 commits into main

Conversation


benwtrent (Member) commented Jan 22, 2025

This is a continuation and completion of the work started by @benchaplin in #14085

The algorithm is fairly simple:

  • Only score and then explore vectors that actually match the filtering criteria
  • Since this makes the graph effectively sparser, the search spread is widened to also include each candidate's neighbors' neighbors (generally up to maxConn * maxConn nodes explored)
  • Additionally, more scored candidates are retained for a given NSW to combat the increased sparsity (see the sketch after this list)
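
Here is a minimal, self-contained sketch of that flow. This is not the actual Lucene code: the Graph and Scorer interfaces and all names are hypothetical stand-ins, used only to illustrate the control flow.

import java.util.BitSet;
import java.util.PriorityQueue;

// Sketch only: hypothetical stand-ins, not Lucene's real graph/scorer APIs.
public class FilteredHnswSketch {

  interface Graph {
    int[] neighbors(int node); // adjacency of a node on the current graph level
  }

  interface Scorer {
    float score(int node); // similarity of the node's vector to the query vector
  }

  static int[] search(Graph graph, Scorer scorer, BitSet filter, int entryPoint, int k) {
    // min-heap of [score, node]: the current top-k, worst result on top
    PriorityQueue<float[]> topK = new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
    // max-heap of [score, node]: candidates still to be expanded, best on top
    PriorityQueue<float[]> candidates = new PriorityQueue<>((a, b) -> Float.compare(b[0], a[0]));
    BitSet visited = new BitSet();

    visited.set(entryPoint);
    candidates.add(new float[] {scorer.score(entryPoint), entryPoint}); // seed the search

    while (!candidates.isEmpty()) {
      float[] cand = candidates.poll();
      // standard greedy termination: stop when the best remaining candidate
      // cannot improve the current top-k
      if (topK.size() >= k && cand[0] < topK.peek()[0]) break;

      for (int n : graph.neighbors((int) cand[1])) {
        if (visited.get(n)) continue;
        visited.set(n);
        if (filter.get(n)) {
          // only filter-matching vectors are scored and explored further
          collect(n, scorer, topK, candidates, k);
        } else {
          // a non-matching neighbor is never scored, but its own neighbors are
          // probed so the filtered subgraph stays connected: this is the
          // neighbors' neighbors expansion (~maxConn * maxConn nodes)
          for (int nn : graph.neighbors(n)) {
            if (!visited.get(nn) && filter.get(nn)) {
              visited.set(nn);
              collect(nn, scorer, topK, candidates, k);
            }
          }
        }
      }
    }
    // return the matching doc ids of the top-k (unordered here)
    return topK.stream().mapToInt(r -> (int) r[1]).toArray();
  }

  private static void collect(
      int node, Scorer scorer, PriorityQueue<float[]> topK, PriorityQueue<float[]> candidates, int k) {
    float s = scorer.score(node);
    candidates.add(new float[] {s, node});
    topK.add(new float[] {s, node});
    if (topK.size() > k) topK.poll(); // evict the current worst
  }
}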

Some of the changes to the baseline Acorn algorithm are:

  • There is a general filtering threshold above which this algorithm is bypassed altogether; the suggested default is 60% of documents matching.
  • The number of additional neighbors explored is predicated on the percentage of the immediate neighborhood that is filtered out.
  • Even more extended neighborhoods are explored when the filter is exceptionally restrictive, attempting to find valid vectors to score and explore for every candidate.
  • Extended neighbors are only examined when less than 90% of the current neighborhood matches the filter (both checks are sketched after this list).
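
A minimal sketch of that gating, assuming hypothetical method names; the thresholds mirror the defaults suggested above:

// Sketch only: illustrative gating, not the PR's actual code.
public class FilteredSearchGating {
  static final double FILTERED_SEARCH_THRESHOLD = 0.60; // bypass the algorithm above this match ratio
  static final double NEIGHBORHOOD_MATCH_CUTOFF = 0.90; // extended neighbors only below this

  /** Use the ACORN-style search only when the filter is restrictive enough. */
  static boolean useFilteredSearch(int matchingDocs, int totalDocs) {
    return (double) matchingDocs / totalDocs < FILTERED_SEARCH_THRESHOLD;
  }

  /** Reach out to neighbors-of-neighbors only when most immediate neighbors are filtered out. */
  static boolean exploreExtendedNeighborhood(int matchingNeighbors, int totalNeighbors) {
    return (double) matchingNeighbors / totalNeighbors < NEIGHBORHOOD_MATCH_CUTOFF;
  }
}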

Here are some numbers for 1M vectors, float32 and then int4 quantized.

https://docs.google.com/spreadsheets/d/1GqD7Jw42IIqimr2nB78fzEfOohrcBlJzOlpt0NuUVDQ

Here is the "nightly" dataset (but I merged to a single segment)

https://docs.google.com/spreadsheets/d/1gk1uybtqleVtDUfhWXActyhW8q_lgG1mlMrOohnJRJA

Since this changes the behavior significantly, and there are still some weird edge cases, I am exposing it as a parameter via a new abstraction called KnnSearchStrategy that collectors can provide. This strategy object can also be provided to the queries.
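
For illustration, usage could look roughly like this; the exact class shape and constructors here are my reading of the PR and may differ from what finally lands:

import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.knn.KnnSearchStrategy;

// Hypothetical usage sketch; exact signatures may differ from the final API.
public class StrategyUsageSketch {
  static Query filteredKnn(float[] queryVector, Query filter) {
    // a strategy carrying the filtered-search threshold: bypass the new
    // algorithm when 60% or more of the documents match the filter
    KnnSearchStrategy strategy = new KnnSearchStrategy.Hnsw(60);
    return new KnnFloatVectorQuery("vector", queryVector, 100, filter, strategy);
  }
}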

closes: #13940

benwtrent added this to the 10.2.0 milestone Jan 22, 2025
@benwtrent (Member Author)

@msokolov I wonder what your opinion is here?

Do you think the behavior/result change is worth waiting for a major release? I do think folks should be able to use this now, but also be able to opt out.

Another option I could think of is injecting a parameter directly into SPI loading for the HNSW vector readers, but I am not 100% sure how to do that. It does seem like this should be a "global" configuration for a given Lucene instance rather than one provided at query time.


benwtrent (Member Author) commented Jan 30, 2025

EDIT: ignore these numbers for the candidate; they contain a bug...

I ran this over the "nightly" dataset (8M 768-dim vectors) with no force merging; I believe this matches the nightly behavior. I ran over various filter criteria (I think nightly uses 5%).

BASELINE

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 1.000       110.216  8000000   100      50    79846        0.010
 0.982       137.185  8000000   100      50   215393        0.050
 0.974        85.933  8000000   100      50   144953        0.100
 0.965        73.476  8000000   100      50    86333        0.200
 0.958        58.347  8000000   100      50    64055        0.300
 0.952        34.021  8000000   100      50    51634        0.400
 0.944        32.818  8000000   100      50    43643        0.500
 0.940        29.538  8000000   100      50    38200        0.600
 0.936        26.965  8000000   100      50    34205        0.700
 0.930        25.453  8000000   100      50    30989        0.800
 0.926        23.585  8000000   100      50    28482        0.900
 0.924        23.926  8000000   100      50    27318        0.950
 0.922        23.306  8000000   100      50    26481        0.990
CANDIDATE

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 0.640        28.972  8000000   100      50    10709        0.010
 0.855        34.103  8000000   100      50    20845        0.050
 0.908        37.990  8000000   100      50    36339        0.100
 0.922        47.513  8000000   100      50    54472        0.200
 0.903        46.094  8000000   100      50    56451        0.300
 0.894        41.164  8000000   100      50    52235        0.400
 0.870        30.850  8000000   100      50    36989        0.500
 0.881        28.043  8000000   100      50    34102        0.600
 0.896        27.725  8000000   100      50    33346        0.700
 0.904        25.472  8000000   100      50    31135        0.800
 0.913        23.670  8000000   100      50    26715        0.900
 0.918        23.148  8000000   100      50    26193        0.950
 0.922        22.982  8000000   100      50    26425        0.990

The goal is generally "higher recall with fewer visited". A nice single value to show this is recall/visited: as visited decreases or recall increases, that value gets higher, so higher is better.

I graphed this ratio (multiplied by 100,000 to make the values saner looking):

[chart: recall/visited ratio (×100,000) vs. selectivity, baseline vs. candidate]

So, on nightly, this shows the ratio is significantly improved, by as much as 5x.
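
To make the metric concrete using the tables above: at 0.010 selectivity the baseline ratio is 1.000 / 79846 × 100,000 ≈ 1.25, while the candidate's is 0.640 / 10709 × 100,000 ≈ 5.98, which is where the roughly 5x comes from.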

I am currently force-merging and attempting to re-run.

Here is some more data for the candidate only, at 0.05 filtering with increasing fanout:

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 0.855        29.257  8000000   100      50    20845        0.050
 0.859        30.215  8000000   100      60    21514        0.050
 0.862        31.189  8000000   100      70    22134        0.050
 0.866        31.998  8000000   100      80    22718        0.050
 0.868        32.896  8000000   100      90    23294        0.050
 0.871        33.569  8000000   100     100    23877        0.050
 0.873        29.677  8000000   100     110    24447        0.050
 0.875        34.983  8000000   100     120    24978        0.050
 0.877        34.644  8000000   100     130    25494        0.050
 0.879        36.034  8000000   100     140    26015        0.050
 0.881        36.557  8000000   100     150    26533        0.050
 0.883        36.708  8000000   100     160    27034        0.050
 0.884        36.946  8000000   100     170    27534        0.050
 0.886        38.691  8000000   100     180    27999        0.050
 0.888        39.257  8000000   100     190    28503        0.050
 0.890        39.152  8000000   100     200    28955        0.050
 0.891        40.726  8000000   100     210    29453        0.050
 0.892        41.062  8000000   100     220    29895        0.050
 0.893        40.994  8000000   100     230    30319        0.050
 0.895        41.713  8000000   100     240    30736        0.050
 0.896        42.321  8000000   100     250    31180        0.050


benwtrent (Member Author) commented Jan 31, 2025

EDIT: @benchaplin fixed a bug. Now the candidate looks way better. We should test on some more datasets, but it seems to me the new filtered search can maybe be used all the way up to 90% filtered out.

I did some more testing, this time on a single segment of our nightly runs. The recall & latency pattern is much healthier with this change, though the recall is lower. The only reason the baseline's recall is so high for the restrictive filters is that it over-eagerly drops to brute force after spending way too much time doing vector comparisons.
[chart: recall vs. latency across filter selectivities, baseline vs. candidate, single segment]

BASELINE

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 1.000       131.763  8000000   100      50    79814        0.010
 0.924        50.518  8000000   100      50    53003        0.050
 0.912        18.970  8000000   100      50    30095        0.100
 0.896        10.697  8000000   100      50    16942        0.200
 0.884         7.509  8000000   100      50    12057        0.300
 0.876         5.763  8000000   100      50     9476        0.400
 0.869         4.792  8000000   100      50     7905        0.500
 0.863         4.184  8000000   100      50     6777        0.600
 0.858         3.781  8000000   100      50     5966        0.700
 0.853         3.403  8000000   100      50     5351        0.800
 0.850         3.084  8000000   100      50     4855        0.900
 0.849         3.044  8000000   100      50     4645        0.950
 0.848         2.927  8000000   100      50     4492        0.990

Candidate:

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 0.923         5.501  8000000   100      50     1471        0.010
 0.980         5.598  8000000   100      50     1286        0.050
 0.986         5.238  8000000   100      50     2001        0.100
 0.981         6.250  8000000   100      50     3158        0.200
 0.984         5.132  8000000   100      50     3742        0.300
 0.978         4.231  8000000   100      50     4354        0.400
 0.965         4.270  8000000   100      50     5018        0.500
 0.967         3.446  8000000   100      50     5689        0.600
 0.960         3.141  8000000   100      50     6265        0.700
 0.946         2.857  8000000   100      50     6766        0.800
 0.904         2.117  8000000   100      50     6898        0.900
 0.872         1.801  8000000   100      50     6954        0.950
 0.866         1.739  8000000   100      50     7153        0.990

@benwtrent (Member Author)

Yeah, with the bugfix in place, this patch is looking like we don't even need it to be configurable. Let's confirm with more datasets @benchaplin

@benchaplin (Contributor)

Pretty cool! Those numbers are looking worlds better.

I have been running some tests with this notion of 'filter correlation to the query' (mikemccand/luceneutil#330). I'll run a benchmark for this new version and post back the results. I think this 'correlation' is important to test, as I imagine many real-world filters involve some correlation, rather than the random filters we get in luceneutil benchmarks.

@benwtrent (Member Author)

I think this 'correlation' is important to test, as I imagine many real-world filters involve some correlation, rather than the random filters we get in luceneutil benchmarks.

I agree, however, random is also generally useful for:

  • Folks indexing multiple client data into the same graph (common for hosted multi-tenant)
  • Filtering by timestamp
  • Any amount of deleted docs (deletes are "filters").

But, I eagerly await your results. I am going to refactor this assuming we just always have it on at a given threshold (I am leaning towards 60% allowed vectors or lower as being the threshold).

@benchaplin (Contributor)

Baseline:

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  visited  selectivity  correlation  filterType
 1.000         9.020  1000000   100     100       16        100    10000         0.01        -1.00  pre-filter
 1.000         9.140  1000000   100     100       16        100    10000         0.01        -0.50  pre-filter
 1.000         9.150  1000000   100     100       16        100    10000         0.01         0.00  pre-filter
 0.898         2.850  1000000   100     100       16        100     6189         0.01         0.50  pre-filter
 0.877         1.690  1000000   100     100       16        100     3543         0.01         1.00  pre-filter
 1.000        43.970  1000000   100     100       16        100    50000         0.05        -1.00  pre-filter
 0.997        43.590  1000000   100     100       16        100    49624         0.05        -0.50  pre-filter
 0.960        22.290  1000000   100     100       16        100    39985         0.05         0.00  pre-filter
 0.899         2.860  1000000   100     100       16        100     6191         0.05         0.50  pre-filter
 0.877         1.650  1000000   100     100       16        100     3543         0.05         1.00  pre-filter
 1.000        83.850  1000000   100     100       16        100   100000         0.10        -1.00  pre-filter
 0.936        18.450  1000000   100     100       16        100    38067         0.10        -0.50  pre-filter
 0.930        10.530  1000000   100     100       16        100    22475         0.10         0.00  pre-filter
 0.896         2.670  1000000   100     100       16        100     6037         0.10         0.50  pre-filter
 0.877         1.690  1000000   100     100       16        100     3543         0.10         1.00  pre-filter
 1.000       202.960  1000000   100     100       16        100   250000         0.25        -1.00  pre-filter
 0.923         7.550  1000000   100     100       16        100    16069         0.25        -0.50  pre-filter
 0.913         5.090  1000000   100     100       16        100    10953         0.25         0.00  pre-filter
 0.897         2.750  1000000   100     100       16        100     5826         0.25         0.50  pre-filter
 0.877         1.720  1000000   100     100       16        100     3543         0.25         1.00  pre-filter
 1.000       376.710  1000000   100     100       16        100   500000         0.50        -1.00  pre-filter
 0.904         3.560  1000000   100     100       16        100     7534         0.50        -0.50  pre-filter
 0.904         2.710  1000000   100     100       16        100     6135         0.50         0.00  pre-filter
 0.894         2.410  1000000   100     100       16        100     5324         0.50         0.50  pre-filter
 0.877         1.500  1000000   100     100       16        100     3543         0.50         1.00  pre-filter
 0.939       431.780  1000000   100     100       16        100   706964         0.75        -1.00  pre-filter
 0.884         2.140  1000000   100     100       16        100     4344         0.75        -0.50  pre-filter
 0.887         1.900  1000000   100     100       16        100     4453         0.75         0.00  pre-filter
 0.889         1.960  1000000   100     100       16        100     4519         0.75         0.50  pre-filter
 0.877         1.650  1000000   100     100       16        100     3543         0.75         1.00  pre-filter

Candidate:

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  visited  selectivity  correlation  filterType
0.881         4.300  1000000   100     100       16        100     8876         0.01        -1.00  pre-filter
0.501         4.130  1000000   100     100       16        100     1920         0.01        -0.50  pre-filter
0.653         3.220  1000000   100     100       16        100     1321         0.01         0.00  pre-filter
0.999         2.890  1000000   100     100       16        100     3239         0.01         0.50  pre-filter
0.976         3.000  1000000   100     100       16        100     4199         0.01         1.00  pre-filter
0.652        13.020  1000000   100     100       16        100    32881         0.05        -1.00  pre-filter
0.876         3.530  1000000   100     100       16        100     2172         0.05        -0.50  pre-filter
0.952         3.170  1000000   100     100       16        100     2560         0.05         0.00  pre-filter
0.997         3.370  1000000   100     100       16        100     6134         0.05         0.50  pre-filter
0.892         2.150  1000000   100     100       16        100     4249         0.05         1.00  pre-filter
0.566        19.190  1000000   100     100       16        100    56443         0.10        -1.00  pre-filter
0.961         3.710  1000000   100     100       16        100     2926         0.10        -0.50  pre-filter
0.981         3.730  1000000   100     100       16        100     4261         0.10         0.00  pre-filter
0.988         3.380  1000000   100     100       16        100     6463         0.10         0.50  pre-filter
0.879         1.760  1000000   100     100       16        100     3851         0.10         1.00  pre-filter
0.380        24.090  1000000   100     100       16        100    93297         0.25        -1.00  pre-filter
0.989         4.450  1000000   100     100       16        100     6232         0.25        -0.50  pre-filter
0.993         4.380  1000000   100     100       16        100     7827         0.25         0.00  pre-filter
0.969         3.480  1000000   100     100       16        100     6274         0.25         0.50  pre-filter
0.877         1.730  1000000   100     100       16        100     3569         0.25         1.00  pre-filter
0.144        13.820  1000000   100     100       16        100    66826         0.50        -1.00  pre-filter
0.950         3.420  1000000   100     100       16        100     5770         0.50        -0.50  pre-filter
0.960         3.330  1000000   100     100       16        100     6392         0.50         0.00  pre-filter
0.928         2.960  1000000   100     100       16        100     5328         0.50         0.50  pre-filter
0.877         1.610  1000000   100     100       16        100     3534         0.50         1.00  pre-filter
0.939       439.810  1000000   100     100       16        100   706964         0.75        -1.00  pre-filter
0.886         2.050  1000000   100     100       16        100     4355         0.75        -0.50  pre-filter
0.885         2.020  1000000   100     100       16        100     4485         0.75         0.00  pre-filter
0.890         2.210  1000000   100     100       16        100     4496         0.75         0.50  pre-filter
0.877         1.490  1000000   100     100       16        100     3543         0.75         1.00  pre-filter

For me, the main story here is that the candidate's advantage weakens as the query becomes more positively correlated with the filter (towards 1.00 correlation), but it never gets worse than the baseline. I think this makes sense: in this case, once we're in the right small world, almost every neighbor will pass the filter, so 'predicate subgraph traversal' equals 'normal total traversal' and the theoretical advantage disappears.

Recall is bad for -1 correlation, but (recall / visited) is the same as the baseline. Also, I'm fairly sure the way I've set up -1 correlation (the filter is exactly the vectors with the worst scores with respect to the query) is not at all realistic, so maybe we can think of those tests as extreme edge-case stress testing.

I agree ~0.5 selectivity seems to be a good cutoff for the new algorithm.

@benwtrent (Member Author)

@benchaplin I found another bug. The recall numbers were indeed way too good to be true. I was returning duplicate documents 🤦. So, recall was great because the result set contained a valid document many times.

I have refactored and fixed multiple things, rerunning locally.

I will try to replicate your correlation findings. Is anything else needed to replicate them beyond your code in the luceneutil PR?

@benchaplin (Contributor)

@benwtrent Yep, everything's in the PR. I ran on 1M docs, 100 queries to keep the benchmark under an hour.

@benwtrent (Member Author)

OK, the current implementation is about as good as I can figure it.

  • We explore beyond neighbors-of-neighbors if we gathered fewer than maxConn/4 vectors to score
  • We explore at most maxConn*maxConn total vectors (see the sketch after this list)
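
A sketch of those two bounds in isolation, reusing the hypothetical Graph interface from the earlier sketch; this is my illustration, not the PR's actual code:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch only: gather filter-matching vectors around one candidate, expanding
// outward while too few matches have been found, capped at maxConn * maxConn.
public class BoundedExpansionSketch {

  static List<Integer> gatherFilteredCandidates(
      FilteredHnswSketch.Graph graph, BitSet filter, int node, int maxConn) {
    int maxExplore = maxConn * maxConn;       // hard cap on vectors examined per candidate
    int minGather = Math.max(1, maxConn / 4); // keep expanding until we have this many matches
    BitSet seen = new BitSet();
    ArrayDeque<Integer> frontier = new ArrayDeque<>();
    List<Integer> gathered = new ArrayList<>();
    frontier.add(node);
    seen.set(node);
    int explored = 0;
    while (!frontier.isEmpty() && explored < maxExplore) {
      int cur = frontier.poll();
      for (int n : graph.neighbors(cur)) {
        if (seen.get(n) || explored >= maxExplore) continue;
        seen.set(n);
        explored++;
        if (filter.get(n)) {
          gathered.add(n); // a vector we will actually score
        }
        // push another hop outward only while we still have too few matches
        if (gathered.size() < minGather) {
          frontier.add(n);
        }
      }
    }
    return gathered;
  }
}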

However, one thing that bothers me is that increasing k doesn't guarantee better results. This indicates to me that we take erroneous paths when the score threshold is low (i.e., when we haven't yet gathered enough results).

1M Cohere, maxConn=16, efConstruction=100
recall  latency(ms)     nDoc  topK  fanout  visited  selectivity
 0.717        1.340  1000000   100       0     1385        0.050
 0.755        1.790  1000000   100      20     1680        0.050
 0.775        1.950  1000000   100      40     1854        0.050
 0.786        2.270  1000000   100      60     2023        0.050
 0.825        2.560  1000000   100      80     2283        0.050
 0.841        3.160  1000000   100     100     2523        0.050
 0.859        4.030  1000000   100     120     2795        0.050
 0.859        3.810  1000000   100     140     2989        0.050
 0.896        4.220  1000000   100     160     3270        0.050
 0.880        4.550  1000000   100     180     3561        0.050
 0.906        4.670  1000000   100     200     3705        0.050
 0.888        4.810  1000000   100     220     3981        0.050
 0.921        4.820  1000000   100     240     4157        0.050
 0.896        5.700  1000000   100     260     4364        0.050
 0.925        6.140  1000000   100     280     4672        0.050
 0.920        5.380  1000000   100     300     4870        0.050
 0.906        7.050  1000000   100     320     5190        0.050
 0.914        7.390  1000000   100     340     5303        0.050
 0.920        7.570  1000000   100     360     5450        0.050
 0.919        7.420  1000000   100     380     5784        0.050
 0.923        7.550  1000000   100     400     5997        0.050
 0.939        8.080  1000000   100     420     6217        0.050
 0.936        7.170  1000000   100     440     6202        0.050
 0.936        8.550  1000000   100     460     6627        0.050
 0.921       11.800  1000000   100     480     6949        0.050
 0.916        9.630  1000000   100     500     6957        0.050

@benwtrent (Member Author)

OK, I checked the current search, and it seems to have the same issue (increasing k doesn't monotonically increase recall).

Linked issue: Look into ACORN-1, or another algorithm to aid in filtered HNSW search