Conversation

Contributor

@mccullocht mccullocht commented Jul 21, 2025

Description

The bulk scoring method accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

  • In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
  • During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was
unable to write an implementation that yielded non-trivial benefits on most processors. There are, however,
other cases that may benefit:

  • Native implementations can use hardware specific features like CPU prefetching, matrix multiplication units, or GPUs.
  • Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.

See #14013
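To make the shape of the API concrete, here is a minimal sketch of the bulk scoring method and its trivial default (score each ordinal one at a time). The interface name and surrounding shape are hypothetical, not the actual Lucene `RandomVectorScorer` source; only the `bulkScore(int[] nodes, float[] scores, int numNodes)` signature is taken from the PR:

```java
import java.io.IOException;

// Hypothetical sketch of the bulk scoring API; the real RandomVectorScorer
// in Lucene has more methods and a different surrounding shape.
interface Scorer {
  float score(int node) throws IOException;

  // Trivial default: score each ordinal one at a time. Implementations can
  // override this to batch, prefetch, or offload the work.
  default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
    for (int i = 0; i < numNodes; i++) {
      scores[i] = score(nodes[i]);
    }
  }
}
```

The point of the default is that callers can adopt the bulk entry point unconditionally, while only specialized implementations (native, off-heap, GPU) need to override it.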

@ChrisHegarty
Contributor

@mccullocht Thanks for putting this together 🙏 I've been noodling a little with this too, trying to find a concrete improvement for Lucene as well as a shape for the bulk scoring API. I added some benchmarks, etc, in this PR #14980, which we can use to evaluate if/how Lucene itself can take advantage of this too.

@mccullocht
Contributor Author

@ChrisHegarty I have an attempt at pushing bulk scoring into Lucene as well in another branch: https://github.com/mccullocht/lucene/tree/bulk-vector-scorer-panama. I pushed the array all the way down rather than trying to dictate the number of vectors scored in bulk, but ultimately scored 4 at a time in the inner loop. I saw a 10-12% improvement on an M2 Mac, but it was no better on Graviton 2 or 3 processors. One theory is that the Apple processors decode instructions much further ahead than other aarch64 processors, so they are able to prefetch/load ahead and feed in data faster.

I want to be able to prove that this interface allows a faster implementation before marking it ready for review, so I may put together a crude/incomplete FlatVectorScorer and sandbox codec that uses a native implementation with prefetching, but omit it from this PR. For larger vectors (1024d+), prefetching vector N+1 while scoring vector N is quite effective in my tests, but there is no way to express this in Java.

@mccullocht
Contributor Author

I ran a macrobenchmark on 1M 3072-dim vectors (~12GB) using angular similarity and k=128 on an M2 Mac:
Baseline (lucene main): 0.005832984s avg latency
Experiment (native RandomVectorAccessor + prefetching): 0.003986186s avg latency

The improvement is about 30%. I think this improvement will be durable on other processors because the prefetching is inlined in the dot product computation, so it does not rely on the CPU decoding instructions far ahead and reordering:

    /// Score `q` against `d`; prefetch the vector for `p` (next to score).
    pub fn score_and_prefetch(&self, q: &[f32], d: &[f32], p: &[f32]) -> f32 {
        match self {
            Self::DotProduct => {
                let dot = unsafe {
                    let mut dot = vdupq_n_f32(0.0);
                    for i in (0..q.len()).step_by(4) {
                        if i % 16 == 0 {
                            _prefetch::<_PREFETCH_READ, _PREFETCH_LOCALITY3>(
                                p.as_ptr().add(i) as *const i8
                            );
                        }
                        let qv = vld1q_f32(q.as_ptr().add(i));
                        let dv = vld1q_f32(d.as_ptr().add(i));
                        dot = vfmaq_f32(dot, qv, dv);
                    }

                    vaddvq_f32(dot)
                };
                0.0f32.max((1.0f32 + dot) / 2.0f32)
            }
            _ => self.score(q, d),
        }
    }

I will try to stand this up on a graviton host and confirm.

@mccullocht mccullocht marked this pull request as ready for review July 24, 2025 16:43
* @param scores output array of scores corresponding to each node.
* @param numNodes number of nodes to score. Must not exceed length of nodes or scores arrays.
*/
default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
Contributor

++ this signature is more flexible than what I've been iterating on so far. I'm going to update my experiments to use something similar.

@ChrisHegarty
Contributor

The high-level goal is to determine the ideal signature to enable optimized bulk scoring implementations, while not hurting existing out-of-the-box performance of pure Lucene.

Additionally, it would be nice to improve the Panama vector scoring in Lucene using this API, but that can be worked on separately (as I've been experimenting with in a separate PR, #14980). Lastly, we should consider bulk scoring for exact search too, so that it is not negatively affected, and it could in fact have an alternative implementation (not that we'd provide one per se, just that one would be possible).

Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@mccullocht
Contributor Author

@ChrisHegarty started using bulk scoring for exhaustive search in 66b9b6f. I had been thinking about whether we should do this based on density, but that's a bit prescriptive; altering behavior based on the density of the ordinals is something implementors may choose to handle themselves.

I had run lucenebench comparing this branch to main and the results were neutral; I will rerun and include the results in the review thread.
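The exhaustive-search change described above buffers up to 64 ordinals and flushes them through the bulk method. A rough sketch of that pattern follows; the class and method names here are illustrative, not the actual Lucene code, and only the 64-element buffer and the trailing partial-flush check are taken from the PR:

```java
import java.io.IOException;

// Sketch of the chunked bulk-scoring pattern used for exhaustive search:
// buffer up to 64 ordinals, then flush them through bulkScore.
class ChunkedScorer {
  interface BulkScorer {
    void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException;
  }

  static final int BUFFER_SIZE = 64;

  // Returns the best score over all ordinals in [0, totalNodes).
  static float scoreAll(BulkScorer scorer, int totalNodes) throws IOException {
    int[] ords = new int[BUFFER_SIZE];
    float[] scores = new float[BUFFER_SIZE];
    float best = Float.NEGATIVE_INFINITY;
    int numOrds = 0;
    for (int ord = 0; ord < totalNodes; ord++) {
      ords[numOrds++] = ord;
      if (numOrds == BUFFER_SIZE) {
        scorer.bulkScore(ords, scores, numOrds);
        for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
        numOrds = 0;
      }
    }
    // Flush the partial tail, mirroring the `if (numOrds > 0)` check in the diff.
    if (numOrds > 0) {
      scorer.bulkScore(ords, scores, numOrds);
      for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
    }
    return best;
  }
}
```

Fixed-size chunking keeps the buffers small and cache-resident while still giving a bulk implementation enough ordinals per call to amortize prefetching or dispatch overhead.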

@mccullocht
Contributor Author

I ran lucenebench on the cohere dataset with an index of 1M float vectors merged to a single segment. I then ran 10k queries 20 times and looked at average latency.

baseline: 3.855ms +/- 0.088
candidate: 3.859ms +/- 0.081

I also modified my test native codec to use RandomVectorScorer.bulkScore without using the native HNSW traversal code and stood it up on a Graviton 2 box. This yielded a ~15% reduction in latency.

@mccullocht mccullocht requested a review from ChrisHegarty July 29, 2025 16:22

@ChrisHegarty
Contributor

This is looking very good. I'm just doing some additional testing and benchmarking before final review.

Additionally, I added a unit test for the bulk scorer that verifies bulk and non-bulk scores are the same. While the default implementation is trivial now, this may not always be the case. Such testing can also be extended to cover custom bulk scorer implementations in one's own repo.

@github-actions github-actions bot added this to the 10.3.0 milestone Jul 30, 2025
@ChrisHegarty
Contributor

I added a note to the 10.3 API section of the change log, since we'll likely backport this.

@ChrisHegarty
Contributor

All my investigations and benchmarks show that this API can be used to improve the performance of vector search, whether that be within Lucene itself or custom scorers (potentially written in other languages). This is now ready to merge.

I'll follow up separately with a concrete proposal for Lucene, which uses this API along with an off-heap scorer and a Panama implementation.

@ChrisHegarty ChrisHegarty merged commit 71d4ad6 into apache:main Jul 31, 2025
8 checks passed
Contributor

@ChrisHegarty ChrisHegarty left a comment

Belated LGTM - REVIEWED.

@mccullocht mccullocht deleted the bulk-vector-scorer branch July 31, 2025 15:32
}

if (numOrds > 0) {
scorer.bulkScore(ords, scores, numOrds);
Member

For bulkScore, I think it would be beneficial for the API to return the max score. This way, collection and iteration can be skipped if the best score isn't competitive.

I realize this "complicates" the incVisitedCount, but I think that can be fixed by pulling knnCollector.incVisitedCount(numOrds) up.
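The suggested variant might look something like the sketch below. This is a hypothetical signature illustrating the reviewer's idea, not the merged API; the method and interface names are invented:

```java
import java.io.IOException;

// Hypothetical variant of the reviewer's suggestion: bulkScore returns the
// max score so the caller can skip collection when it isn't competitive.
interface MaxScoringScorer {
  float score(int node) throws IOException;

  default float bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < numNodes; i++) {
      scores[i] = score(nodes[i]);
      max = Math.max(max, scores[i]);
    }
    return max;
  }
}
```

A caller could then compare the returned max against the collector's minimum competitive score and skip iterating the scores array entirely when nothing in the batch can compete.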


ChrisHegarty pushed a commit that referenced this pull request Aug 6, 2025
This commit adds a bulk scoring API to RandomVectorScorer. The API accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

  • In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
  • During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was not able to write an implementation that yielded non-trivial benefits for most processors, but there are other cases that may benefit:

  • Native implementations can use hardware-specific features like CPU prefetching, matrix multiplication units, or GPUs.
  • Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.
jpountz pushed a commit to shubhamvishu/lucene that referenced this pull request Aug 10, 2025