Conversation

Contributor

@mccullocht mccullocht commented Jul 21, 2025

Description

The bulk scoring method accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

  • In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
  • During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was
unable to write an implementation that yielded non-trivial benefits on most processors. There are, however,
other cases that may benefit:

  • Native implementations can use hardware specific features like CPU prefetching, matrix multiplication units, or GPUs.
  • Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.

See #14013
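To make the shape of the API concrete, here is a minimal sketch of the bulk scoring method and its trivial default (score each ordinal one at a time). The interface name and surrounding shape are hypothetical, not the actual Lucene `RandomVectorScorer` source; only the `bulkScore(int[] nodes, float[] scores, int numNodes)` signature is taken from the PR:

```java
import java.io.IOException;

// Hypothetical sketch of the bulk scoring API; the real RandomVectorScorer
// in Lucene has more methods and a different surrounding shape.
interface Scorer {
  float score(int node) throws IOException;

  // Trivial default: score each ordinal one at a time. Implementations can
  // override this to batch, prefetch, or offload the work.
  default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
    for (int i = 0; i < numNodes; i++) {
      scores[i] = score(nodes[i]);
    }
  }
}
```

The point of the default is that callers can adopt the bulk entry point unconditionally, while only specialized implementations (native, off-heap, GPU) need to override it.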

@ChrisHegarty
Contributor

@mccullocht Thanks for putting this together 🙏 I've been noodling a little with this too, trying to find a concrete improvement for Lucene as well as a shape for the bulk scoring API. I added some benchmarks, etc, in this PR #14980, which we can use to evaluate if/how Lucene itself can take advantage of this too.

@mccullocht
Contributor Author

@ChrisHegarty I have an attempt at pushing bulk scoring into Lucene as well in another branch: https://github.com/mccullocht/lucene/tree/bulk-vector-scorer-panama. I pushed the array all the way down rather than trying to dictate the number of vectors scored in bulk, but ultimately scored 4 at a time in the inner loop. I saw a 10-12% improvement on an M2 Mac, but it was no better on Graviton 2 or 3 processors. One theory is that the Apple processors decode instructions much further ahead than other aarch64 processors, so they are able to prefetch/load ahead and feed in data faster.

I want to be able to prove that this interface allows a faster implementation before marking it ready for review, so I may put together a crude/incomplete FlatVectorScorer and sandbox codec that uses a native implementation with prefetching, but omit it from this PR. For larger vectors (1024d+), prefetching vector N+1 while scoring vector N is quite effective in my tests, but there is no way to express this in Java.

@mccullocht
Contributor Author

I ran a macrobenchmark on 1M 3072-dim vectors (~12GB) using angular similarity and k=128 on an M2 Mac:
Baseline (lucene main): 0.005832984s avg latency
Experiment (native RandomVectorAccessor + prefetching): 0.003986186s avg latency

The improvement is about 30%. I think this improvement will be durable on other processors because the prefetching is inlined in the dot product computation, so it does not rely on the CPU decoding instructions far ahead and reordering:

    /// Score `q` against `d`; prefetch the vector for `p` (next to score).
    pub fn score_and_prefetch(&self, q: &[f32], d: &[f32], p: &[f32]) -> f32 {
        match self {
            Self::DotProduct => {
                let dot = unsafe {
                    let mut dot = vdupq_n_f32(0.0);
                    for i in (0..q.len()).step_by(4) {
                        if i % 16 == 0 {
                            _prefetch::<_PREFETCH_READ, _PREFETCH_LOCALITY3>(
                                p.as_ptr().add(i) as *const i8
                            );
                        }
                        let qv = vld1q_f32(q.as_ptr().add(i));
                        let dv = vld1q_f32(d.as_ptr().add(i));
                        dot = vfmaq_f32(dot, qv, dv);
                    }

                    vaddvq_f32(dot)
                };
                0.0f32.max((1.0f32 + dot) / 2.0f32)
            }
            _ => self.score(q, d),
        }
    }

I will try to stand this up on a graviton host and confirm.

@mccullocht mccullocht marked this pull request as ready for review July 24, 2025 16:43
* @param scores output array of scores corresponding to each node.
* @param numNodes number of nodes to score. Must not exceed length of nodes or scores arrays.
*/
default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
Contributor

++ this signature is more flexible than what I've been iterating on so far. I'm going to update my experiments to use something similar.

@ChrisHegarty
Contributor

The high-level goal is to determine the ideal signature to enable optimized bulk scoring implementations, while not hurting existing out-of-the-box performance of pure Lucene.

Additionally, it would be nice to improve the Panama vector scoring in Lucene using this API, but that can be worked on separately (as I've been experimenting with in a separate PR, #14980). Lastly, we should consider bulk scoring for exact search too, so that it is not negatively affected, and it could in fact have an alternative implementation (not that we'd provide one per se, just that one would be possible).

Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@mccullocht
Contributor Author

@ChrisHegarty started using bulk scoring for exhaustive search in 66b9b6f. I had been thinking about whether we should do this based on density, but that's a bit prescriptive; altering behavior based on the density of the ordinals is something implementors may choose to handle themselves.

I had run lucenebench comparing this branch to main and the results were neutral; I will rerun and include the results in the review thread.
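The exhaustive-search change described above buffers up to 64 ordinals and flushes them through the bulk method. A rough sketch of that pattern follows; the class and method names here are illustrative, not the actual Lucene code, and only the 64-element buffer and the trailing partial-flush check are taken from the PR:

```java
import java.io.IOException;

// Sketch of the chunked bulk-scoring pattern used for exhaustive search:
// buffer up to 64 ordinals, then flush them through bulkScore.
class ChunkedScorer {
  interface BulkScorer {
    void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException;
  }

  static final int BUFFER_SIZE = 64;

  // Returns the best score over all ordinals in [0, totalNodes).
  static float scoreAll(BulkScorer scorer, int totalNodes) throws IOException {
    int[] ords = new int[BUFFER_SIZE];
    float[] scores = new float[BUFFER_SIZE];
    float best = Float.NEGATIVE_INFINITY;
    int numOrds = 0;
    for (int ord = 0; ord < totalNodes; ord++) {
      ords[numOrds++] = ord;
      if (numOrds == BUFFER_SIZE) {
        scorer.bulkScore(ords, scores, numOrds);
        for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
        numOrds = 0;
      }
    }
    // Flush the partial tail, mirroring the `if (numOrds > 0)` check in the diff.
    if (numOrds > 0) {
      scorer.bulkScore(ords, scores, numOrds);
      for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
    }
    return best;
  }
}
```

Fixed-size chunking keeps the buffers small and cache-resident while still giving a bulk implementation enough ordinals per call to amortize prefetching or dispatch overhead.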

@mccullocht
Contributor Author

I ran lucenebench on the cohere dataset with an index of 1M float vectors merged to a single segment. I then ran 10k queries 20 times and looked at average latency.

baseline: 3.855ms +/- 0.088
candidate: 3.859ms +/- 0.081

I also modified my test native codec to use RandomVectorScorer.bulkScore without using the native HNSW traversal code and stood it up on a Graviton 2 box. This yielded a ~15% reduction in latency.

@mccullocht mccullocht requested a review from ChrisHegarty July 29, 2025 16:22

@ChrisHegarty
Contributor

This is looking very good. I'm just doing some additional testing and benchmarking before final review.

Additionally, I added a unit test for the bulk scorer that verifies bulk and non-bulk scores are the same. While the default implementation is trivial now, this may not always be the case. Such testing can also be extended to cover custom bulk scorer implementations in one's own repo.

@github-actions github-actions bot added this to the 10.3.0 milestone Jul 30, 2025
@ChrisHegarty
Contributor

I added a note to the 10.3 API section of the change log, since we'll likely backport this.

@ChrisHegarty
Contributor

All my investigations and benchmarks show that this API can be used to improve the performance of vector search, whether that be within Lucene itself or custom scorers (potentially written in other languages). This is now ready to merge.

I'll follow up separately with a concrete proposal for Lucene, which uses this API along with an off-heap scorer and a Panama implementation.

@ChrisHegarty ChrisHegarty merged commit 71d4ad6 into apache:main Jul 31, 2025
8 checks passed
Contributor

@ChrisHegarty ChrisHegarty left a comment

Belated LGTM - REVIEWED.

@mccullocht mccullocht deleted the bulk-vector-scorer branch July 31, 2025 15:32
}

if (numOrds > 0) {
scorer.bulkScore(ords, scores, numOrds);
Member

For bulkScore, I think it would be beneficial for the API to return the max score. This way, collection and iteration can be skipped if the best score isn't competitive.

I realize this "complicates" the incVisitedCount, but I think that can be fixed by pulling knnCollector.incVisitedCount(numOrds) up.
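The suggested variant might look something like the sketch below. This is a hypothetical signature illustrating the reviewer's idea, not the merged API; the method and interface names are invented:

```java
import java.io.IOException;

// Hypothetical variant of the reviewer's suggestion: bulkScore returns the
// max score so the caller can skip collection when it isn't competitive.
interface MaxScoringScorer {
  float score(int node) throws IOException;

  default float bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < numNodes; i++) {
      scores[i] = score(nodes[i]);
      max = Math.max(max, scores[i]);
    }
    return max;
  }
}
```

A caller could then compare the returned max against the collector's minimum competitive score and skip iterating the scores array entirely when nothing in the batch can compete.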


ChrisHegarty pushed a commit that referenced this pull request Aug 6, 2025
This commit adds a bulk scoring API to RandomVectorScorer. The API accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

  • In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
  • During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was not able to write an implementation that yielded non-trivial benefits for most processors, but there are other cases that may benefit:

  • Native implementations can use hardware-specific features like CPU prefetching, matrix multiplication units, or GPUs.
  • Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.
jpountz pushed a commit to shubhamvishu/lucene that referenced this pull request Aug 10, 2025