Add a bulk scoring interface to RandomVectorScorer #14978
Conversation
@mccullocht Thanks for putting this together 🙏 I've been noodling a little with this too, trying to find a concrete improvement for Lucene as well as a shape for the bulk scoring API. I added some benchmarks, etc., in this PR #14980, which we can use to evaluate if/how Lucene itself can take advantage of this too.
@ChrisHegarty I have an attempt to push bulk scoring into Lucene as well in another branch: https://github.com/mccullocht/lucene/tree/bulk-vector-scorer-panama. I pushed the array all the way down rather than trying to dictate the number of vectors scored in bulk, but ultimately scored 4 at a time in the inner loop. I got a 10-12% improvement on an M2 Mac, but it was no better on Graviton 2 or 3 processors. There's a theory that the Apple processors decode instructions much further ahead than other aarch64 processors, so they are able to prefetch/load ahead and feed in data faster. I want to be able to prove that having this interface allows a faster implementation before marking it ready for review, so I may put together a crude/incomplete …
I ran a macrobenchmark on 1M 3072-dim vectors (~12GB) using angular similarity and k=128 on an M2 Mac: the improvement is about 30%. I think this improvement will be durable on other processors because I inlined prefetching in the dot product computation, so I'm not relying on the CPU to decode instructions far ahead and reorder:

```rust
/// Score `q` against `d`; prefetch the vector for `p` (next to score).
pub fn score_and_prefetch(&self, q: &[f32], d: &[f32], p: &[f32]) -> f32 {
    match self {
        Self::DotProduct => {
            let dot = unsafe {
                let mut dot = vdupq_n_f32(0.0);
                for i in (0..q.len()).step_by(4) {
                    // Every 16 f32s (one 64-byte cache line), prefetch the
                    // corresponding line of the next vector to be scored.
                    if i % 16 == 0 {
                        _prefetch::<_PREFETCH_READ, _PREFETCH_LOCALITY3>(
                            p.as_ptr().add(i) as *const i8,
                        );
                    }
                    let qv = vld1q_f32(q.as_ptr().add(i));
                    let dv = vld1q_f32(d.as_ptr().add(i));
                    dot = vfmaq_f32(dot, qv, dv);
                }
                vaddvq_f32(dot)
            };
            0.0f32.max((1.0f32 + dot) / 2.0f32)
        }
        _ => self.score(q, d),
    }
}
```

I will try to stand this up on a Graviton host and confirm.
```java
 * @param scores output array of scores corresponding to each node.
 * @param numNodes number of nodes to score. Must not exceed length of nodes or scores arrays.
 */
default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
```
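The conversation notes that the default implementation of this method is trivial: it just scores each ordinal one at a time. A minimal self-contained sketch of that shape (the interface and names here are illustrative, not the actual Lucene source):

```java
import java.io.IOException;

// Sketch of a bulk scoring interface whose default method falls back to
// single-vector scoring, as described in the PR. Not the real Lucene class.
interface SketchScorer {
    float score(int node) throws IOException;

    // Default: score each ordinal individually. Optimized implementations
    // (native code, off-heap storage) can override this to batch work.
    default void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException {
        for (int i = 0; i < numNodes; i++) {
            scores[i] = score(nodes[i]);
        }
    }
}

public class BulkScoreSketch {
    public static void main(String[] args) throws IOException {
        SketchScorer s = node -> 1.0f / (1 + node); // toy scoring function
        int[] nodes = {0, 1, 3};
        float[] scores = new float[3];
        s.bulkScore(nodes, scores, 3);
        System.out.println(scores[0] + " " + scores[1] + " " + scores[2]);
    }
}
```

Because the default matches per-node scoring exactly, a unit test can assert bulk and non-bulk scores agree, as the review later mentions.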
++ this signature is more flexible than what I've been iterating on so far. I'm going to update my experiments to use a similar one.

The high-level goal is to determine the ideal signature to enable optimized bulk scoring implementations, while not hurting existing out-of-the-box performance of pure Lucene. Additionally, it would be nice to improve the Panama vector scoring in Lucene using this API, but that can be worked on separately (as I've been experimenting with in a separate PR #14980). Lastly, we should consider bulk scoring for exact search too, so that it is not negatively affected, and could in fact potentially even have an alternative implementation - not that we'd provide one per se, just that such would be possible.
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
@ChrisHegarty started using bulk scoring for exhaustive search in 66b9b6f. I had been thinking about whether or not we should do this based on density, but that's a bit prescriptive, and altering behavior based on the density of the ordinals is something that implementors may choose to deal with. I had run lucenebench comparing this branch to main and the results were neutral; I will rerun and include the results in the review thread.
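The PR description says exhaustive search buffers up to 64 ordinals at a time for bulk scoring. A rough illustrative sketch of that batching pattern (names and the toy scorer are assumptions, not the actual Lucene code):

```java
import java.io.IOException;

// Sketch: score every ordinal in [0, maxOrd) by buffering up to 64 ordinals
// and handing each full batch (plus the tail) to bulkScore.
public class ExhaustiveBulkSearch {
    static final int BUFFER_SIZE = 64; // batch size mentioned in the PR

    interface SketchScorer {
        void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException;
    }

    static float searchAll(SketchScorer scorer, int maxOrd) throws IOException {
        int[] ords = new int[BUFFER_SIZE];
        float[] scores = new float[BUFFER_SIZE];
        float best = Float.NEGATIVE_INFINITY;
        int numOrds = 0;
        for (int ord = 0; ord < maxOrd; ord++) {
            ords[numOrds++] = ord;
            if (numOrds == BUFFER_SIZE) { // full batch: score and drain
                scorer.bulkScore(ords, scores, numOrds);
                for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
                numOrds = 0;
            }
        }
        if (numOrds > 0) { // flush the partial tail batch
            scorer.bulkScore(ords, scores, numOrds);
            for (int i = 0; i < numOrds; i++) best = Math.max(best, scores[i]);
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // Toy scorer: each ordinal's score is half its value.
        SketchScorer s = (nodes, scores, n) -> {
            for (int i = 0; i < n; i++) scores[i] = nodes[i] * 0.5f;
        };
        System.out.println(searchAll(s, 100));
    }
}
```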
I ran lucenebench on the cohere dataset with an index of 1M float vectors merged to a single segment. I then ran 10k queries 20 times and looked at average latency.

baseline: 3.855ms +/- 0.088

I also modified my test native codec to use …
This is looking very good. I'm just doing some additional testing and benchmarking before final review. Additionally, I added a unit test for the bulk scorer that verifies bulk and non-bulk scores are the same. While the default implementation is trivial now, this may not always be the case. Also, such testing can be extended to cover custom bulk scorer implementations in one's own repo.
I added a note to the 10.3 API section of the change log, since we'll likely backport this.
All my investigations and benchmarks show that this API can be used to improve the performance of vector search, whether that be within Lucene itself or in custom scorers (potentially written in other languages). This is now ready to merge. I'll follow up with a concrete proposal for Lucene separately, which uses this API along with an off-heap scorer and a Panama implementation.
Belated LGTM - REVIEWED.
```java
if (numOrds > 0) {
  scorer.bulkScore(ords, scores, numOrds);
}
```
I think it would be beneficial for the API to return `maxScore` for `bulkScore`. This way the collection and iteration can be skipped if the best score isn't competitive. I realize this "complicates" the `incVisitedCount`, but I think that can be fixed by pulling up `knnCollector.incVisitedCount(numOrds)`.
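A minimal sketch of what that suggested variant could look like, assuming a default that tracks the max while scoring (this is a hypothetical signature from the review suggestion, not the merged API):

```java
import java.io.IOException;

// Hypothetical variant of bulkScore that also returns the batch's max score,
// so callers can skip per-node collection when the batch is not competitive.
interface MaxScoringScorer {
    float score(int node) throws IOException;

    default float bulkScoreMax(int[] nodes, float[] scores, int numNodes) throws IOException {
        float max = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < numNodes; i++) {
            scores[i] = score(nodes[i]);
            max = Math.max(max, scores[i]);
        }
        return max; // caller compares this to the collector's competitive floor
    }
}

public class MaxScoreSketch {
    public static void main(String[] args) throws IOException {
        MaxScoringScorer s = node -> node * 0.5f; // toy scoring function
        float[] scores = new float[4];
        float max = s.bulkScoreMax(new int[]{2, 7, 1, 5}, scores, 4);
        System.out.println(max);
    }
}
```

The visited count could then be incremented once per batch up front, as the comment suggests, rather than per collected node.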
This commit adds a bulk scoring API to RandomVectorScorer. The API accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

* In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
* During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was not able to write an implementation that yielded non-trivial benefits for most processors, but there are other cases that may benefit from this:

* Native implementations can use hardware-specific features like CPU prefetching, matrix multiplication units, or GPUs.
* Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.
Description

The bulk scoring method accepts a list of ordinals and fills a list of floats with the scores.

Use bulk scoring in the HNSW codec:

* In HnswGraphSearcher, buffer all the ordinals resulting from visiting the top candidate for bulk scoring.
* During exhaustive search, buffer up to 64 ordinals at a time for bulk scoring.

This change does not provide a Panama implementation of this method for common vector types, as I was not able to write an implementation that yielded non-trivial benefits for most processors, but there are other cases that may benefit from this:

* Native implementations can use hardware-specific features like CPU prefetching, matrix multiplication units, or GPUs.
* Implementations that serve out of storage can prefetch vectors into memory using bulk APIs.
See #14013