
Could we encode postings the way we encode monotonic long doc values? #12477

Open
mikemccand opened this issue Jul 31, 2023 · 6 comments

Comments

@mikemccand
Member

Description

Lucene has an efficient (storage and CPU) compressor for monotonic long values: it simply fits a "best fit" (ish?) linear model to the N monotonic values and then encodes the precise error signal (positive and negative) of each value. I think we use it for doc values in certain cases? Or maybe only for in-memory data structures?

Lucene's block postings format is different: it encodes the docid delta between consecutive documents. Because postings are encoded this way, we have a linear/chained data dependency and must sequentially add up all the docid deltas to recover the true docid of each of the 128 postings in the block.
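To illustrate the chained dependency, here is a hedged sketch (not Lucene's actual `ForUtil` decoder; class and method names are illustrative): recovering absolute docids from a block of deltas is a sequential prefix sum, where each docid depends on the previous one.

```java
// Hedged sketch, not Lucene's real decoder: the prefix sum over deltas
// cannot be trivially parallelized because docIds[i] needs docIds[i-1].
class DeltaDecodeSketch {
  static void decodeDeltas(int[] deltas, int[] docIds, int baseDocId) {
    int doc = baseDocId;
    for (int i = 0; i < deltas.length; i++) {
      doc += deltas[i]; // chained data dependency
      docIds[i] = doc;
    }
  }

  public static void main(String[] args) {
    int[] docIds = new int[3];
    decodeDeltas(new int[] {3, 1, 2}, docIds, 10);
    System.out.println(java.util.Arrays.toString(docIds)); // [13, 14, 16]
  }
}
```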

Could we change postings to instead encode with the linear fit? We'd maybe lose some compression (having to store both negative and positive residuals -- ZInt), but then decoding could be done concurrently with simple SIMD math, and skipping might be able to do a binary search within the block?
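A hedged sketch of the linear-fit idea (not Lucene's monotonic encoder; names and the float slope are illustrative assumptions): predict `doc[i] ≈ base + slope * i` across the block and store only the signed residuals. Each value then decodes independently of its neighbors, which is what makes SIMD decode and in-block binary search possible, at the cost of residuals that can be negative (hence zigzag/ZInt in practice).

```java
// Hedged sketch of linear-fit residual coding; not a real Lucene format.
class LinearFitSketch {
  static long[] encodeResiduals(int[] docIds, long base, float slope) {
    long[] residuals = new long[docIds.length];
    for (int i = 0; i < docIds.length; i++) {
      long predicted = base + (long) (slope * i);
      residuals[i] = docIds[i] - predicted; // signed error signal
    }
    return residuals;
  }

  static int decodeOne(long[] residuals, long base, float slope, int i) {
    // no dependency on neighboring values -- decode in any order
    return (int) (base + (long) (slope * i) + residuals[i]);
  }

  public static void main(String[] args) {
    int[] docIds = {10, 13, 14, 16};
    long[] residuals = encodeResiduals(docIds, 10, 2.0f);
    System.out.println(decodeOne(residuals, 10, 2.0f, 1)); // 13
  }
}
```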

I know this (efficient block int[] encoding for SIMD decode) is a well/heavily studied area in the literature :) I'm sure there is already a good option on Daniel Lemire's blog somewhere!

It's not so simple, though, because we also accumulate the freq of each posting so we can know where we are in the positions/payloads block space.

@mikemccand
Member Author

Note that Tantivy uses binary search to locate the target docid in the block of docs -- somehow Tantivy uses SIMD to decode (docid-delta encoded) postings into absolute docids first, and then binary search within that block.
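For what that in-block skipping could look like, here is a hedged sketch (not Tantivy's or Lucene's actual code): once a block holds absolute docids, advancing to a target is a binary search for the first docid greater than or equal to the target, instead of summing deltas one by one.

```java
// Hedged sketch of in-block advance over absolute docids.
class BlockSearchSketch {
  static int advanceWithinBlock(int[] docIds, int target) {
    int lo = 0, hi = docIds.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docIds[mid] < target) lo = mid + 1;
      else hi = mid;
    }
    return lo; // index of first docid >= target, or docIds.length if none
  }

  public static void main(String[] args) {
    int[] docIds = {13, 14, 16, 20};
    System.out.println(advanceWithinBlock(docIds, 15)); // 2
  }
}
```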

@tang-hi
Contributor

tang-hi commented Jul 31, 2023

I have attempted to encode/decode the postings block using SIMD instructions. However, I believe it may not be the right time to vectorize it, because we are currently unable to generate scalar code that can match the performance of the existing code. You can find more details in this pull request: #12417

@mikemccand
Member Author

This is because we are currently unable to generate scalar code that can match the performance of the existing code

I'm confused by this: don't we already have the scalar code today (our current gen'd FOR implementation that Hotspot autovectorizes well) that we could fallback to? Or is the problem that the Panama APIs don't make it easy for us to determine that we should fallback to our impl?

I'll try to catch up on that long discussion. Thanks for the pointer @tang-hi!

@mikemccand
Member Author

Daniel Lemire's awesome paper "Decoding billions of integers per second through vectorization" should surely be helpful, once we can wrestle Panama into being honest/transparent about the SIMD capabilities of the underlying CPU.

@tang-hi
Contributor

tang-hi commented Aug 1, 2023

don't we already have the scalar code today (our current gen'd FOR implementation that Hotspot autovectorizes well) that we could fallback to?

This is because in the current version, we have implemented some tricks to enable automatic vectorization in the JVM, which makes the existing file format not very friendly for SIMD usage. Therefore, we have two options:

  1. Vectorize the existing format, but it won't result in significant improvement.
  2. Use a new format that is more SIMD-friendly (and can achieve a greater improvement), but performance will significantly decrease when falling back to scalar operations.

@mikemccand
Member Author

Thanks @tang-hi -- I think this is a great place to leverage Lucene's Codec API :) We could create an experimental (NOT the default, no backwards compatibility guarantee, etc.) PostingsFormat that is SIMD optimized, which brave users/apps could use if they know they are on CPUs that support the latest & greatest SIMD instructions, and iterate on that.
