
NRT replication should make it possible/easy to use bite-sized commits #14219

Open
mikemccand opened this issue Feb 11, 2025 · 0 comments · May be fixed by #14325

@mikemccand
Member

Description

At Amazon (product search) we use Lucene's awesome near-real-time segment replication to efficiently distribute index changes (through S3) to searchers.

This "index once search many" approach works very well for applications needing high QPS scale. It enables physical isolation of indexing and searching (so e.g. heavy merges cannot affect searching). It enables rapid failover or proactive cluster load-balancing if an indexer crashes or its node is too hot: just promote one of the replicas. And it means you should frontload lots of indexing cost if it make searching even a bit faster.

But recently we've been struggling with too-large checkpoints. We ask the indexer to write a new checkpoint (an IndexWriter commit call) every 60 seconds, and searchers copy down the checkpoint and light the new segments. During periods of heavy index updates ("update storms"), combined with our very aggressive TieredMergePolicy configuration to reclaim deletes, we see big write amplification (bytes copied to replicas vs bytes written for newly indexed documents), sometimes sending many tens of GB of new segments in a single checkpoint.

When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries: copying in the large checkpoint puts RAM pressure on the OS and evicts hot searching pages before any of the new segments are even lit (we suspect the MADV_RANDOM hint for KNN files also exacerbates things for us -- it is good for cold indices, but maybe not mostly-hot ones?). We could maybe tune the OS to more aggressively move dirty pages to disk, or try O_DIRECT when copying the new checkpoint files, but we would still hit page faults when we then light the new segments.

We had an a-ha moment on how to fix this, using APIs Lucene already exposes! We just need to decouple committing from checkpointing/replicating. Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder). Configure a time-based IndexDeletionPolicy so these commit points all live for a long time (say an hour). Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers.
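
As far as I know Lucene doesn't ship a time-based deletion policy out of the box, but it's a tiny `IndexDeletionPolicy` subclass. Here's a minimal sketch, assuming the indexer stamps each commit with a `commitTimeMillis` entry in its commit user data -- the class name and the user-data key are just illustrative:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Illustrative only: keeps every commit point around for maxAgeMillis so replicas
// can step through them; assumes the indexer put a "commitTimeMillis" entry into
// each commit's user data before calling commit().
public class TimeBasedDeletionPolicy extends IndexDeletionPolicy {
  private final long maxAgeMillis;

  public TimeBasedDeletionPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMillis;
    // Never delete the newest commit, no matter how old it is.
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      String timestamp = commit.getUserData().get("commitTimeMillis");
      if (timestamp != null && Long.parseLong(timestamp) < cutoff) {
        commit.delete();
      }
    }
  }
}
```

The indexer would set this via `IndexWriterConfig.setIndexDeletionPolicy(new TimeBasedDeletionPolicy(60 * 60 * 1000L))` and call `IndexWriter.setLiveCommitData(...)` with the current time before each (frequent) commit.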

The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB turnover in the segments, meaning the OS will only ever need "hot pages plus N" GB of working RAM. This leans nicely on Lucene's strongly transactional APIs, and I think it's largely sugar / utility classes that we'd need to add to the NRT replicator to demonstrate this approach, maybe.
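
For the replica side, here's a rough sketch (not an actual NRT replicator API -- just plain DirectoryReader plumbing, with illustrative names like `maxTurnoverBytes`) of stepping through the replicated commit points, always lighting the newest commit whose incremental turnover vs. the currently lit files fits in the budget:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

// Illustrative only: walks the replicated commit points as stepping stones, so each
// reopen turns over roughly maxTurnoverBytes of new segment files at most (a real
// replica would use SearcherManager / reference counting instead of closing readers).
public final class SteppingStoneCatchUp {

  private SteppingStoneCatchUp() {}

  // Bytes of files referenced by this commit that the currently lit commit does not reference.
  static long turnoverBytes(Directory dir, IndexCommit commit, Set<String> litFiles) throws IOException {
    long bytes = 0;
    for (String file : commit.getFileNames()) {
      if (!litFiles.contains(file)) {
        bytes += dir.fileLength(file);
      }
    }
    return bytes;
  }

  public static DirectoryReader catchUp(Directory dir, DirectoryReader current, long maxTurnoverBytes)
      throws IOException {
    DirectoryReader reader = current;
    while (true) {
      List<IndexCommit> commits = DirectoryReader.listCommits(dir); // sorted oldest first
      long litGen = reader.getIndexCommit().getGeneration();
      if (litGen >= commits.get(commits.size() - 1).getGeneration()) {
        return reader; // caught up to the newest commit
      }
      Set<String> litFiles = new HashSet<>(reader.getIndexCommit().getFileNames());
      // Pick the newest commit within the turnover budget; always advance by at least
      // one commit so a single oversized commit (e.g. a huge merge) cannot stall us.
      IndexCommit target = null;
      for (IndexCommit commit : commits) {
        if (commit.getGeneration() <= litGen) {
          continue;
        }
        if (target == null || turnoverBytes(dir, commit, litFiles) <= maxTurnoverBytes) {
          target = commit;
        }
      }
      DirectoryReader next = DirectoryReader.openIfChanged(reader, target);
      if (next == null) {
        return reader; // nothing new to open
      }
      reader.close();
      reader = next;
    }
  }
}
```

Each `DirectoryReader.openIfChanged` step should reuse the already-open SegmentReaders for unchanged segments, so lighting a stepping-stone commit only touches the newly turned-over files.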
