This "index once search many" approach works very well for applications needing high QPS scale. It enables physical isolation of indexing and searching (so e.g. heavy merges cannot affect searching). It enables rapid failover or proactive cluster load-balancing if an indexer crashes or its node is too hot: just promote one of the replicas. And it means you should frontload lots of indexing cost if it make searching even a bit faster.
But recently we've been struggling with too-large checkpoints. We ask the indexer to write a new checkpoint (an IndexWriter commit call) every 60 seconds, and searchers copy down the checkpoint and light the new segments. During periods of heavy index updates ("update storms"), combined with our very aggressive TieredMergePolicy configuration to reclaim deletes, we see a big write amplification (bytes copied to replicas vs bytes written for newly indexed documents), sometimes sending many 10s of GB new segments in a single checkpoint.
When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries (we suspect the MADV_RANDOM hint for KNN files to also exacerbate things for us -- this is good for cold indices, but maybe not mostly hot ones?) since the hot searching pages evicted by copying in the large checkpoint before any of the new segments are lit puts RAM pressure on the OS. We could maybe tune the OS to more aggressively move dirty pages to disk? Or maybe try O_DIRECT when copying the new checkpoint files. But still when we then light the new segments, we'll hit page faults then too.
We had an a-ha moment on how to fix this, using APIs Lucene already exposes! We just need to decouple committing from checkpointing/replicating. Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB segments turnover", though this is harder). Configure a time-based IndexDeletionPolicy so these commit points all live for a long time (say an hour). Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers.
The searchers can then carefully pick and choose which commit points they want to switch too, in a bite sized / stepping stone manner, ensuring that each commit point they light has < N GB turnover in the segments, meaning the OS will only ever need "hot-pages plus N" GB of working RAM. This leans nicely on Lucene's strongly transactional APIs, and I think it's largely sugar / utility classes in NRT replicator that we'd need to add to demonstrate this approach, maybe.
The text was updated successfully, but these errors were encountered:
Description
At Amazon (product search) we use Lucene's awesome near-real-time segment replication to efficiently distribute index changes (through S3) to searchers.
This "index once search many" approach works very well for applications needing high QPS scale. It enables physical isolation of indexing and searching (so e.g. heavy merges cannot affect searching). It enables rapid failover or proactive cluster load-balancing if an indexer crashes or its node is too hot: just promote one of the replicas. And it means you should frontload lots of indexing cost if it make searching even a bit faster.
But recently we've been struggling with too-large checkpoints. We ask the indexer to write a new checkpoint (an `IndexWriter` commit call) every 60 seconds, and searchers copy down the checkpoint and light the new segments. During periods of heavy index updates ("update storms"), combined with our very aggressive `TieredMergePolicy` configuration to reclaim deletes, we see big write amplification (bytes copied to replicas vs bytes written for newly indexed documents), sometimes shipping many tens of GB of new segments in a single checkpoint.
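For context, here is a rough sketch of the kind of merge-policy tuning meant by "very aggressive ... to reclaim deletes". The numbers and class name are illustrative only (not our production settings), and the permitted range for `setDeletesPctAllowed` depends on the Lucene version:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

/** Illustrative only: a merge policy tuned to reclaim deleted docs eagerly. */
public final class AggressiveDeletesReclaimConfig {

  public static IndexWriterConfig newIndexWriterConfig() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Lower values force more merging so deleted documents are purged sooner;
    // this is what amplifies bytes merged (and thus checkpoint size) during
    // update storms.  The allowed floor and the default vary by Lucene version.
    tmp.setDeletesPctAllowed(20.0);
    // Reclaim nearly all deletes when forceMergeDeletes() is called explicitly.
    tmp.setForceMergeDeletesPctAllowed(5.0);

    IndexWriterConfig iwc = new IndexWriterConfig(); // default analyzer
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}
```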
When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries: copying in the large checkpoint, before any of the new segments are lit, puts RAM pressure on the OS and evicts hot searching pages. (We suspect the `MADV_RANDOM` hint for KNN files also exacerbates things for us -- it is good for cold indices, but maybe not for mostly-hot ones?) We could maybe tune the OS to write dirty pages back to disk more aggressively, or try `O_DIRECT` when copying the new checkpoint files. But even then, when we light the new segments, we'll hit page faults too.
We had an a-ha moment on how to fix this, using APIs Lucene already exposes! We just need to decouple committing from checkpointing/replicating. Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder). Configure a time-based `IndexDeletionPolicy` so these commit points all live for a long time (say an hour). Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by those commit points) out to searchers.
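As far as I know Lucene doesn't ship a time-based `IndexDeletionPolicy`, but one is simple to sketch. The class below is hypothetical: it assumes the indexer stamps each commit with wall-clock time in the commit user data (via `IndexWriter#setLiveCommitData`) under a made-up key, and it keeps any commit younger than a configured age, always preserving the latest commit:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

/**
 * Hypothetical time-based deletion policy: keeps every commit point whose
 * recorded timestamp is younger than maxAgeMillis so the replicator can ship
 * them and searchers can step through them.  Assumes the indexer records the
 * commit time in the commit user data under the (made-up) key "commitTimeMillis".
 */
public class KeepRecentCommitsDeletionPolicy extends IndexDeletionPolicy {

  private final long maxAgeMillis;

  public KeepRecentCommitsDeletionPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long now = System.currentTimeMillis();
    // Commits are sorted oldest to newest; never delete the newest one.
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      Map<String, String> userData = commit.getUserData();
      String timestamp = userData.get("commitTimeMillis");
      if (timestamp == null || now - Long.parseLong(timestamp) > maxAgeMillis) {
        commit.delete();
      }
    }
  }
}
```

On the indexer side the cadence would then be roughly: set `commitTimeMillis` in the live commit data and call `IndexWriter#commit()` once per second, with `new KeepRecentCommitsDeletionPolicy(60 * 60 * 1000L)` passed to `IndexWriterConfig#setIndexDeletionPolicy` so about an hour of commit points stays alive for replication.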
The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB of turnover in its segments, meaning the OS will only ever need "hot pages plus N" GB of working RAM. This leans nicely on Lucene's strongly transactional APIs, and I think it's largely sugar / utility classes in the NRT replicator that we'd need to add to demonstrate this approach, maybe.
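Here is a rough, hypothetical sketch of the searcher side of that stepping-stone dance, using only existing commit APIs (`DirectoryReader#listCommits`, `DirectoryReader#openIfChanged`): walk the replicated commit points in order and only light the next one whose newly referenced bytes, relative to the commit currently open, fit under the turnover budget. The class and field names are made up, and production code would layer this under `SearcherManager` / proper reference counting instead of closing readers directly:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

/** Hypothetical: advances the open reader one small commit point at a time. */
public class SteppingStoneRefresher {

  private final long maxTurnoverBytes; // the "N GB" budget
  private DirectoryReader currentReader;

  public SteppingStoneRefresher(DirectoryReader initialReader, long maxTurnoverBytes) {
    this.currentReader = initialReader;
    this.maxTurnoverBytes = maxTurnoverBytes;
  }

  /** Lights the next commit point whose new-segment turnover fits the budget, if any. */
  public synchronized DirectoryReader maybeAdvance() throws IOException {
    Directory dir = currentReader.directory();
    long currentGen = currentReader.getIndexCommit().getGeneration();
    Set<String> currentFiles = new HashSet<>(currentReader.getIndexCommit().getFileNames());

    // listCommits returns commit points sorted oldest to newest.
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    for (IndexCommit commit : commits) {
      if (commit.getGeneration() <= currentGen) {
        continue; // already at or past this commit
      }
      // Turnover = bytes of files referenced by this commit but not by ours.
      long turnover = 0;
      for (String file : commit.getFileNames()) {
        if (currentFiles.contains(file) == false) {
          turnover += dir.fileLength(file);
        }
      }
      if (turnover <= maxTurnoverBytes) {
        DirectoryReader next = DirectoryReader.openIfChanged(currentReader, commit);
        if (next != null) {
          DirectoryReader old = currentReader;
          currentReader = next;
          old.close();
        }
        return currentReader; // light one stepping stone per call
      }
    }
    // No commit small enough yet (e.g. one giant merge); real code needs a fallback.
    return currentReader;
  }
}
```

A replica could call `maybeAdvance()` in a loop until it catches up to the newest commit, so a big update storm gets lit as a series of small steps rather than one giant swap of the page cache.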