-
Notifications
You must be signed in to change notification settings - Fork 158
Trigger GC when catch MemoryUsageExceedException on the second retry #3766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Lantao Jin <[email protected]>
public boolean shouldFail() { | ||
return ThreadLocalRandom.current().nextBoolean(); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If always return false, we should remove FastFail class and MemoryUsageExceedFastFailureException.
.onRetry( | ||
event -> { | ||
if (event.getNumberOfRetryAttempts() == 1) { | ||
System.gc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that triggering System.gc() may impact other OpenSearch operations, such as indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the fix still work if we did gc on the second retry?
Since we have exponential backoff, we could probably mitigate a lot of the load by letting some retries play out, and then we won't see 20+ GCs for this query load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the fix still work if we did gc on the second retry?
Since we have exponential backoff, we could probably mitigate a lot of the load by letting some retries play out, and then we won't see 20+ GCs for this query load.
Seems better, reduced to 7 GCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember SoftReference has default GC behavior in JVM when JVM is close to OOM. Just curious what's the behavior of not explicitly calling System.gc()?
Description
When we do the benchmark of merging thousands of complex indices: when the cluster meets higher concurrency (e.g. 64) and complex scenario (e.g. 1000 indices with 15 depth), it will throw out of memory.
In fact, this OOM issue is not solely caused by insufficient heap memory. The primary reason is "lazy" heap memory GC under high concurrency, leading to numerous query failures because OpenSearchResourceMonitor#isHealthy() returned true.
Changes in this PR:
SoftReference
, allowing the cache to be prioritized for garbage collection when memory is low.OpenSearchResourceMonitor
retry uponMemoryUsageExceedException
, triggeringSystem.gc()
on the second retry attempt.Benchmarking:
before: total 100 queries in 64 concurrent threads, 72 fails, 0 FullGC
after: total 100 queries in 64 concurrent threads, 0 ~ 5 fails, 23 FullGC (Trigger GC on the first retry)
after: total 100 queries in 64 concurrent threads, 0 ~ 3 fails, 7 FullGC (Trigger GC on the second retry)
Related Issues
Resolves #3750
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.