Trigger GC when catch MemoryUsageExceedException on the second retry #3766

Open: LantaoJin wants to merge 3 commits into main

@LantaoJin (Member) commented Jun 12, 2025

Description

When benchmarking merges of thousands of complex indices, the cluster runs out of memory under high concurrency (e.g. 64 threads) combined with a complex scenario (e.g. 1000 indices with depth 15).

In fact, this OOM issue is not caused solely by insufficient heap memory. The primary cause is "lazy" heap GC under high concurrency: memory is not reclaimed in time, so OpenSearchResourceMonitor#isHealthy() returns false and many queries fail.

Changes in this PR:

  1. Holds the memory-intensive schema caches through SoftReference, so the garbage collector can reclaim them first when memory is low.
  2. Makes OpenSearchResourceMonitor retry upon MemoryUsageExceedException, triggering System.gc() on the second retry attempt.
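
To illustrate the first change, here is a minimal sketch of a cache whose values are held through SoftReference. It is not the code from this PR: the class name `SoftSchemaCache`, the loader function, and the string "schema" values are all hypothetical, chosen only to show the pattern of reloading an entry after the collector has cleared it.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical soft-valued cache; the names are illustrative, not taken from the PR. */
final class SoftSchemaCache<K, V> {
  private final Map<K, SoftReference<V>> entries = new ConcurrentHashMap<>();
  private final Function<K, V> loader;

  SoftSchemaCache(Function<K, V> loader) {
    this.loader = Objects.requireNonNull(loader);
  }

  /** Returns the cached value, reloading it if it was never cached or the GC cleared it. */
  V get(K key) {
    SoftReference<V> ref = entries.get(key);
    V value = (ref == null) ? null : ref.get();
    if (value == null) {
      // Two threads may both reload the same key; the last write wins, which is
      // acceptable for a cache but not for anything that needs exactly-once loading.
      value = Objects.requireNonNull(loader.apply(key), "loader returned null");
      entries.put(key, new SoftReference<>(value));
    }
    return value;
  }

  public static void main(String[] args) {
    SoftSchemaCache<String, String> cache =
        new SoftSchemaCache<>(index -> "schema-of-" + index);
    System.out.println(cache.get("logs-0001"));
  }
}
```

The JVM clears softly reachable objects before it throws OutOfMemoryError, so callers must always be prepared for a reload. Any value still strongly referenced elsewhere will not be cleared.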

Benchmarking:

Before: 100 queries in total across 64 concurrent threads, 72 failures, 0 full GCs

Screenshot 2025-06-12 at 15 38 17

After (GC triggered on the first retry): 100 queries in total across 64 concurrent threads, 0 to 5 failures, 23 full GCs

Screenshot 2025-06-12 at 18 59 26

After (GC triggered on the second retry): 100 queries in total across 64 concurrent threads, 0 to 3 failures, 7 full GCs

Screenshot 2025-06-17 at 17 28 10

Related Issues

Resolves #3750

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

  public boolean shouldFail() {
-   return ThreadLocalRandom.current().nextBoolean();
+   return false;
Collaborator

If this always returns false, we should remove the FastFail class and MemoryUsageExceedFastFailureException.

.onRetry(
    event -> {
      if (event.getNumberOfRetryAttempts() == 1) {
        System.gc();
Collaborator

My concern is that triggering System.gc() may impact other OpenSearch operations, such as indexing.

Collaborator

Would the fix still work if we did gc on the second retry?

Since we have exponential backoff, we could probably mitigate a lot of the load by letting some retries play out, and then we won't see 20+ GCs for this query load.

Member Author

> Would the fix still work if we did gc on the second retry?
>
> Since we have exponential backoff, we could probably mitigate a lot of the load by letting some retries play out, and then we won't see 20+ GCs for this query load.

Seems better, reduced to 7 GCs.
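
For readers following the thread, the policy being discussed (exponential backoff, with a GC hint only on the second retry) can be sketched roughly as below. This is a plain-Java stand-in written for illustration, not the PR's code, which registers an onRetry callback as shown in the diff above. The class name `GcOnSecondRetry`, the method `checkWithRetry`, and all parameter values are assumptions.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of "retry with backoff, hint GC on the second retry". */
final class GcOnSecondRetry {
  private GcOnSecondRetry() {}

  /**
   * Runs healthCheck up to maxAttempts times and returns true as soon as it passes.
   * The wait doubles after each failed attempt. gcHook runs once, just before the
   * second retry; in production it would typically be System::gc.
   */
  static boolean checkWithRetry(
      BooleanSupplier healthCheck, int maxAttempts, long initialBackoffMillis, Runnable gcHook) {
    long backoffMillis = initialBackoffMillis;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (healthCheck.getAsBoolean()) {
        return true;
      }
      if (attempt == maxAttempts) {
        break; // no retries left
      }
      // The retry that follows failed attempt N is retry number N.
      if (attempt == 2) {
        gcHook.run();
      }
      try {
        Thread.sleep(backoffMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
      backoffMillis *= 2;
    }
    return false;
  }

  public static void main(String[] args) {
    int[] checks = {0};
    // Simulated monitor that only reports healthy on the third check.
    boolean healthy = checkWithRetry(() -> ++checks[0] >= 3, 4, 10L, System::gc);
    System.out.println("healthy=" + healthy + ", checks=" + checks[0]);
  }
}
```

Letting the first retry pass without a GC gives the backoff a chance to relieve pressure on its own, which matches the reduction in full GCs reported here. Note that System.gc() is only a hint, and the JVM ignores it entirely when started with -XX:+DisableExplicitGC.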

Contributor

As I recall, the JVM clears SoftReferences by default when it is close to OOM. Just curious: what is the behavior without an explicit System.gc() call?

@LantaoJin changed the title from "Trigger GC when catch MemoryUsageExceedException on the first retry" to "Trigger GC when catch MemoryUsageExceedException on the second retry" on Jun 17, 2025
@LantaoJin (Member Author)

In commit dac4878, I pass the ClusterService to ResourceMonitor and trigger System.gc() only on a dedicated coordinator node. I also moved the trigger to the second retry, which reduced the number of GCs from 20+ to 7. @penghuo @Swiddis

Labels
performance Make it fast!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Memory limitation when merging thousands of complex indices
4 participants