Trigger GC when catch MemoryUsageExceedException on the second retry #3766

LantaoJin · 2025-06-12T10:54:21Z

Description

While benchmarking merges of thousands of complex indices, we found that under higher concurrency (e.g. 64) and in complex scenarios (e.g. 1000 indices with a depth of 15), the cluster throws an out-of-memory error.

In fact, this OOM issue is not solely caused by insufficient heap memory. The primary cause is "lazy" heap GC under high concurrency, which leads to numerous query failures because OpenSearchResourceMonitor#isHealthy() returned false.

Changes in this PR:

Replaces the memory-intensive schema caches with SoftReference, allowing the cache to be prioritized for garbage collection when memory is low.
Makes OpenSearchResourceMonitor retry upon MemoryUsageExceedException, triggering System.gc() on the second retry attempt.
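
To illustrate the first change, here is a minimal sketch of a cache whose values are held through SoftReference. The class name SoftValueCache and its get/size methods are hypothetical and are not the classes touched by this PR; the sketch only shows the general pattern of letting the JVM reclaim cached values under memory pressure and reloading them on the next access.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not the cache class changed in this PR. */
final class SoftValueCache<K, V> {
  // Values are softly reachable, so the JVM may clear them when heap is tight.
  private final Map<K, SoftReference<V>> entries = new ConcurrentHashMap<>();

  /** Returns the cached value, reloading it if it is absent or was already collected. */
  V get(K key, Function<K, V> loader) {
    SoftReference<V> ref = entries.get(key);
    V value = (ref == null) ? null : ref.get();
    if (value == null) {
      // Cache miss, or the referent was cleared by the garbage collector.
      value = loader.apply(key);
      entries.put(key, new SoftReference<>(value));
    }
    return value;
  }

  /** Number of keys currently tracked (some referents may already be cleared). */
  int size() {
    return entries.size();
  }
}
```

A caller would use something like cache.get(indexName, name -> loadSchema(name)), where loadSchema is a placeholder for whatever expensive lookup the cache protects. Two concurrent misses may both invoke the loader, which is acceptable for a sketch but worth noting.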

Benchmarking:

before: total 100 queries in 64 concurrent threads, 72 fails, 0 FullGC

after: total 100 queries in 64 concurrent threads, 0 ~ 5 fails, 23 FullGC (Trigger GC on the first retry)

after: total 100 queries in 64 concurrent threads, 0 ~ 3 fails, 7 FullGC (Trigger GC on the second retry)

Related Issues

Resolves #3750

Check List

New functionality includes testing.
New functionality has been documented.
New functionality has javadoc added.
New functionality has a user manual doc added.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <[email protected]>

penghuo · 2025-06-12T15:36:33Z

opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchMemoryHealthy.java

    public boolean shouldFail() {
-      return ThreadLocalRandom.current().nextBoolean();
+      return false;


If this always returns false, we should remove the FastFail class and MemoryUsageExceedFastFailureException.

penghuo · 2025-06-12T15:41:07Z

opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchResourceMonitor.java

+        .onRetry(
+            event -> {
+              if (event.getNumberOfRetryAttempts() == 1) {
+                System.gc();


My concern is that triggering System.gc() may impact other OpenSearch operations, such as indexing.

Swiddis · Jun 12, 2025

Would the fix still work if we did gc on the second retry?

Since we have exponential backoff, we could probably mitigate a lot of the load by letting some retries play out, and then we won't see 20+ GCs for this query load.

LantaoJin · Jun 17, 2025

Seems better, reduced to 7 GCs.

songkant-aws · Jun 23, 2025

As I recall, the JVM clears SoftReferences by default when it is close to OOM. Just curious: what is the behavior if we don't explicitly call System.gc()?

LantaoJin · 2025-06-17T09:33:11Z

In commit dac4878, I pass the ClusterService to the ResourceMonitor and trigger System.gc() only on a dedicated coordinator node. I also moved the trigger to the second retry, which reduced the number of GCs from 20+ to 7. @penghuo @Swiddis
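
The behavior described in this comment can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the PR's code: the PR hooks into a retry library's onRetry callback, whereas this sketch uses a hand-written loop. The class GcOnSecondRetry, the dedicatedCoordinator flag (assumed to be derived from cluster state) and the gcHook parameter are all hypothetical names.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: retry a memory check with exponential backoff, GC on the second retry. */
final class GcOnSecondRetry {
  private final int maxAttempts;
  private final long initialBackoffMillis;
  private final boolean dedicatedCoordinator; // assumption: computed elsewhere from cluster state
  private final Runnable gcHook;              // System::gc in production, a counter in tests

  GcOnSecondRetry(
      int maxAttempts, long initialBackoffMillis, boolean dedicatedCoordinator, Runnable gcHook) {
    this.maxAttempts = maxAttempts;
    this.initialBackoffMillis = initialBackoffMillis;
    this.dedicatedCoordinator = dedicatedCoordinator;
    this.gcHook = gcHook;
  }

  /** Returns true once the check passes, false if every attempt fails or the thread is interrupted. */
  boolean isHealthy(BooleanSupplier memoryCheck) {
    long backoff = initialBackoffMillis;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (memoryCheck.getAsBoolean()) {
        return true;
      }
      if (attempt == maxAttempts) {
        break; // no retry follows the last attempt
      }
      // The retry that follows failed attempt N is retry number N.
      if (attempt == 2 && dedicatedCoordinator) {
        gcHook.run(); // let the first retry play out before forcing a GC
      }
      try {
        Thread.sleep(backoff);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
      backoff *= 2; // exponential backoff
    }
    return false;
  }
}
```

In production the hook would be System::gc, for example new GcOnSecondRetry(4, 100L, isDedicatedCoordinator, System::gc). Passing the hook as a Runnable keeps the sketch testable without actually forcing a collection.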

Trigger GC when catch MemoryUsageExceedException on the first retry

e6d7ca6

Signed-off-by: Lantao Jin <[email protected]>

LantaoJin marked this pull request as ready for review June 12, 2025 11:01

LantaoJin requested review from ps48, kavithacm, derek-ho, joshuali925, dai-chen, YANG-DB, mengweieric, Swiddis, penghuo, seankao-az, MaxKsyunz, Yury-Fridlyand, anirudha, forestmvey, acarbonetto, GumpacG, ykmr1224, noCharger and qianheng-aws as code owners June 12, 2025 11:01

LantaoJin added the performance label Jun 12, 2025

penghuo reviewed Jun 12, 2025

LantaoJin added 2 commits June 17, 2025 17:18

Merge remote-tracking branch 'upstream/main' into issues/3750

18639ad

Trigger GC only in dedicated coordinator

dac4878

Signed-off-by: Lantao Jin <[email protected]>

LantaoJin changed the title ~~Trigger GC when catch MemoryUsageExceedException on the first retry~~ Trigger GC when catch MemoryUsageExceedException on the second retry Jun 17, 2025
