[GR-67169] Support JFR emergency dumps on out of memory #11530

roberttoyonaga · 2025-07-02T14:15:08Z

Related issue: #10600

Summary

One of JFR's primary goals is to provide insight in the event of a crash such as an OutOfMemoryError (OOME), and JFR data can be useful when investigating one. For example, JFR's CPU and allocation profiling can help locate problem areas, and its garbage collection events and thread data can also help with diagnosis.

Currently, it's possible to receive heap dumps on out of memory (OOM) but this is not yet possible for JFR. OpenJDK has this feature and we should try to implement it in Native Image too.

Goals

Add support for emergency dumps on OOME
Add support for the jdk.DumpReason event
Re-work the existing JFR code to make flushing completely allocation free.

Non-Goals

Add support for jdk.Shutdown Events (will be done in a follow up PR).
Perform JFR emergency dumps on VM crashes. Focus on OOME for now.
Change the existing JFR infrastructure other than to make more parts allocation free.

Details

This PR can be broken into two main parts: (1) making JFR flushing allocation free and (2) creating the emergency dump file.

(1) Making JFR flushing allocation free

Many small changes had to be made to make the JFR flushing procedure allocation free:

JfrChunkFileWriter#writeString adapted to use native memory
JfrSerializer classes pre-initialize a small amount of data while in hosted mode.
for loops needed to be changed from for (Object name : names) format to for (int i=0; i<names.length; i++) format
Some visitor patterns and lambdas were replaced.
The SecondsNanos class was made into a RawStructure so it could be allocated on the stack.

Larger changes to the JfrTypeRepository and JfrSymbolRepository were also required. The general procedure used by the JfrTypeRepository remains the same, but we cannot use Package, Module, and Classloader classes directly because their methods may allocate and they are not pinned objects referenceable from AbstractUninterruptibleHashtable. To work around this, I've made JfrTypeInfo RawStructures corresponding to each of these Java classes (PackageInfoRaw, ModuleInfoRaw etc.). Some type data such as package names must be manually computed to avoid allocation (see setPackageNameAndLength). In some cases, serialization of symbols to native memory buffers must happen earlier (in JfrTypeRepository instead of JfrSymbolRepository) in order to avoid allocating new Java Strings. The JfrSymbolRepository has been modified accordingly to cache pointers to serialized data rather than String objects. The regular Java hash maps have been replaced with new implementations of AbstractUninterruptibleHashtable as well.

One large obstacle was that JfrTypeRepository#collectTypeInfo originally needed to walk the image heap and allocate a list of loaded classes. That process is not easy to make allocation free. To work around this, I experimented with pre-allocating the loaded class list at start-up but found that this negatively affected startup times. My solution was to make the JfrTypeRepository function more similarly to the other JFR repositories in SubstrateVM by maintaining previous/current epochData. Specifically, during event emission, JfrTypeRepository#getClassId now caches the class constant data used by events. Types used by JFR are stored in previous/current epoch data hash tables. This uses some more memory than the old approach, but at least it avoids allocation and is consistent with the other JFR repositories in SubstrateVM. This is a lazy approach, so it avoids the startup penalty of pre-allocation.
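To illustrate the previous/current epoch idea, here is a minimal sketch. It is not the code in this PR: the class name EpochTypeCache and its methods are hypothetical, and it uses ordinary Java collections (which allocate) purely to show the control flow, whereas the PR uses uninterruptible hash tables backed by native memory.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: a two-epoch cache of class names tagged during event emission. */
final class EpochTypeCache {
    // Types tagged since the last rotation.
    private Set<String> currentEpoch = new LinkedHashSet<>();
    // Empty spare table that becomes "current" at the next rotation.
    private Set<String> previousEpoch = new LinkedHashSet<>();

    /** Called when an event references a class: record it lazily in the current epoch. */
    synchronized boolean tag(String className) {
        // Returns true only the first time the class is seen in this epoch.
        return currentEpoch.add(className);
    }

    /** Called at chunk rotation: swap the epochs and return what the finished epoch collected. */
    synchronized Set<String> rotate() {
        Set<String> finished = currentEpoch;
        currentEpoch = previousEpoch; // already empty, reused as the new current table
        previousEpoch = finished;
        Set<String> drained = new LinkedHashSet<>(finished);
        finished.clear(); // ready for reuse at the following rotation
        return drained;
    }
}
```

Only classes that events actually reference end up in a table, which is why this lazy scheme needs no up-front walk over all loaded classes.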

A small bug in JfrTypeRepository was fixed. The bootstrap classloader was originally not being serialized to chunks. Hotspot gives this classloader the reserved ID of 0 and serializes it if it was tagged during the epoch.

(2) Creating the emergency dump file

New classes implementing this support: JfrEmergencyDumpSupport, JfrEmergencyDumpFeature, PosixJfrEmergencyDumpSupport. I have tried to keep the components and logic as similar as possible to the Hotspot class JfrEmergencyDump found in jfrEmergencyDump.cpp.

After the emergency dump flush has completed, the JFR disk repository directory is scanned. The names of chunk files are gathered and sorted (which also implicitly sorts them chronologically). Each chunk file in the sorted list is then copied to the emergency dump snapshot.
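The scan, sort, and copy steps can be sketched as follows. This is only an illustration using java.nio on the Java heap, which the real implementation has to avoid. The class and method names are hypothetical, and the ".jfr" suffix filter and timestamp-style file names are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** Illustrative only: concatenates repository chunk files into one dump file. */
final class EmergencyDumpSketch {
    /**
     * Copies every *.jfr file found in repositoryDir into dumpFile in sorted name order.
     * dumpFile is assumed to live outside repositoryDir so it is not picked up by the scan.
     */
    static List<String> writeDump(Path repositoryDir, Path dumpFile) throws IOException {
        List<String> chunkNames = new ArrayList<>();
        try (Stream<Path> entries = Files.list(repositoryDir)) {
            entries.map(p -> p.getFileName().toString())
                   .filter(name -> name.endsWith(".jfr"))
                   .forEach(chunkNames::add);
        }
        // Lexicographic order equals chronological order if names are zero-padded timestamps.
        Collections.sort(chunkNames);
        try (OutputStream out = Files.newOutputStream(dumpFile)) {
            for (String name : chunkNames) {
                Files.copy(repositoryDir.resolve(name), out);
            }
        }
        return chunkNames;
    }
}
```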

A lot of the work in PosixJfrEmergencyDumpSupport involves handling/creating filenames as C strings. Similar to Hotspot JFR, a pre-allocated native memory path buffer is used as a temporary place to construct filenames and paths.
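As a rough picture of the path buffer idea, the sketch below builds a NUL-terminated path inside a fixed-size buffer that is allocated once and then reused. It is an assumption-laden stand-in: PathBufferSketch is a hypothetical name, and a Java byte[] takes the place of the native memory used in the PR.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a reusable fixed-capacity buffer holding a NUL-terminated ASCII path. */
final class PathBufferSketch {
    private final byte[] buffer; // allocated once up front, never resized
    private int length;          // number of bytes before the terminating NUL

    PathBufferSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must leave room for the NUL byte");
        }
        buffer = new byte[capacity];
    }

    /** Empties the buffer so it can be reused for the next path. */
    void reset() {
        length = 0;
        buffer[0] = 0;
    }

    /** Appends ASCII text; returns false and leaves the buffer unchanged if it would not fit. */
    boolean append(String ascii) {
        int n = ascii.length();
        if (length + n + 1 > buffer.length) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            buffer[length + i] = (byte) ascii.charAt(i);
        }
        length += n;
        buffer[length] = 0; // keep the C-string terminator in place
        return true;
    }

    /** Debug helper only; the real code would pass the buffer pointer to C functions instead. */
    String asString() {
        return new String(buffer, 0, length, StandardCharsets.US_ASCII);
    }
}
```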

Hotspot JFR uses quicksort to handle sorting chunk filenames. In SubstrateVM, a Java quick sort implementation has been added to GrowableWordArrayAccess to sort chunk files while avoiding using the Java heap.
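For readers unfamiliar with the approach, an in-place quicksort over word-sized values might look roughly like the sketch below. It is not the GrowableWordArrayAccess code from this PR: a long[] stands in for the native word array, and WordQuickSort and WordComparator are hypothetical names. The sort itself only swaps elements in place and uses the call stack, so it creates no objects.

```java
/** Illustrative only: in-place quicksort over word-sized values. */
final class WordQuickSort {
    /** Compares two words; a stand-in for comparing the C strings two pointers refer to. */
    interface WordComparator {
        int compare(long a, long b);
    }

    /** Sorts the first count elements of words in place. */
    static void sort(long[] words, int count, WordComparator cmp) {
        quickSort(words, 0, count - 1, cmp);
    }

    private static void quickSort(long[] a, int lo, int hi, WordComparator cmp) {
        while (lo < hi) {
            int p = partition(a, lo, hi, cmp);
            // Recurse into the smaller side and loop on the larger one to bound stack depth.
            if (p - lo < hi - p) {
                quickSort(a, lo, p - 1, cmp);
                lo = p + 1;
            } else {
                quickSort(a, p + 1, hi, cmp);
                hi = p - 1;
            }
        }
    }

    /** Lomuto partition using the middle element as the pivot. */
    private static int partition(long[] a, int lo, int hi, WordComparator cmp) {
        int mid = lo + (hi - lo) / 2;
        swap(a, mid, hi);
        long pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (cmp.compare(a[i], pivot) < 0) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(long[] a, int i, int j) {
        long t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```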


zapster · 2025-07-24T11:53:20Z

@roberttoyonaga I started reviewing (or rather playing around with) this PR. One thing that I noticed is that in the OOM case, the dumps do not contain stack traces. Is that something that is expected?

In comparison, running the same example on the base commit, I see them:

roberttoyonaga · 2025-07-24T19:12:50Z

I started reviewing (or rather playing around with) this PR. One thing that I noticed is that in the OOM case, the dumps do not contain stack traces. Is that something that is expected?

Hi @zapster, yes that's a good point. Right now, only a modified JFR flush happens at the start of the emergency dump (not a full safepoint + chunk rotation). There are 2 problems that prevent us from being able to write out stacktraces on flushes:

Races on the JfrStacktraceRepository epoch data buffer.
Cannot safely process thread-local active buffers of other threads without holding the global threads mutex / safepoint

The same problems prevent the JFR Event Streaming feature from writing stacktraces on flushes. I assumed it would be problematic to enqueue a safepoint when the VM is trying to shut down/crash, which is why I'm avoiding a full-fledged chunk rotation during the emergency dump. Although maybe I am wrong about this and safepoints would be fine?

After thinking about it more, the races (on the JfrStacktraceRepository epoch data buffer) might be inconsequential since we are shutting down the VM and ending JFR. One question: Is it safe to acquire the global threads lock upon crash/shutdown? We would need this as well in order to process the thread-local stacktrace buffers without a safepoint. @christianhaeubl

The better long-term solution is to decouple sampler buffers from their respective threads (similar to how we implement regular event data buffers). Then we wouldn't need a safepoint or the global threads mutex to process them. There's an old PR up for that here: #6365

In comparison, running the same example on the base commit, I see them:

This is because the JFR recording is ending normally (with a full chunk rotation). On OOME, after OutOfMemoryUtil.reportOutOfMemoryError is called, SubstrateVM tries to finish exiting, which includes calling the JDK-level shutdown hooks which end JFR. However, this is just a best-effort attempt and is not guaranteed to complete successfully because allocation can still happen (either in the JDK-level JFR code or at the SubstrateVM level).

zapster · 2025-07-28T08:19:18Z

Right now, only a modified JFR flush happens at the start of the emergency dump (not a full safepoint + chunk rotation).

This is because the JFR recording is ending normally (with a full chunk rotation). On OOME, after OutOfMemoryUtil.reportOutOfMemoryError is called, SubstrateVM tries to finish exiting, which includes calling the JDK-level shutdown hooks which end JFR. However, this is just a best-effort attempt and is not guaranteed to complete successfully because allocation can still happen (either in the JDK-level JFR code or at the SubstrateVM level).

Got it, thanks for the explanation. To summarize my understanding: in the current implementation, the OOME emergency flush only produces a limited dump (for various reasons). However, it flushes all collected data, even the data that it did not process due to these limitations. On the other hand, the shutdown hook is able to do a full dump, but might fail due to e.g. an OOME. Furthermore, in the situation where we did an emergency dump, the data is already flushed so the shutdown hook has no more interesting data to dump, resulting in a limited dump.

Assuming that the above is correct, we are facing a tradeoff: Either we have an emergency dump with limited information that is very likely to succeed, or we try our luck with the shutdown hook that might fail, but if it succeeds we have a rich dump.

As long as we have that tradeoff, I believe that we should give users a choice which approach to take. So adding an option to enable/disable emergency dumping seems like a good idea. Not sure whether it is sufficient to make it a hosted option or if we want to support it at run time. Also, not sure about the default behavior. @roberttoyonaga do you have some intuition how often the shutdown flush does not succeed in practice? @christianhaeubl what is your take?

PS: How is the situation on Hotspot? Are the emergency dumps limited there as well? Are there similar tradeoffs?

christianhaeubl · 2025-07-28T10:12:09Z

I think we should do the following:

The OutOfMemoryError emergency dump should do a full dump. Assuming that all the dumping is now already allocation free, it shouldn't be a problem to do the dumping in a VM operation (we would only need to use a NativeVMOperation instead of a JavaVMOperation to avoid allocating any Java heap memory for the VM operation object itself; see HeapDumpOperation for an example).
When an OutOfMemoryError is thrown (e.g., due to a memory leak), then there is currently no guarantee that all the shutdown hooks will be executed. I also don't think that we can guarantee this but we should at least make this code more similar to the JDK and add a try/catch block that ignores exceptions around each shutdown hook call in RuntimeSupport.executeHooks (similar to java.lang.Shutdown.runHooks()). But this can be a separate PR.
Dumping JFR on an actual crash (instead of an OutOfMemoryError) is a completely different story though and much harder to do as we don't know the state of the VM (e.g., we can't reliably execute complex code - even starting a VM operation could already result in a deadlock).
I don't think that we need any new options.
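The per-hook try/catch suggested in the second point could look roughly like this. It is a generic sketch, not the actual RuntimeSupport.executeHooks code: HookRunnerSketch and its methods are hypothetical names, and counting failures is added here only so the behavior can be observed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: runs shutdown hooks so that one failing hook cannot block the rest. */
final class HookRunnerSketch {
    private final List<Runnable> hooks = new ArrayList<>();

    void addHook(Runnable hook) {
        hooks.add(hook);
    }

    /** Runs every registered hook in order and returns how many of them threw. */
    int runHooks() {
        int failures = 0;
        for (Runnable hook : hooks) {
            try {
                hook.run();
            } catch (Throwable t) {
                // Deliberately swallowed (this includes OutOfMemoryError) so later hooks still run.
                failures++;
            }
        }
        return failures;
    }
}
```

With this shape, a hook that throws an OutOfMemoryError no longer prevents a later hook, such as the one that ends JFR, from being attempted, although that later hook can of course still fail itself.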

roberttoyonaga · 2025-07-28T13:56:02Z

Hi @zapster, that's correct, there would be a trade-off in its current state.

Do you have some intuition how often the shutdown flush does not succeed in practice

I can't say for sure, but when I've tested it I was able to get it to succeed pretty consistently. But there's no guarantee that some Java code doesn't allocate too much and the dump fails.

PS: How is the situation on Hotspot? Are the emergency dumps limited there as well? Are there similar tradeoffs?

The problem I described is specific to how sampler buffers are handled in SVM. I don't think Hotspot has this problem.

--

@christianhaeubl

The OutOfMemoryError emergency dump should do a full dump.

Great, doing a full chunk rotation would make things a lot better.

Dumping JFR on an actual crash (instead of an OutOfMemoryError) is a completely different story

Maybe we could still make a best-effort attempt and instead settle for a flush without stacktraces. We can leave this out of the scope of the current PR though.

--

I'll switch to doing a full chunk rotation. Thank you for your feedback everyone.

allocation free

81617ee

oracle-contributor-agreement bot added the OCA Verified label Jul 2, 2025

roberttoyonaga added 11 commits July 2, 2025 15:11

Compiling dump file is working

46e2d11

fix classloader issue

be0cd0a

type repo refactor

b062580

improve hashing for packages

2006521

make pathBuffer allocation free. Clean up

7e79ee6

JfrTypeRepository record classes upon event emission. Remove prealloc…

bfe91e7

…ated class list.

add hook next to heap dump code. Only process full sampler buffers.

e912a0b

qsort chunk files. Bug fix traceID check.

01b11b9

open emergency dump chunk

30b036a

Do markChunkFinal and patch header like regular rotatios. Add tests.

a831599

Emit old object sample and jdk.DumpReason events. Fix testEmergencyDu…

533da7a

…mp purge in-flight data bug. checkstyle gate fixes cleanup

roberttoyonaga force-pushed the emergency-dump branch 2 times, most recently from 1b11e12 to 3d1e62b Compare July 2, 2025 20:18

roberttoyonaga added native-image redhat-interest native-image-jfr labels Jul 2, 2025

roberttoyonaga force-pushed the emergency-dump branch 2 times, most recently from 6b0208c to 905d59d Compare July 3, 2025 15:40

style

847d6fb

roberttoyonaga force-pushed the emergency-dump branch from 905d59d to 847d6fb Compare July 3, 2025 16:26

roberttoyonaga marked this pull request as ready for review July 3, 2025 17:50

roberttoyonaga requested a review from christianhaeubl July 3, 2025 17:50

christianhaeubl requested a review from zapster July 4, 2025 08:36

zapster changed the title ~~Support JFR emergency dumps on out of memory~~ [GR-67169] Support JFR emergency dumps on out of memory Jul 4, 2025

zapster reviewed Jul 7, 2025

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/JfrEmergencyDumpFeature.java (outdated)

roberttoyonaga added 3 commits July 7, 2025 11:09

removed unused method

cbeda3e

style

62bf8a9

style

785f415
