Segfault in JDK 21 processing tests #17429

Open
gianm opened this issue Oct 28, 2024 · 13 comments

@gianm
Contributor

gianm commented Oct 28, 2024

We've been seeing a high rate of segfaults recently in the processing unit tests under JDK 21. Here's an example: https://github.com/apache/druid/actions/runs/11528594002/job/32166241444?pr=17414. The common thread is an error like this:

Current CompileTask:
C2:1780380 144967       4       org.apache.druid.query.filter.InDimFilter::optimizeLookup (744 bytes)

Stack: [0x00007f9c045fb000,0x00007f9c046fb000],  sp=0x00007f9c046f6660,  free space=1005k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xf15629]  BoolNode::Ideal(PhaseGVN*, bool)+0x19
V  [libjvm.so+0xd6258e]  PhaseIterGVN::transform_old(Node*)+0x9e
V  [libjvm.so+0x6ba360]  Conv2BNode::Ideal(PhaseGVN*, bool)+0x180
V  [libjvm.so+0xd6258e]  PhaseIterGVN::transform_old(Node*)+0x9e
V  [libjvm.so+0xd5e4a9]  PhaseIterGVN::optimize()+0xf9
V  [libjvm.so+0x6712cf]  Compile::Optimize()+0xf9f
V  [libjvm.so+0x672a06]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0xf26
V  [libjvm.so+0x598acb]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x18b
V  [libjvm.so+0x678ee4]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xca4
V  [libjvm.so+0x67c0d8]  CompileBroker::compiler_thread_loop()+0x6a8
V  [libjvm.so+0x931f40]  JavaThread::thread_main_inner()+0x1e0
V  [libjvm.so+0xf86b88]  Thread::call_run()+0xa8
V  [libjvm.so+0xd103ba]  thread_native_entry(Thread*)+0xda

Things that did not help include:

Since this is in the C2 compiler, and only happens on JDK 21, it seems likely to be a JDK bug rather than something we're doing wrong.
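
If anyone wants to double-check C2's involvement locally, here's a hedged sketch: CompileCommand=exclude is a standard HotSpot flag, but whether surefire forwards -DargLine here, and whether excluding this one method actually avoids the crash, are assumptions on my part.

# hypothetical diagnostic: exclude the suspect method from JIT compilation
# entirely; if the segfault stops, that points further at C2 (not a fix)
mvn test -pl processing \
  -DargLine=-XX:CompileCommand=exclude,org.apache.druid.query.filter.InDimFilter::optimizeLookup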

@pranavbhole has raised a support case with Azul at https://support.azul.com/hc/en-us/requests/65354. We don't have any special support contract with them, so hopefully they respond out of the goodness of their hearts. Since this issue happens on Corretto too, it's possibly a broader OpenJDK issue, so we could also consider raising a bug with OpenJDK.

@gianm gianm added the Bug label Oct 28, 2024
@asdf2014
Member

Hi @gianm , it appears that a similar issue has already been reported and resolved in JDK-8323190. You may want to confirm this by checking the details in the issue report 😄

@pranavbhole
Contributor

I received a response from Azul regarding the Zulu JDK crash. I informed them that the issue is consistently reproducible (still occurring with the latest master commit). Azul has requested a core dump as well as a list of shared libraries to analyze the crash.

There is

# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/runner/work/druid/druid/processing/core.3533)

and

rlimit (soft/hard): STACK 16384k/infinity , CORE 0k/infinity , NPROC 63899/63899 , NOFILE 65536/65536 , AS infinity/infinity , CPU infinity/infinity , DATA infinity/infinity , FSIZE infinity/infinity , MEMLOCK 8192k/8192k

in the hs_err file, which indicates a core dump might have been generated after the crash happened.
In case you manage to locate the core file, please collect the shared libs as well.
You can use this command:

grep '/.*\.so' hs_err_pid1913.log | awk '{ print $NF }' | uniq > libs.txt

Then run the following command on the resulting libs.txt file from above:

tar -czf libs.tgz -T libs.txt

Upload core and libs.tgz to a file sharing resource of your convenience and share the link to download.

@pranavbhole
Contributor

Looks like we have a setting in place to check whether a core.pid file was dumped, but unfortunately core dumps are not being generated: per the rlimit line in the hs_err file above, the soft core-size limit is 0 (CORE 0k), so no core file gets written.
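
If we do want cores out of CI, a hedged sketch of what would need to change before the tests run (standard Linux commands; the apport handler comes from the hs_err quote above, and that the runner permits sysctl is an assumption):

# raise the soft core-size limit from 0 so the kernel will write a core
ulimit -c unlimited
# write plain core.<pid> files instead of piping crashes to apport
sudo sysctl -w kernel.core_pattern=core.%p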

@kgyrtkirk
Member

this issue has been blocking me for a few days now - so I was forced to take a look ;)

so far my observations are:

  • the issue happened even earlier, with different GHA ubuntu images
  • it's nondeterministic; I've seen it happen with 2-3 tests; the most common victim is AndFilterTest, but it could be SelectorFilterTest as well - so it's not tied to a single test class
  • I've blind-tuned the JIT to be more aggressive... right now I'm not sure whether that even mattered
  • I was able to reproduce it locally after I switched to OpenJDK Runtime Environment Zulu21.38+21-CA (build 21.0.5+11-LTS)
    • I was running 21.0.4 earlier, which was unaffected across several attempts
    • switching to 21.0.5 showed the issue right away, so it must have been introduced in 21.0.5
  • the issue consistently happens while compiling org.apache.druid.query.filter.InDimFilter::optimizeLookup
  • my attempts to get a disassembly have so far been unsuccessful; I kept running into hsdis-amd64.so being missing (the flags I mean are sketched after this list)
  • all tests run in a single JVM -> the JVM retains method invocation counts from earlier tests
  • surefire test execution order is unspecified - which makes surefire run them in filesystem order (which becomes essentially random in CI runs)
  • right now the shortest command that still reproduces the issue is:
time mvn install -pl processing/ -Dtest=AndFilterTest,SuperSorterTest*,InFilterTests,*FilterTest -Pskip-static-checks
  • even though the test execution order is the same, I've seen crashes in 3 different tests (NotFilterTest; AndFilterTest; SelectorFilterTest)
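
for reference, a hedged sketch of the flags I mean, added to the test JVM via surefire's argLine (standard HotSpot diagnostic options; actually getting assembly output requires an hsdis-amd64.so on the JVM's library path, which is the part I'm missing):

# shows when optimizeLookup reaches tier 4 (C2) during the test run
-XX:+PrintCompilation
# dumps the method's disassembly once hsdis-amd64.so is available
-XX:CompileCommand=print,org.apache.druid.query.filter.InDimFilter::optimizeLookup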

I think right now we have the following options:

  • debug it further
    • even if we uncover the underlying issue, a new JDK release will be needed to get rid of the problem, which might take some time... so we will need an alternate fix until that happens
  • disable the affected tests: not really an option, since the C2 compilation of InDimFilter::optimizeLookup may trigger the issue from other tests as well
  • from the call stack it seems to be connected to the C2 compiler - so disabling C2 optimization for tests on 21 might also be a possible way out (a sketch follows this list)
  • possibly force 21.0.4 to be used
  • try not to reuse the JVM for subsequent test runs (reuseForks)
  • other options?
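
to make the C2-disabling and reuseForks ideas concrete, a minimal sketch (argLine and reuseForks are standard surefire properties; that -XX:TieredStopAtLevel=1 sidesteps the crash is an assumption - it caps compilation at C1 so C2 never runs):

# keep the JIT at C1 only, and give each test class a fresh JVM
mvn install -pl processing/ -Pskip-static-checks \
  -DargLine=-XX:TieredStopAtLevel=1 \
  -DreuseForks=false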

@gianm
Contributor Author

gianm commented Nov 5, 2024

@kgyrtkirk shared a core dump and libs archive with me offline. I've attached them to the Azul support case.

@gianm
Contributor Author

gianm commented Nov 6, 2024

disabling C2 optimization for tests on 21 might also be a possible way out

This seems like the best temporary approach. I raised #17456 to attempt it.

@gianm
Contributor Author

gianm commented Nov 6, 2024

Now a quidem test is failing with:

*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message transform method call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 884

Maybe JDK 21 is too buggy and we should switch off its unit tests for now 😛

@abhishekrb19
Contributor

Switching the JDK distribution to temurin in #17449 seems to avoid the segfaults seen in the JDK 21 tests, at least across four repeated CI runs. Any thoughts on switching the distribution from zulu to temurin? This is similar to #17426, but the temurin build appears to help.

cc: @gianm @kgyrtkirk

@gianm
Contributor Author

gianm commented Nov 7, 2024

Switching the JDK distribution to temurin in #17449 seems to avoid the segfaults seen in the JDK 21 tests, at least across four repeated CI runs. Any thoughts on switching the distribution from zulu to temurin?

The logs in #17449 say the JDK version is 21.0.4+7. Given we only started seeing this problem with Zulu in 21.0.5 (21.0.4 was fine), it's probably just working because the temurin JDK happens to be older right now. But that might change. I guess it will work for some time though, until the temurin build upgrades.

gianm added a commit to gianm/druid that referenced this issue Nov 7, 2024
To work around apache#17429, run our JDK 21 workflows with
version 21.0.4. It does not appear to have this problem.
@gianm
Contributor Author

gianm commented Nov 7, 2024

Trying to pin to JDK 21.0.4: #17458
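
For reference, roughly what the pin looks like in a workflow file (a sketch assuming actions/setup-java, which accepts an exact patch version; see #17458 for the actual change):

- uses: actions/setup-java@v4
  with:
    distribution: 'zulu'
    java-version: '21.0.4'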

@abhishekrb19
Contributor

Ah got it, I didn't notice the difference in JDK update version with temurin. Pinning the version to JDK 21.0.4 makes sense.

@kgyrtkirk
Member

fyi: I've run temurin 21.0.5 locally - and it did produce the issue; so we won't be out of the water with it... I should have closed the PR around that time....

but I've encountered a 21.0.5 which did pass these tests: using the openjdk-21-jdk package shipped with debian/trixie I was not able to reproduce the issue (this doesn't necessarily mean it's not there; but there is a possibility that it was patched)

I think we should go with 21.0.4 until it gets fixed

gianm added a commit that referenced this issue Nov 7, 2024
* Run JDK 21 workflows with 21.0.4.

To work around #17429, run our JDK 21 workflows with
version 21.0.4. It does not appear to have this problem.

* Undo changes in standard-its.yml

* Add comments.

---------

Co-authored-by: Zoltan Haindrich <[email protected]>
@gianm
Contributor Author

gianm commented Nov 7, 2024

"Fixed" for now in #17458. It's not really fixed for good, this is just a workaround. Hopefully the Azul folks come through for us. If not, it sounds like we'd be able to raise this issue with various other JDK suppliers, as we've seen it in others too.

jtuglu-netflix pushed a commit to jtuglu-netflix/druid that referenced this issue Nov 20, 2024
* Run JDK 21 workflows with 21.0.4.

To work around apache#17429, run our JDK 21 workflows with
version 21.0.4. It does not appear to have this problem.

* Undo changes in standard-its.yml

* Add comments.

---------

Co-authored-by: Zoltan Haindrich <[email protected]>
clintropolis pushed a commit to clintropolis/druid that referenced this issue Nov 22, 2024
* Run JDK 21 workflows with 21.0.4.

To work around apache#17429, run our JDK 21 workflows with
version 21.0.4. It does not appear to have this problem.

* Undo changes in standard-its.yml

* Add comments.

---------

Co-authored-by: Zoltan Haindrich <[email protected]>
clintropolis added a commit that referenced this issue Nov 22, 2024
* Run JDK 21 workflows with 21.0.4.

To work around #17429, run our JDK 21 workflows with
version 21.0.4. It does not appear to have this problem.

* Undo changes in standard-its.yml

* Add comments.

---------

Co-authored-by: Gian Merlino <[email protected]>
Co-authored-by: Zoltan Haindrich <[email protected]>