SIGSEGV when running gatk/4.6.0.0 #8988
Comments
Thank you for the detailed report. I passed this information to our partners at Intel, who maintain that library. Do you know if this is only happening with this particular input file? (i.e.
Another question: which distribution of the JVM are you running? We use and have sometimes seen issues with other distributions of Java 17. If you could try running the same job using our standard Docker environment, that might provide additional information.
I've asked, and they seemingly do have success with other files. As for Java, we use OpenJDK downloaded from https://jdk.java.net/
Because it's a shared cluster, we aren't able to run Docker directly. I attempted converting it into a Singularity container instead; it didn't crash in the same way, but the job still ended up failing. Logs are as follows.

For the "bare metal" known-crashing conditions (AMD-based machine), the final lines of the output are:

When running on Singularity (AMD-based machine):
(No further output)

Something else I'm noticing is that on our Intel machines, the crash happens at

I also notice that the Singularity container uses a slightly different version of Java, 17.0.9. I'll see about getting/building a newer version of Java and attempting to run gatk, and report back with any findings :)
Usually when we see silent failures like what's happening in Singularity, it's due to an out-of-memory error that results in the JVM process being rudely killed before it can output an error message. It's possible that's what's happening there.

If you're running in a container with a limited memory pool, you have to set the Java heap size explicitly with -Xmx, and also leave some memory for the system and for native code invoked by Java. For example, if you have a container with 8G of memory available, I would set -Xmx7g to leave a bit of overhead.

I think trying with a newer release of Java 17 is a good idea.
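The heap-sizing rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `suggest_xmx` and the default 1 GB of headroom are assumptions taken from the 8G / -Xmx7g example, and the right amount of headroom depends on how much memory the native code actually uses.

```python
def suggest_xmx(container_gb: int, overhead_gb: int = 1) -> str:
    """Return a -Xmx flag that leaves overhead_gb free for the OS and native code.

    Hypothetical helper: the default overhead mirrors the 8G -> -Xmx7g example
    and is not an official GATK recommendation.
    """
    if overhead_gb < 0:
        raise ValueError("overhead_gb must not be negative")
    heap_gb = container_gb - overhead_gb
    if heap_gb < 1:
        raise ValueError("container is too small for the requested overhead")
    return f"-Xmx{heap_gb}g"


if __name__ == "__main__":
    # 8 GB container with the default 1 GB of headroom.
    print(suggest_xmx(8))
    # 64 GB allocation with 4 GB of headroom.
    print(suggest_xmx(64, overhead_gb=4))
```

The resulting flag would typically be passed to the JVM through the gatk launcher's Java options.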
Java 17.0.12 from Oracle seems to display the same behavior.
I also attempted running within a Singularity container, allocating 64GB of memory to the job and specifying -Xmx60G. It still seemed to silently "crash". The command I ran was:
@emwjacobson I'm sorry I don't have a better solution. It seems likely it's some sort of input-data-specific crash in the GKL. If the user encounters the problem at the same point in the file every time, I would recommend that they work around the crash by excluding the approximate location by using the

I've reported the issue to our collaborators at Intel, but there are currently some structural changes happening on their end, so it might not be resolved quickly. They could run the problematic segment using the native Java HMM

It's also possible that the non-OMP version of the HMM might not hit the same issue. They could try with
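The option names in the comment above did not survive in this copy of the thread. As a rough sketch only, assuming the intended options are GATK's `--exclude-intervals` and HaplotypeCaller's `--pair-hmm-implementation` (an assumption, not confirmed by the text), the three workarounds could be assembled like this. The interval `chr1:1000000-1100000` is a made-up placeholder, and `build_workaround_args` is a hypothetical helper.

```python
def build_workaround_args(exclude_interval=None, pair_hmm=None):
    """Build extra HaplotypeCaller arguments for one of the suggested workarounds.

    Assumption: --exclude-intervals and --pair-hmm-implementation are the
    options the comment refers to; the thread itself does not show them.
    """
    args = []
    if exclude_interval:
        # Skip the region where the crash reproducibly occurs.
        args += ["--exclude-intervals", exclude_interval]
    if pair_hmm:
        # Choose a specific PairHMM implementation instead of the default.
        args += ["--pair-hmm-implementation", pair_hmm]
    return args


if __name__ == "__main__":
    # Placeholder interval; replace with the region around the actual crash site.
    print(" ".join(build_workaround_args(exclude_interval="chr1:1000000-1100000")))
    # Pure-Java HMM (slower, but avoids the native library).
    print(" ".join(build_workaround_args(pair_hmm="LOGLESS_CACHING")))
    # Native AVX HMM without OpenMP.
    print(" ".join(build_workaround_args(pair_hmm="AVX_LOGLESS_CACHING")))
```

These arguments would be appended to the existing HaplotypeCaller command. Which of them avoids the crash, if any, would need to be tested on the affected input.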
* Opening on behalf of a user on an HPC cluster; my knowledge in this field is a bit limited.
Affected tool(s) or class(es)
gatk HaplotypeCaller
Affected version(s)
Latest 4.6.0.0 release
Description
When running the command, the program crashes about 16 hours into the run. Below is the start of the Java error report file:
Steps to reproduce
The command that was run was:
Submitted to an HPC cluster using Slurm. Multiple machines were tested: one Intel machine with a Xeon E5-2683 v4 CPU and one AMD machine with an EPYC 7713 CPU.
This has also been run multiple times, all crashing at the same __memset_avx2_erms+0x11 instruction.

Other package versions that might be relevant:
java/17.0.2
glibc-common-2.28-225
If any more information is needed from me or the user, please let me know :)
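To check that repeated runs really crash at the same instruction, the problematic frame can be pulled out of each JVM error report. The sketch below assumes the usual HotSpot `hs_err_pid*.log` layout, where a `# Problematic frame:` line is followed by the frame itself. The sample text is a made-up minimal excerpt with a placeholder offset, not the user's actual log.

```python
import re


def problematic_frame(hs_err_text: str):
    """Return the line following '# Problematic frame:' or None if it is absent."""
    lines = hs_err_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("# Problematic frame:") and i + 1 < len(lines):
            # Drop the leading '#' and surrounding whitespace.
            return lines[i + 1].lstrip("#").strip()
    return None


def frame_symbol(frame: str):
    """Extract the trailing 'symbol+0xOFFSET' part of a frame line, if present."""
    match = re.search(r"(\S+\+0x[0-9a-fA-F]+)\s*$", frame)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up excerpt for illustration; the library offset is a placeholder.
    sample = (
        "# Problematic frame:\n"
        "# C  [libc.so.6+0x0]  __memset_avx2_erms+0x11\n"
    )
    frame = problematic_frame(sample)
    print(frame_symbol(frame))
```

Running this over the error report from each crash and comparing the extracted symbols gives a quick way to confirm that the failure site is identical across machines.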