Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify Xmx and Xms values according to Cromwell's ${MEM_SIZE} and ${MEM_UNIT} variables #481

Open
michaelgatzen opened this issue Oct 27, 2021 · 3 comments

Comments

@michaelgatzen
Copy link
Contributor

As described in some JIRA issues, the use of only the -Xms argument to the Java commands cause crashes, especially in the HaplotypeCaller_GATK4_VCF task in GermlineVariantDiscovery.wdl. This has been fixed using in #197 by supplying the -Xmx argument as well, which should be 1 GB smaller than the available memory on the machine. The available memory of the machine can differ from the value specified in the inputs though because of Cromwell's Retry with more memory feature. Therefore, we need to query the available memory. In the PR mentioned above, this has been done using the free command, however, the output format of this command can differ between container images, therefore, it is not very robust. Cromwell supplies the bash environment variables ${MEM_SIZE} and ${MEM_UNIT} for this purpose, unfortunately they are only available in the scope of the command section and not in the WDL section of the task. If we want to rely on these variables for determining the Java memory size, this code could be used (in this example, in the beginning of the command section of the HaplotypeCaller task):

    # We need at least 1 GB of available memory outside of the Java heap in order to execute native code, thus, limit
    # Java's memory by the total memory minus 1 GB. We need to obtain the machine memory from the MEM_SIZE and MEM_UNIT
    # environment variables as it might differ from the input variable because of Cromwell's retry with more memory feature.
    case ${MEM_UNIT} in
      TB)
        memory_exponent=2
        ;;
      GB)
        memory_exponent=1
        ;;
      MB)
        memory_exponent=0
        ;;
      KB)
        memory_exponent=-1
        ;;
      B|Bytes)
        memory_exponent=-2
        ;;
      *)
        echo Error: The MEM_UNIT environment variable has an unexpected value. >&2
        exit 1
    esac

    available_memory_mb=$(echo "print(int(${MEM_SIZE}*(1024**(${memory_exponent}))))" | python)
    let java_memory_size_mb=available_memory_mb-1024
    echo Total available memory: ${available_memory_mb} MB >&2
    echo Memory reserved for Java: ${java_memory_size_mb} MB >&2

    gatk --java-options "-Xmx${java_memory_size_mb}m -Xms${java_memory_size_mb}m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
...

This code relies on Python 3 being installed in the container image, but it can probably be written in a way that only relies on bash (or very common Linux commands). It is not very elegant, but this is the most robust implementation.

If we want to leverage the MEM_SIZE and MEM_UNIT variables, this code should be included before every (GATK/Picard) Java command in the pipeline.

@jonn-smith
Copy link

jonn-smith commented Nov 18, 2021

Can I suggest a simpler way to do this:

avail_mem_KB=$( cat /proc/meminfo | grep MemFree | awk '{print $2}' )
tot_mem_KB=$( cat /proc/meminfo | grep MemTotal | awk '{print $2}' )

let available_memory_mb=${avail_mem_KB}/1024
let java_memory_size_mb=${tot_mem_KB}/1024

echo Total available memory: ${available_memory_mb} MB >&2
echo Memory reserved for Java: ${java_memory_size_mb} MB >&2

gatk --java-options "-Xmx${java_memory_size_mb}m -Xms${java_memory_size_mb}m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \

/proc/meminfo is hard-coded to be in KB so we can always assume that's the unit.

This will remove the switch statement, but you'll still need this boilerplate in every WDL…

@michaelgatzen
Copy link
Contributor Author

Just documenting in this thread that in last week's discussion the point was made that while Jonn's solution is great for cloud runs, it does not work when running Cromwell on a system with shared resources. Although I wonder if the docker image (and therefore the /proc/meminfo file) would still only ever see the memory that was made available to it, and therefore it would work? @kachulis @jessicaway

@kevinpalis
Copy link
Contributor

This work is being tracked here: https://broadworkbench.atlassian.net/browse/PD-1856

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants