Update config_machines.xml for Feb 18 software update on Frontier. #6977
+17
−18
Various changes to `config_machines.xml` in preparation for system software updates on Frontier on 2025-02-18. Within the `frontier-scream-gpu` machine:

- Change `lmod` paths from `/usr/share` to `/opt/cray/pe`, because the `/usr/share` software is not maintained and is not available on internal OLCF test computers.
- Update `craygnuamdgpu` to the latest version of `cpe` available on Frontier, `cpe/24.11`.
- Use explicit module versions for `craygnuamdgpu`. I used to think this wasn't needed, since `module load cpe/24.11` should set all the appropriate defaults. I recently discovered that this setting of defaults only works if the `module load cpe/24.11` is a separate command before the other `module load`s. CIME uses a single `module load` with a list of modules, which does not update the defaults. I selected the explicit versions based on the `cpe/24.11` defaults, including the default for `rocm`, `rocm/6.2.4`.
- Remove the `libfabric` module from `craygnuamdgpu`. On Feb 18, the default `libfabric` will change from `libfabric/1.20.1` to `libfabric/1.22.0`, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the `libfabric` module from `craygnuamdgpu` so that it will stay with the default when it changes.
- Change `libfabric/1.15.2.0` to `libfabric/1.20.1` for `crayclang-scream`. On Feb 18, `libfabric/1.15.2.0` will disappear from Frontier, and the new default, `libfabric/1.22.0`, does not support Cray MPI versions before `cray-mpich/8.1.28`. Because `crayclang-scream` is stuck back at `cpe/22.12` and `cray-mpich/8.1.26`, we have to use `libfabric/1.20.1`. We previously switched from `libfabric/1.20.1` to `libfabric/1.15.2.0` because the former had a serious performance regression. That is supposed to be fixed. And, yes, this does mean that `crayclang-scream` will be using a version of `libfabric` that isn't officially supported with the `libfabric/1.22.0` drivers, but c'est la vie.

I ran an 8-node test of NE30 on Frontier and on our internal system (for
`libfabric/1.22.0`) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in `e3sm_timing.t.*`.

Timing table (values not preserved): columns `cpe`, `rocm`, `libfabric`, `CPM:ATM_RUN`; rows `crayclang-scream`, `craygnuamdgpu`.

It would probably be good to confirm with NE1024 runs.
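
For readers unfamiliar with the file, the `craygnuamdgpu` changes described above might look roughly like the fragment below. This is only a sketch under assumptions, not the diff from this PR: the element layout follows the CIME `config_machines.xml` module-system schema, the `type` and `allow_error` attribute values are assumed, the remainder of the `lmod` init path is left elided, and only the two versions named in the description (`cpe/24.11`, `rocm/6.2.4`) are shown.

```xml
<!-- Illustrative sketch only; attribute values and elided entries are assumptions. -->
<module_system type="module" allow_error="true">
  <!-- lmod init scripts are now taken from under /opt/cray/pe rather than /usr/share;
       the rest of the path is elided here. -->
  <init_path lang="sh">/opt/cray/pe/...</init_path>
  <cmd_path lang="sh">module</cmd_path>
  <modules compiler="craygnuamdgpu">
    <!-- Versions are spelled out explicitly because cpe/24.11 does not reset
         the defaults when it is loaded together with the other modules. -->
    <command name="load">cpe/24.11</command>
    <command name="load">rocm/6.2.4</command>
    <!-- Remaining modules, each with an explicit version, are elided.
         There is no libfabric entry, so the system default is used. -->
  </modules>
</module_system>
```

Leaving `libfabric` out of the list is what lets `craygnuamdgpu` follow the system default when it moves from `libfabric/1.20.1` to `libfabric/1.22.0`.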