Update config_machines.xml for Feb 18 software update on Frontier. #6977

Open
wants to merge 1 commit into
base: master

trey-ornl
Contributor

Various changes to config_machines.xml in preparation for system software updates on Frontier on 2025-02-18. Within the frontier-scream-gpu machine:

  • Change lmod paths from /usr/share to /opt/cray/pe because the /usr/share software is not maintained and is not available on internal OLCF test computers.
  • Update craygnuamdgpu to latest version of cpe available on Frontier, cpe/24.11.
  • Add explicit versions to the other modules in craygnuamdgpu. I used to think this wasn't needed, since module load cpe/24.11 should set all the appropriate defaults. I recently discovered that the defaults are only updated if module load cpe/24.11 runs as a separate command before the other module loads. CIME uses a single module load with a list of modules, which does not update the defaults. I selected the explicit versions based on the cpe/24.11 defaults, including the default for rocm, rocm/6.2.4.
  • Remove the libfabric module from craygnuamdgpu. On Feb 18, the default libfabric will change from libfabric/1.20.1 to libfabric/1.22.0, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the libfabric module from craygnuamdgpu so that it will stay with the default when it changes.
  • Change from libfabric/1.15.2.0 to libfabric/1.20.1 for crayclang-scream. On Feb 18, libfabric/1.15.2.0 will disappear from Frontier, and the new default, libfabric/1.22.0, does not support Cray MPI versions before cray-mpich/8.1.28. Because crayclang-scream is stuck back at cpe/22.12 and cray-mpich/8.1.26, we have to use libfabric/1.20.1. We previously switched from libfabric/1.20.1 to libfabric/1.15.2.0 because the former had a serious performance regression. That is supposed to be fixed. And, yes, this does mean that crayclang-scream will be using a version of libfabric that isn't officially supported with the libfabric/1.22.0 drivers, but such is life.
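
The version-pinning step described in the list above can be sketched as a small helper. This is only an illustration, not part of CIME: the function `pin_versions` and the mapping `CPE_24_11_DEFAULTS` are hypothetical names, and the mapping contains only the one default stated in this description (rocm/6.2.4). Defaults for any other module would have to be read off the system by hand.

```python
def pin_versions(modules, defaults):
    """Return the module list with an explicit version on every entry.

    modules  -- names as they would appear in a single `module load` call
    defaults -- mapping of module name to the version to pin (assumed to be
                collected by hand from the cpe release on the machine)
    """
    pinned = []
    for name in modules:
        if "/" in name:
            # Already carries an explicit version; leave it untouched.
            pinned.append(name)
        elif name in defaults:
            pinned.append(f"{name}/{defaults[name]}")
        else:
            # Refuse to guess: an unpinned module would silently follow
            # whatever the system default happens to be.
            raise KeyError(f"no pinned version known for module {name!r}")
    return pinned


# Hypothetical mapping; only the version stated in the PR description is used.
CPE_24_11_DEFAULTS = {"rocm": "6.2.4"}

print(pin_versions(["cpe/24.11", "rocm"], CPE_24_11_DEFAULTS))
# prints ['cpe/24.11', 'rocm/6.2.4']
```

A module that should keep following the system default, such as libfabric in craygnuamdgpu, is simply left out of the list rather than pinned.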

I ran an 8-node test of NE30 on Frontier and on our internal system (for libfabric/1.22.0) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in e3sm_timing.t.*.

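| Compiler | cpe | rocm | libfabric | CPM:ATM_RUN |
|---|---|---|---|---|
| crayclang-scream | 22.12 | 5.4.0 | 1.15.2.0 | 15.855: 16.571 |
| crayclang-scream | 22.12 | 5.4.0 | 1.20.1 | 15.885: 16.683 |
| craygnuamdgpu | 24.07 | 6.2.0 | 1.15.2.0 | 14.850: 15.594 |
| craygnuamdgpu | 24.11 | 6.2.4 | 1.15.2.0 | 14.727: 15.644 |
| craygnuamdgpu | 24.11 | 6.2.4 | 1.20.1 | 14.729: 15.447 |
| craygnuamdgpu | 24.11 | 6.2.4 | 1.22.0 | 14.990: 15.601 |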
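
To put a number on "similar performance", the timings in the table above can be turned into relative changes. The sketch below is an illustration only. It assumes the baseline for each compiler is the configuration in use before this PR (libfabric/1.15.2.0, and cpe/24.07 with rocm/6.2.0 for craygnuamdgpu), and it treats the two numbers in each timing pair independently without interpreting them. The names `TIMINGS`, `BASELINES`, `percent_change`, and `changes_vs_baseline` are hypothetical.

```python
# Timing pairs copied from the table, keyed by (compiler, cpe, rocm, libfabric).
# Each pair is kept in the order it is reported in the table.
TIMINGS = {
    ("crayclang-scream", "22.12", "5.4.0", "1.15.2.0"): (15.855, 16.571),
    ("crayclang-scream", "22.12", "5.4.0", "1.20.1"): (15.885, 16.683),
    ("craygnuamdgpu", "24.07", "6.2.0", "1.15.2.0"): (14.850, 15.594),
    ("craygnuamdgpu", "24.11", "6.2.4", "1.15.2.0"): (14.727, 15.644),
    ("craygnuamdgpu", "24.11", "6.2.4", "1.20.1"): (14.729, 15.447),
    ("craygnuamdgpu", "24.11", "6.2.4", "1.22.0"): (14.990, 15.601),
}

# Assumed baselines: the pre-change configuration for each compiler.
BASELINES = {
    "crayclang-scream": ("crayclang-scream", "22.12", "5.4.0", "1.15.2.0"),
    "craygnuamdgpu": ("craygnuamdgpu", "24.07", "6.2.0", "1.15.2.0"),
}


def percent_change(baseline, candidate):
    """Percent change of each timing in candidate relative to baseline."""
    return tuple(100.0 * (c - b) / b for b, c in zip(baseline, candidate))


def changes_vs_baseline():
    """Map every non-baseline configuration to its percent changes."""
    rows = {}
    for key, pair in TIMINGS.items():
        base_key = BASELINES[key[0]]
        if key != base_key:
            rows[key] = percent_change(TIMINGS[base_key], pair)
    return rows


for key, (first, second) in changes_vs_baseline().items():
    compiler, cpe, rocm, libfabric = key
    print(f"{compiler} cpe/{cpe} rocm/{rocm} libfabric/{libfabric}: "
          f"{first:+.2f}% / {second:+.2f}%")
```

With these baselines, every change in the table is under 1 percent in magnitude, which is small for a single short 8-node run.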

It would probably be good to confirm with NE1024 runs.

@grnydawn
Contributor

grnydawn commented Feb 6, 2025

@jgfouca, just to check, do you think crayclang-scream needs any changes due to these system software updates on Frontier?

@jgfouca
Member

jgfouca commented Feb 6, 2025

@grnydawn my guess, looking at the module changes, is no.

@ndkeen
Contributor

ndkeen commented Feb 6, 2025

Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

@trey-ornl
Contributor Author

> Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

The chart above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test libfabric/1.22.0 on the internal system.

I compared BFB using bfbhash> 234 on the NE30 runs.

  • Changing crayclang-scream from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu from cpe/24.07+rocm/6.2.0 to cpe/24.11+rocm/6.2.4, both with libfabric/1.15.2.0, was not BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.20.1 on Frontier to libfabric/1.22.0 on our internal computer was not BFB.
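
A comparison like the ones listed above could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure actually used: it assumes each hash line in a log contains the marker `bfbhash>` followed by a step number and then the hash text, and that 234 is such a step number. The helper names `hashes_at_step` and `is_bfb` and the sample log text are hypothetical.

```python
MARKER = "bfbhash>"


def hashes_at_step(log_text, step):
    """Collect the payload of every marker line whose first field is `step`."""
    found = []
    for line in log_text.splitlines():
        _, marker, rest = line.partition(MARKER)
        if not marker:
            continue  # not a hash line
        fields = rest.split()
        if fields and fields[0] == str(step):
            found.append(" ".join(fields[1:]))
    return found


def is_bfb(log_a, log_b, step):
    """True only if both logs have hash lines at `step` and they all match."""
    a = hashes_at_step(log_a, step)
    b = hashes_at_step(log_b, step)
    # An empty result means the step was never reached or the assumed line
    # format is wrong, so it is not reported as BFB.
    return bool(a) and a == b


# Made-up log fragments, for illustration only.
run_a = "bfbhash>   233 aaaa0000\nbfbhash>   234 1f2e3d4c\n"
run_b = "bfbhash>   233 aaaa0000\nbfbhash>   234 1f2e3d4c\n"
run_c = "bfbhash>   233 aaaa0000\nbfbhash>   234 deadbeef\n"

print(is_bfb(run_a, run_b, 234))  # prints True
print(is_bfb(run_a, run_c, 234))  # prints False
```

Treating a missing step as "not BFB" is a deliberate choice here, so a truncated run cannot pass by accident.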

@ndkeen
Contributor

ndkeen commented Feb 7, 2025

I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.
