From a cursory read-through of the codebase, my understanding is that TreeWidth is on the CustomSlurmSettings deny list because of clusters deployed with the ec2hostnames DNS configuration. In that case, TreeWidth must be set to MAX_VALUE so that slurmctld keeps communicating directly with each node (as opposed to utilizing fanout).
As a result, Managed/Domain DNS deployments have TreeWidth forcibly set to 30. On the one hand, this value seems too low for small clusters that may want to avoid faults introduced by relying on dynamic node fanout (which has caused issues as recently as pcluster 3.9.0-3.9.2 due to cloud_dns bug(s) in slurm 23.11.4-23.11.7). On the other hand, this value seems too high for clusters with fewer than 4000 nodes that want to use hierarchical messaging efficiently (it results in slurmctld spinning up more threads and using more resources than necessary).
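To make the trade-off concrete, here is a rough sketch of how the number of forwarding levels changes with TreeWidth. It uses a simplified model (every sender forwards to at most `tree_width` recipients, and each level is completely filled); it is not Slurm's actual forwarding code, and the function name `fanout_levels` is made up for illustration.

```python
def fanout_levels(node_count: int, tree_width: int) -> int:
    """Estimate how many forwarding levels are needed to reach every node.

    Simplified model (an assumption, not Slurm's implementation): the
    controller contacts up to tree_width nodes directly, and each node
    contacted at one level forwards to up to tree_width more nodes.
    """
    if node_count < 0 or tree_width < 1:
        raise ValueError("node_count must be >= 0 and tree_width >= 1")

    reached = 0      # nodes reached so far across all levels
    level_size = 1   # senders at the current level (starts with the controller)
    levels = 0
    while reached < node_count:
        level_size *= tree_width   # each sender fans out to tree_width nodes
        reached += level_size
        levels += 1
    return levels


if __name__ == "__main__":
    # A small cluster with the forced value of 30 is already single-hop.
    print(fanout_levels(20, 30))
    # A width at least as large as the cluster keeps all messaging direct.
    print(fanout_levels(4000, 4000))
    # Under this model, 4000 nodes need three levels at width 30.
    print(fanout_levels(4000, 30))
```

Under this model, a cluster with more than 30 nodes that wants strictly direct communication needs a larger TreeWidth, while a cluster that accepts a fixed number of hops can pick the smallest width that still reaches every node in that many levels.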
Allowing users to configure TreeWidth (for Managed/Domain DNS configured clusters) would let them balance fanout against their cluster's fault tolerance and scalability needs.
(Note: TreeWidth can currently be set by suppressing the CustomSlurmSettingsValidator, but turning that validator off comes with decent risk.)