Revert "fix: extend nccl timeout" #515
Conversation
DCO will need to be fixed.
For posterity, the revert is because:
- We identified a different root cause for the training failures we see in CI: insufficient disk space, due to FSDP mixed precision using 32-bit parameters for interim checkpoints.
- The timeout setting would delay the watchdog's failure in case of legitimate issues, which is not desirable.
- There are reports that the timeout= setting didn't take effect anyway, for reasons we don't know yet.
- The new AMI for the CI job was updated to use a larger 2 TB volume, enough to store all checkpoints.
- There may be a number of follow-ups to the identified issue with expanded storage consumption: at the very least we may need to inform users and customers about the new requirements; ideally we also switch back to FP16, or introduce sanity checks before entering the training phase to make sure enough storage is available to the process (see the sketch below).

In the meantime, the due course is to expand the volume size in CI (already done) and to revert the timeout hack, which had a questionable story behind it in the first place, to avoid unnecessary changes prior to release.
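A minimal sketch of such a pre-flight storage check, assuming a hypothetical helper name and a caller-supplied per-checkpoint size estimate (neither of which comes from the original code):

```python
import shutil


def ensure_checkpoint_space(output_dir: str, num_checkpoints: int, bytes_per_checkpoint: int) -> None:
    """Fail fast if the output volume cannot hold the expected checkpoints.

    bytes_per_checkpoint should assume full-precision (FP32) parameters while
    FSDP mixed precision writes 32-bit interim checkpoints.
    """
    required = num_checkpoints * bytes_per_checkpoint
    free = shutil.disk_usage(output_dir).free
    if free < required:
        raise RuntimeError(
            f"Insufficient disk space in {output_dir}: "
            f"need ~{required / 1e9:.1f} GB, have {free / 1e9:.1f} GB free"
        )
```

Running a check like this before the training loop would surface the storage problem directly instead of letting it manifest later as an apparent NCCL failure.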
Force-pushed from 4ba0f47 to 276c8fa
Signed-off-by: Charlie Doern <[email protected]>
Force-pushed from 276c8fa to 810946f
I fixed DCO to speed things up.
Smoke failure is tracked here: #505
@@ -566,15 +565,10 @@ def main(args):
     model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
     args.model_type = model_conf.model_type

-    # solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093
-    # gets converted to a timedelta of 1:40:00 if the default is kept
-    nccl_timeout = int(os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS", "6000000"))
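The rest of the removed hunk is not shown here; presumably the value read above was converted to a timedelta and passed to process-group initialization (6,000,000 ms is 100 minutes, i.e. the 1:40:00 timedelta mentioned in the comment). A minimal sketch of that pattern, not the exact removed code:

```python
from datetime import timedelta
import os

import torch.distributed as dist

# Read the override from the environment and hand it to init_process_group;
# the 6,000,000 ms default corresponds to the 1:40:00 timedelta noted above.
nccl_timeout = int(os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS", "6000000"))
dist.init_process_group("nccl", timeout=timedelta(milliseconds=nccl_timeout))
```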
To be honest, I think it's worthwhile to keep this; in particular, it's very useful when debugging distributed training code. It might make more sense to just change the default here to be the same as what NCCL uses.
I think folks reported that the new timeout did not actually take effect in CI runs. I don't know the specifics, but if we are going to keep this, we should probably confirm that the timeout change actually takes effect. (For this, we could go the other way and crank it down to 1 ms and confirm that it crashes the training run.)
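A rough sketch of that check, with train.py standing in for whatever the actual training entry point is (the launch command is an assumption, not from this PR):

```python
# Hypothetical smoke test for the override: launch training with an absurdly
# small timeout and expect the run to abort on the first collective instead of
# hanging. If the run proceeds normally, the setting is not being honored.
import os
import subprocess

env = dict(os.environ, INSTRUCTLAB_NCCL_TIMEOUT_MS="1")
subprocess.run(
    ["torchrun", "--nproc_per_node", "2", "train.py"],
    env=env,
    check=False,
)
```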
@booxter I'm very skeptical of that claim. We found that the problem wasn't timeouts, but rather that writing will eventually break once the EC2 node runs out of storage, and the error will appear to be an NCCL failure. So I would bet on that being the cause of the issue rather than the NCCL timeout not being respected.
I've used this exact setting in my own debugging many times and have had it work perfectly.
My opinion is that it might still be useful to keep this in for debugging purposes, especially when doing local development. I vote that we keep this and just change the default timeout to the one used by NCCL.
We are keeping the variable, just adjusting its behavior a bit to be more friendly to torch defaults, and renaming it.
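A minimal sketch of what that adjustment might look like, assuming a hypothetical renamed variable (INSTRUCTLAB_DIST_TIMEOUT_MS is illustrative only) and falling back to torch's own default process-group timeout when it is unset:

```python
from datetime import timedelta
import os

import torch.distributed as dist
from torch.distributed.constants import default_pg_timeout

# INSTRUCTLAB_DIST_TIMEOUT_MS is a placeholder name for the renamed variable.
# When it is unset, defer to torch's default instead of hard-coding a value,
# so the NCCL watchdog still fails promptly on legitimate hangs.
env_timeout_ms = os.getenv("INSTRUCTLAB_DIST_TIMEOUT_MS")
timeout = (
    timedelta(milliseconds=int(env_timeout_ms))
    if env_timeout_ms is not None
    else default_pg_timeout
)

dist.init_process_group("nccl", timeout=timeout)
```

This keeps the debugging escape hatch the reviewers want, while runs that leave the variable unset behave exactly as stock torch/NCCL would.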
This pull request has merge conflicts that must be resolved before it can be merged.
Reverts #507