Revert "fix: extend nccl timeout" #515

Closed

cdoern wants to merge 1 commit from the revert-507-nccl-timeout branch.

Conversation

@cdoern (Contributor) commented Apr 30, 2025

Reverts #507

@mergify mergify bot added the CI/CD, documentation, testing, and ci-failure labels Apr 30, 2025
@booxter (Contributor) left a comment

DCO will need to be fixed.


For posterity, the revert is because:

  • We identified a different root cause for the training failures seen in CI: insufficient disk space, because FSDP mixed precision writes 32-bit parameters for interim checkpoints.
  • The timeout setting would delay the watchdog's failure in the case of legitimate issues, which is not desirable.
  • There are reports that the timeout= setting did not take effect anyway, for reasons we don't yet know.
  • The new AMI for the CI job was updated to use a larger 2 TB volume, enough to store all checkpoints.
  • There may be a number of follow-ups to the identified increase in storage consumption: at the very least we may need to inform users and customers about the new requirements; ideally we also switch back to FP16, or introduce a sanity check before entering the training phase to make sure enough storage is available to the process (see the sketch below).

In the meantime, the course of action is to expand the volume size in CI (already done) and to revert the timeout hack, which had a questionable rationale in the first place, to avoid unnecessary changes before the release.
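For the storage sanity check mentioned above, a minimal sketch of the idea (hypothetical; check_free_space, args.output_dir, and the 200 GiB threshold are illustrative, not part of the current code):

    import shutil

    def check_free_space(path: str, required_gib: float) -> None:
        """Abort early if the checkpoint directory lacks the space training will need."""
        free_gib = shutil.disk_usage(path).free / 1024**3
        if free_gib < required_gib:
            raise RuntimeError(
                f"Only {free_gib:.1f} GiB free at {path}; "
                f"at least {required_gib:.1f} GiB is needed for checkpoints."
            )

    # e.g. call before entering the training loop:
    # check_free_space(args.output_dir, required_gib=200)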

@mergify mergify bot added the one-approval label Apr 30, 2025
@booxter booxter force-pushed the revert-507-nccl-timeout branch 2 times, most recently from 4ba0f47 to 276c8fa on April 30, 2025 14:56
@mergify mergify bot removed the ci-failure label Apr 30, 2025
@booxter booxter force-pushed the revert-507-nccl-timeout branch from 276c8fa to 810946f on April 30, 2025 14:57
@mergify mergify bot added the ci-failure label Apr 30, 2025
@booxter (Contributor) commented Apr 30, 2025

I fixed DCO to speed things up.

@mergify mergify bot added ci-failure and removed ci-failure labels Apr 30, 2025
@booxter (Contributor) commented Apr 30, 2025

Smoke failure is tracked here: #505

@booxter booxter requested a review from JamesKunstle April 30, 2025 21:42
@mergify mergify bot removed the ci-failure label Apr 30, 2025
@booxter booxter requested a review from RobotSail May 1, 2025 00:44
@@ -566,15 +565,10 @@ def main(args):
model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
args.model_type = model_conf.model_type

# solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093
# gets converted to a timedelta of 1:40:00 if the default is kept
nccl_timeout = int(os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS", "6000000"))
Member

To be honest, I think it's worthwhile to keep this, as it's particularly useful when debugging distributed training code. It might make more sense to just change the default here to match the default NCCL uses.
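As a rough sketch of that suggestion (an illustration rather than the code in this diff; it assumes the value is forwarded to torch.distributed.init_process_group, and only passes the timeout kwarg when the variable is set so torch's own default applies otherwise):

    import os
    from datetime import timedelta

    import torch.distributed as dist

    # Only override the timeout when the env var is set; otherwise fall back
    # to torch's default process-group timeout.
    timeout_ms = os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS")
    init_kwargs = {"backend": "nccl"}
    if timeout_ms is not None:
        init_kwargs["timeout"] = timedelta(milliseconds=int(timeout_ms))

    dist.init_process_group(**init_kwargs)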

Contributor

I think folks reported that the new timeout did not actually take effect in CI runs. I don't know the specifics, but if we are going to keep this, we should probably confirm that the timeout change actually takes effect. (For this, we could go the other way and crank it down to 1 ms and confirm that it crashes the training run.)
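A quick way to check that (a sketch only; the torchrun arguments and the train.py entry point are placeholders, not the project's actual command line):

    import os
    import subprocess

    # With an absurdly low timeout, the NCCL watchdog should abort the run;
    # if the run still completes, the env var is not being honored.
    env = dict(os.environ, INSTRUCTLAB_NCCL_TIMEOUT_MS="1")
    result = subprocess.run(
        ["torchrun", "--nproc_per_node", "2", "train.py"],  # placeholder entry point
        env=env,
    )
    assert result.returncode != 0, "timeout env var appears to be ignored"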

Member

@booxter I'm very skeptical of that claim. We found that the problem wasn't timeouts, but rather that writes eventually break once the EC2 node runs out of storage, and the error surfaces as an NCCL failure. So I would bet on that being the cause of the issue rather than the NCCL timeout not being respected.

I've used this exact setting in my own debugging many times and have had it work perfectly.

Contributor

OK let's try this then: #521

And a test that the timeout actually takes effect is here: #520

@RobotSail (Member) left a comment

My opinion is that it might still be useful to keep this for debugging purposes, especially during local development. I vote that we keep it and just change the default timeout to the one used by NCCL.

@booxter booxter added the hold label May 2, 2025
@booxter booxter self-requested a review May 6, 2025 21:34
@booxter (Contributor) left a comment

We are keeping the variable, just adjusting its behavior a bit to be friendlier to torch defaults, and renaming it.

@mergify mergify bot removed the one-approval label May 6, 2025
mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern, please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@booxter booxter closed this May 21, 2025
@booxter booxter deleted the revert-507-nccl-timeout branch May 22, 2025 20:41
Labels: CI/CD, documentation, hold, needs-rebase, testing