Revert "fix: extend nccl timeout" #515
Conversation
DCO will need to be fixed.
For posterity, the revert is because:
- We identified a different root cause for the training failures we see in CI: insufficient disk space, due to FSDP mixed precision using 32-bit parameters for interim checkpoints.
- The timeout setting would delay the watchdog's failure in case of legitimate issues, which is not desirable.
- There are reports that the timeout= setting didn't take effect anyway, for reasons we don't know yet.
- The new AMI for the CI job was updated to use a larger 2 TB volume, enough to store all checkpoints.
- There may be a number of follow-ups to the identified issue with expanded storage consumption: at the very least we may need to inform users and customers about the new requirements; ideally we also switch back to FP16, or introduce sanity checks before entering the training phase to make sure enough storage is available to the process (see the sketch below).

In the meantime, the due course is to expand the volume size in CI (already done) and to revert the timeout hack, which had a questionable story behind it in the first place, to avoid unnecessary changes prior to release.
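A minimal sketch of such a pre-flight storage check, assuming a hypothetical helper name and a caller-supplied per-checkpoint size estimate (neither of which comes from the original code):

```python
import shutil


def ensure_checkpoint_space(output_dir: str, num_checkpoints: int, bytes_per_checkpoint: int) -> None:
    """Fail fast if the output volume cannot hold the expected checkpoints.

    bytes_per_checkpoint should assume full-precision (FP32) parameters while
    FSDP mixed precision writes 32-bit interim checkpoints.
    """
    required = num_checkpoints * bytes_per_checkpoint
    free = shutil.disk_usage(output_dir).free
    if free < required:
        raise RuntimeError(
            f"Insufficient disk space in {output_dir}: "
            f"need ~{required / 1e9:.1f} GB, have {free / 1e9:.1f} GB free"
        )
```

Running a check like this before the training loop would surface the storage problem directly instead of letting it manifest later as an apparent NCCL failure.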
Force-pushed from 4ba0f47 to 276c8fa
Signed-off-by: Charlie Doern <[email protected]>
Force-pushed from 276c8fa to 810946f
I fixed DCO to speed things up.
Smoke failure is tracked here: #505
@@ -566,15 +565,10 @@ def main(args):
     model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
     args.model_type = model_conf.model_type

-    # solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093
-    # gets converted to a timedelta of 1:40:00 if the default is kept
-    nccl_timeout = int(os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS", "6000000"))
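The rest of the removed hunk is not shown here; presumably the value read above was converted to a timedelta and passed to process-group initialization (6,000,000 ms is 100 minutes, i.e. the 1:40:00 timedelta mentioned in the comment). A minimal sketch of that pattern, not the exact removed code:

```python
from datetime import timedelta
import os

import torch.distributed as dist

# Read the override from the environment and hand it to init_process_group;
# the 6,000,000 ms default corresponds to the 1:40:00 timedelta noted above.
nccl_timeout = int(os.getenv("INSTRUCTLAB_NCCL_TIMEOUT_MS", "6000000"))
dist.init_process_group("nccl", timeout=timedelta(milliseconds=nccl_timeout))
```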
To be honest, I think it's worthwhile to keep this; in particular, it's very useful when debugging distributed training code. It might make more sense to just change the default here to be the same as what NCCL uses.
I think folks reported that the new timeout did not actually take effect in CI runs. I don't know the specifics, but if we are going to keep this, we should probably confirm that the timeout change actually takes effect. (For this, we could go the other way and crank it down to 1 ms and confirm that it crashes the training run.)
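A rough sketch of that check, with train.py standing in for whatever the actual training entry point is (the launch command is an assumption, not from this PR):

```python
# Hypothetical smoke test for the override: launch training with an absurdly
# small timeout and expect the run to abort on the first collective instead of
# hanging. If the run proceeds normally, the setting is not being honored.
import os
import subprocess

env = dict(os.environ, INSTRUCTLAB_NCCL_TIMEOUT_MS="1")
subprocess.run(
    ["torchrun", "--nproc_per_node", "2", "train.py"],
    env=env,
    check=False,
)
```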
@booxter I'm very skeptical of that claim. We found that the problem wasn't timeouts, but rather that writing will eventually break once the EC2 node runs out of storage, and the error will appear to be an NCCL failure. So I would bet on that being the cause of the issue rather than the NCCL timeout not being respected.
I've used this exact setting in my own debugging many times and have had it work perfectly.
My opinion is that it might still be useful to keep this in for debugging purposes, especially when doing local development. I vote that we keep this and just change the default timeout to the one used by NCCL.
We are keeping the variable, just adjusting its behavior a bit to be more friendly to torch defaults, and renaming it.
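A minimal sketch of what that adjustment might look like, assuming a hypothetical renamed variable (INSTRUCTLAB_DIST_TIMEOUT_MS is illustrative only) and falling back to torch's own default process-group timeout when it is unset:

```python
from datetime import timedelta
import os

import torch.distributed as dist
from torch.distributed.constants import default_pg_timeout

# INSTRUCTLAB_DIST_TIMEOUT_MS is a placeholder name for the renamed variable.
# When it is unset, defer to torch's default instead of hard-coding a value,
# so the NCCL watchdog still fails promptly on legitimate hangs.
env_timeout_ms = os.getenv("INSTRUCTLAB_DIST_TIMEOUT_MS")
timeout = (
    timedelta(milliseconds=int(env_timeout_ms))
    if env_timeout_ms is not None
    else default_pg_timeout
)

dist.init_process_group("nccl", timeout=timeout)
```

This keeps the debugging escape hatch the reviewers want, while runs that leave the variable unset behave exactly as stock torch/NCCL would.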
This pull request has merge conflicts that must be resolved before it can be merged.
Reverts #507