fix infinite recovery #217
Conversation
Force-pushed from 0bdb164 to 5c31448.
LGTM, it would be nice to add a mocked test for this case if possible, to prevent regressions.
Force-pushed from fad3f3e to d61f4f4.
torchft/local_sgd.py (Outdated)

```diff
@@ -559,7 +592,7 @@ def __init__(
             _StreamingDiLoCoFragment(
                 manager,
                 model_fragment,
-                math.floor((sync_every / len(model_fragments)) * (i + 1)),
+                (sync_every // len(model_fragments) * (i + 1)),
```
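To see how the two expressions differ when sync_every is not an exact multiple of the number of fragments, here is a small illustrative comparison (the values below are made up for illustration):

```python
import math

sync_every, n = 10, 3  # illustrative values; 10 is not a multiple of 3

old = [math.floor((sync_every / n) * (i + 1)) for i in range(n)]
new = [sync_every // n * (i + 1) for i in range(n)]

print(old)  # [3, 6, 10] -- uneven spacing, but the last fragment syncs exactly at sync_every
print(new)  # [3, 6, 9]  -- even spacing, but the last sync lands one step early
```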
Should we assert that sync_every is an exact multiple of len(model_fragments)? I suppose it doesn't matter too much if it's not exact, but it might surprise people?
I feel more asserts will require people to remember more about how to tune the settings? I don't have a strong preference if you think that's easier to reason about though; happy to make changes later as well.
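For reference, a minimal sketch of what such an assert could look like, assuming the constructor arguments shown in the diff (the helper name and placement are hypothetical, not part of this PR):

```python
# Hypothetical helper: validate the sync schedule up front so every
# fragment's sync point lands on an exact step boundary.
def _validate_sync_schedule(sync_every: int, num_fragments: int) -> None:
    assert sync_every % num_fragments == 0, (
        f"sync_every ({sync_every}) should be an exact multiple of the "
        f"number of model fragments ({num_fragments})"
    )
```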
```python
        fragment.prepare_sync()

    for i, fragment in enumerate(self._fragments):
        if not fragment.should_sync_fragment(step):
            continue

        if i not in self._first_prepare_sent:
```
Can you add a comment explaining when we would get into this case? It looks like we set it in the loop before, so I'm confused.
The loop before could run for a different fragment (not the one we're syncing) depending on the sync schedule.
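For reference, a sketch of how such a comment might read in the loop above (the wording here is hypothetical):

```python
if i not in self._first_prepare_sent:
    # The prepare loop above may have run for a different fragment than
    # the one we're syncing here, depending on the per-fragment sync
    # schedule, so this fragment's first prepare may not have been sent yet.
    ...
```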
Summary:

- we don't increase the `max_step` when a node is catching up, because we don't call `should_commit`
- this can lead to the node always being behind and getting into an infinite recovery loop
- note, this can result in the global parameters falling out of sync; the diff includes an RFC on how to fix that if we need to
- document another case where `should_commit` can return `True` when it shouldn't, because the allreduce failed (this is relevant only to the case where we can have a pending in-flight allreduce)
- add an assert based on the fragment sync schedule to make sure we don't run into this

Test Plan:

- tested on a cluster of 3 nodes by removing and adding a node
- the `max_step` and `local_step` increase in the manager's state dict after both failure and recovery

metrics from the healthy node

<img width="1103" alt="Screenshot 2025-06-15 at 10 53 28 PM copy" src="https://github.com/user-attachments/assets/8640780c-fd20-4266-aa3c-3116776a9c68" />

metrics from the failed and recovered node

<img width="1101" alt="Screenshot 2025-06-15 at 10 56 49 PM copy" src="https://github.com/user-attachments/assets/cc2a1c57-715f-4e0a-8e00-7c62da525dc3" />
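A minimal pseudocode sketch of the infinite-recovery failure mode described in the summary (the control flow here is hypothetical and simplified, not torchft's actual training loop):

```python
# Hypothetical, simplified loop illustrating the bug: while a node is
# catching up it never calls should_commit(), so its max_step never
# advances and it keeps falling back into recovery.
def train_loop(manager, num_steps: int, catching_up: bool) -> int:
    max_step = 0
    for step in range(1, num_steps + 1):
        if catching_up:
            # Before this fix: no should_commit() call on this path,
            # so max_step stays behind the healthy replicas.
            continue
        if manager.should_commit():
            max_step = step
    return max_step
```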
Stack created with Sapling. Best reviewed with ReviewStack.