
fix infinite recovery #217


Merged · 1 commit merged into pytorch:main on Jun 19, 2025
Conversation

@tushar00jain (Contributor) commented Jun 16, 2025

Summary:

  • we don't increase `max_step` when a node is catching up because we don't call `should_commit`
  • this can lead to the node always being behind and getting stuck in an infinite recovery loop (a minimal sketch of this failure mode follows after this list)
  • note that this can also cause the global parameters to fall out of sync; the diff includes an RFC on how to fix that if we need to
  • document another case where `should_commit` can return `True` when it shouldn't, because the allreduce failed (this is only relevant when there can be a pending in-flight allreduce)
  • add an assert based on the fragment sync schedule to make sure we don't run into this
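To make the failure mode concrete, here is a minimal, self-contained sketch (not torchft code; `ToyNode`, `run_round`, and the numbers are hypothetical): while a node is catching up it skips `should_commit`, so its `max_step` stays frozen at the checkpointed value while the healthy nodes keep advancing theirs, and the node is flagged as behind again at the next quorum.

```python
# Hypothetical toy model of the infinite-recovery loop described above
# (not torchft code; all names here are made up for illustration).

class ToyNode:
    def __init__(self, name: str, max_step: int = 0):
        self.name = name
        self.max_step = max_step  # step counter advanced by should_commit()

    def should_commit(self) -> None:
        # Simplified: a successful commit bumps max_step by one.
        self.max_step += 1


def run_round(healthy: ToyNode, recovering: ToyNode, steps: int, bug: bool) -> bool:
    """One 'round' between quorums; returns True if `recovering` is still behind."""
    # Recovery: the lagging node copies state (including max_step) from a peer.
    recovering.max_step = healthy.max_step
    for _ in range(steps):
        healthy.should_commit()
        if not bug:
            # Fixed behaviour: the catching-up node also advances max_step.
            recovering.should_commit()
        # Buggy behaviour: should_commit() is skipped while catching up,
        # so recovering.max_step stays frozen at the checkpointed value.
    return recovering.max_step < healthy.max_step


if __name__ == "__main__":
    healthy = ToyNode("healthy", max_step=100)
    lagging = ToyNode("lagging")
    for round_idx in range(3):
        behind = run_round(healthy, lagging, steps=5, bug=True)
        print(f"round {round_idx}: lagging behind -> {behind}")  # always True with the bug
```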

Test Plan:

  • tested on a cluster of 3 nodes by removing and adding a node
  • the `max_step` and `local_step` increase in the manager's state dict after both failure and recovery

metrics from the healthy node

[Screenshot 2025-06-15 at 10 53 28 PM]

metrics from the failed and recovered node

[Screenshot 2025-06-15 at 10 56 49 PM]

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Jun 16, 2025
@tushar00jain requested review from d4l3k and H-Huang on June 16, 2025 16:14
@tushar00jain marked this pull request as ready for review on June 16, 2025 16:14
@tushar00jain force-pushed the pr217 branch 2 times, most recently from 0bdb164 to 5c31448 on June 16, 2025 22:27
@d4l3k (Member) left a comment

LGTM, would be nice to add some mocked test for this case if possible to prevent regressions

@tushar00jain force-pushed the pr217 branch 6 times, most recently from fad3f3e to d61f4f4 on June 18, 2025 21:28
@@ -559,7 +592,7 @@ def __init__(
_StreamingDiLoCoFragment(
manager,
model_fragment,
- math.floor((sync_every / len(model_fragments)) * (i + 1)),
+ (sync_every // len(model_fragments) * (i + 1)),
d4l3k (Member) commented:

Should we assert that this is an exact multiple of model_fragments? I suppose it doesn't matter too much if it's not exact but might surprise people?

@tushar00jain (Contributor, Author) commented Jun 19, 2025:

i feel more asserts will require people to remember more things on how to tune the settings? don't have a strong pref if you think that's easier to reason about though, happy to make changes later as well
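For concreteness, here is a quick comparison of the old and new expressions when `sync_every` is not an exact multiple of the fragment count (the numbers below are made up for the example; the commented-out assert at the end is the one suggested above, not something this PR adds):

```python
# Quick comparison of the old and new sync-point formulas from this diff.
import math

sync_every, n = 10, 3  # deliberately not an exact multiple

old = [math.floor((sync_every / n) * (i + 1)) for i in range(n)]
new = [(sync_every // n) * (i + 1) for i in range(n)]

print(old)  # [3, 6, 10] -> last fragment lands exactly on sync_every
print(new)  # [3, 6, 9]  -> evenly spaced schedule, but it ends before sync_every

# The assert discussed above would simply reject the ambiguous case up front:
# assert sync_every % n == 0, "sync_every should be a multiple of the fragment count"
```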

fragment.prepare_sync()

for i, fragment in enumerate(self._fragments):
if not fragment.should_sync_fragment(step):
continue

if i not in self._first_prepare_sent:
d4l3k (Member) commented:

can you add a comment when we would get into this case? It looks like we set it in the loop before so i'm confused.

@tushar00jain (Contributor, Author) replied:

the loop before could run for a different fragment (not the one we're syncing) depending on the sync schedule
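A toy sketch of that answer (the schedule functions and names below are made up for illustration; they are not the real torchft ones): with a staggered schedule, the fragment prepared on a given step is generally not the fragment whose sync fires on that step, so the sync loop can reach a fragment whose first prepare was never sent.

```python
# Hypothetical staggered schedule with 3 fragments (not the real torchft logic).

NUM_FRAGMENTS = 3

def prepares_at(step: int, i: int) -> bool:
    # made up: fragment i is prepared one step before its sync point
    return (step + 1) % NUM_FRAGMENTS == i

def syncs_at(step: int, i: int) -> bool:
    # made up: fragment i syncs every NUM_FRAGMENTS steps, staggered by index
    return step % NUM_FRAGMENTS == i

first_prepare_sent = set()

for step in range(6):
    prepared = [i for i in range(NUM_FRAGMENTS) if prepares_at(step, i)]
    first_prepare_sent.update(prepared)  # the "loop before" in the diff
    synced = [i for i in range(NUM_FRAGMENTS) if syncs_at(step, i)]
    # Fragments that are about to sync but whose first prepare was never sent:
    # this is the case the `i not in self._first_prepare_sent` branch handles.
    missing = [i for i in synced if i not in first_prepare_sent]
    print(f"step {step}: prepared={prepared} synced={synced} never_prepared={missing}")
```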

@tushar00jain merged commit 9c12821 into pytorch:main on Jun 19, 2025
15 checks passed
@tushar00jain deleted the pr217 branch on June 23, 2025 04:25