
fix infinite recovery #217


Merged · 1 commit merged into pytorch:main on Jun 19, 2025
Conversation

@tushar00jain (Contributor) commented Jun 16, 2025

Summary:

  • we don't increase `max_step` when a node is catching up because we don't call `should_commit`
  • this can lead to the node always being behind and getting stuck in an infinite recovery loop (a minimal sketch of this failure mode follows after this list)
  • note that this can also cause the global parameters to fall out of sync; the diff includes an RFC on how to fix that if we need to
  • document another case where `should_commit` can return `True` when it shouldn't, because the allreduce failed (this is only relevant when there can be a pending in-flight allreduce)
  • add an assert based on the fragment sync schedule to make sure we don't run into this
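To make the failure mode concrete, here is a minimal, self-contained sketch (not torchft code; `ToyNode`, `run_round`, and the numbers are hypothetical): while a node is catching up it skips `should_commit`, so its `max_step` stays frozen at the checkpointed value while the healthy nodes keep advancing theirs, and the node is flagged as behind again at the next quorum.

```python
# Hypothetical toy model of the infinite-recovery loop described above
# (not torchft code; all names here are made up for illustration).

class ToyNode:
    def __init__(self, name: str, max_step: int = 0):
        self.name = name
        self.max_step = max_step  # step counter advanced by should_commit()

    def should_commit(self) -> None:
        # Simplified: a successful commit bumps max_step by one.
        self.max_step += 1


def run_round(healthy: ToyNode, recovering: ToyNode, steps: int, bug: bool) -> bool:
    """One 'round' between quorums; returns True if `recovering` is still behind."""
    # Recovery: the lagging node copies state (including max_step) from a peer.
    recovering.max_step = healthy.max_step
    for _ in range(steps):
        healthy.should_commit()
        if not bug:
            # Fixed behaviour: the catching-up node also advances max_step.
            recovering.should_commit()
        # Buggy behaviour: should_commit() is skipped while catching up,
        # so recovering.max_step stays frozen at the checkpointed value.
    return recovering.max_step < healthy.max_step


if __name__ == "__main__":
    healthy = ToyNode("healthy", max_step=100)
    lagging = ToyNode("lagging")
    for round_idx in range(3):
        behind = run_round(healthy, lagging, steps=5, bug=True)
        print(f"round {round_idx}: lagging behind -> {behind}")  # always True with the bug
```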

Test Plan:

  • tested on a cluster of 3 nodes by removing and adding a node
  • the `max_step` and `local_step` increase in the manager's state dict after both failure and recovery

metrics from the healthy node

[Screenshot 2025-06-15 at 10 53 28 PM]

metrics from the failed and recovered node

[Screenshot 2025-06-15 at 10 56 49 PM]

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Jun 16, 2025
@tushar00jain requested review from d4l3k and H-Huang on June 16, 2025 16:14
@tushar00jain marked this pull request as ready for review on June 16, 2025 16:14
@tushar00jain force-pushed the pr217 branch 2 times, most recently from 0bdb164 to 5c31448 on June 16, 2025 22:27
@d4l3k (Member) left a comment

LGTM, would be nice to add some mocked test for this case if possible to prevent regressions

@tushar00jain force-pushed the pr217 branch 6 times, most recently from fad3f3e to d61f4f4 on June 18, 2025 21:28
@@ -559,7 +592,7 @@ def __init__(
_StreamingDiLoCoFragment(
manager,
model_fragment,
- math.floor((sync_every / len(model_fragments)) * (i + 1)),
+ (sync_every // len(model_fragments) * (i + 1)),
d4l3k (Member) commented:

Should we assert that this is an exact multiple of model_fragments? I suppose it doesn't matter too much if it's not exact but might surprise people?

@tushar00jain (Contributor, Author) commented Jun 19, 2025:

i feel more asserts will require people to remember more things on how to tune the settings? don't have a strong pref if you think that's easier to reason about though, happy to make changes later as well
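For concreteness, here is a quick comparison of the old and new expressions when `sync_every` is not an exact multiple of the fragment count (the numbers below are made up for the example; the commented-out assert at the end is the one suggested above, not something this PR adds):

```python
# Quick comparison of the old and new sync-point formulas from this diff.
import math

sync_every, n = 10, 3  # deliberately not an exact multiple

old = [math.floor((sync_every / n) * (i + 1)) for i in range(n)]
new = [(sync_every // n) * (i + 1) for i in range(n)]

print(old)  # [3, 6, 10] -> last fragment lands exactly on sync_every
print(new)  # [3, 6, 9]  -> evenly spaced schedule, but it ends before sync_every

# The assert discussed above would simply reject the ambiguous case up front:
# assert sync_every % n == 0, "sync_every should be a multiple of the fragment count"
```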

fragment.prepare_sync()

for i, fragment in enumerate(self._fragments):
if not fragment.should_sync_fragment(step):
continue

if i not in self._first_prepare_sent:
d4l3k (Member) commented:

can you add a comment when we would get into this case? It looks like we set it in the loop before so i'm confused.

@tushar00jain (Contributor, Author) replied:

the loop before could run for a different fragment (not the one we're syncing) depending on the sync schedule
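A toy sketch of that answer (the schedule functions and names below are made up for illustration; they are not the real torchft ones): with a staggered schedule, the fragment prepared on a given step is generally not the fragment whose sync fires on that step, so the sync loop can reach a fragment whose first prepare was never sent.

```python
# Hypothetical staggered schedule with 3 fragments (not the real torchft logic).

NUM_FRAGMENTS = 3

def prepares_at(step: int, i: int) -> bool:
    # made up: fragment i is prepared one step before its sync point
    return (step + 1) % NUM_FRAGMENTS == i

def syncs_at(step: int, i: int) -> bool:
    # made up: fragment i syncs every NUM_FRAGMENTS steps, staggered by index
    return step % NUM_FRAGMENTS == i

first_prepare_sent = set()

for step in range(6):
    prepared = [i for i in range(NUM_FRAGMENTS) if prepares_at(step, i)]
    first_prepare_sent.update(prepared)  # the "loop before" in the diff
    synced = [i for i in range(NUM_FRAGMENTS) if syncs_at(step, i)]
    # Fragments that are about to sync but whose first prepare was never sent:
    # this is the case the `i not in self._first_prepare_sent` branch handles.
    missing = [i for i in synced if i not in first_prepare_sent]
    print(f"step {step}: prepared={prepared} synced={synced} never_prepared={missing}")
```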

@tushar00jain merged commit 9c12821 into pytorch:main on Jun 19, 2025
15 checks passed
@tushar00jain deleted the pr217 branch on June 23, 2025 04:25