Disable async quorum for the first quorum sync #112

Draft
wants to merge 1 commit into
base: main
from


fegin
Contributor

fegin commented Feb 19, 2025

If we don't wait for the first quorum, the trainer will continue to run forward and may use incorrect weights if the trainer is healing.
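
A minimal sketch of the failure mode described above, using a toy model rather than the project's actual manager. `ToyManager`, `wait_for_first_quorum`, `first_step`, and the weight values are all hypothetical names chosen for illustration. Healing is modeled as a deferred copy of recovered weights that is only applied once the quorum is waited on.

```python
class ToyManager:
    """Hypothetical stand-in for a fault-tolerance manager (not the real API)."""

    def __init__(self, wait_for_first_quorum: bool):
        self.wait_for_first_quorum = wait_for_first_quorum
        self._step = 0
        self._pending_heal = None

    def start_quorum(self, weights, recovered):
        # Healing is deferred: the recovered weights are only copied in
        # once the quorum has been waited on.
        def heal():
            weights[:] = recovered

        self._pending_heal = heal
        # Assumed behavior of the change: block on the very first quorum only.
        if self._step == 0 and self.wait_for_first_quorum:
            self.wait_quorum()
        self._step += 1

    def wait_quorum(self):
        # Apply the pending heal, if any, exactly once.
        if self._pending_heal is not None:
            self._pending_heal()
            self._pending_heal = None


def forward(weights, x):
    # Toy "model": a dot product of the weights with a constant input.
    return sum(w * x for w in weights)


def first_step(wait_for_first_quorum: bool) -> float:
    weights = [0.0, 0.0]    # stale weights on the healing replica
    recovered = [1.0, 2.0]  # weights held by healthy peers
    manager = ToyManager(wait_for_first_quorum)
    manager.start_quorum(weights, recovered)
    out = forward(weights, 3.0)  # forward runs right after start_quorum
    manager.wait_quorum()        # async path only resolves here
    return out


print(first_step(wait_for_first_quorum=False))  # forward saw stale weights
print(first_step(wait_for_first_quorum=True))   # forward saw recovered weights
```

In this toy model the asynchronous path computes the forward pass on the stale weights, while blocking on the first quorum lets the forward pass see the recovered ones. Whether the real trainer behaves this way depends on where healing is applied, which is the question raised in the reply below.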

facebook-github-bot added the CLA Signed label Feb 19, 2025
@d4l3k
Member

d4l3k commented Feb 19, 2025

@fegin do you have more details on where this is being triggered? We can recover in non-start cases, so we should figure out how to resolve this.

Are we not zeroing grads correctly during recovery?

Labels
CLA Signed
3 participants