fix gradient allreduce #215

Merged Jun 16, 2025 (1 commit)
@tushar00jain (Contributor) commented Jun 15, 2025

Summary:

  • fix setting `_local_tensor` of a DTensor directly
  • fix bucketized allreduce to not use `parameter.grad`
  • simplify some code
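
To make the second point concrete, here is a minimal sketch of a bucketized allreduce that works on a separately saved gradient dict instead of reading or writing `parameter.grad`. It is not the code from this PR: plain Python lists stand in for tensors, and `allreduce_bucketized`, `grads`, `allreduce`, and `bucket_size` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List


def allreduce_bucketized(
    grads: Dict[str, List[float]],
    allreduce: Callable[[List[float]], List[float]],
    bucket_size: int,
) -> None:
    """Reduce saved gradients in flat buckets and write the results back.

    Hypothetical sketch: plain lists stand in for tensors, and `allreduce`
    stands in for a collective that returns the reduced flat buffer.
    """
    names = list(grads)
    i = 0
    while i < len(names):
        bucket_names: List[str] = []
        flat: List[float] = []
        # Pack whole gradients into one bucket until the size cap is reached.
        # A gradient larger than the cap still gets a bucket of its own.
        while i < len(names) and (
            not bucket_names or len(flat) + len(grads[names[i]]) <= bucket_size
        ):
            bucket_names.append(names[i])
            flat.extend(grads[names[i]])
            i += 1

        reduced = allreduce(flat)

        # Unpack the reduced buffer into the saved gradients, not into
        # any parameter's .grad attribute.
        offset = 0
        for name in bucket_names:
            n = len(grads[name])
            grads[name][:] = reduced[offset : offset + n]
            offset += n
```

Because the function only touches the `grads` dict, whatever the parameters currently hold in `.grad` is left alone until the caller explicitly copies the reduced values over.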

Test Plan:

  • added a test to validate that the gradients are saved and set correctly
  • the previous test in `local_sgd_test` fails because allreduce is not performed on `param.grad`
  • updated the test to first set the grads, then load the grads to make sure they reflect the allreduce result
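
The kind of check described in the test plan might look roughly like the sketch below. It assumes a simplified model in which gradients are saved to a side dict, reduced there, and only then assigned to the parameter. `Param`, `GradStore`, and their methods are hypothetical stand-ins, not the classes touched by this PR, and the two-replica average simulates the allreduce.

```python
from typing import Dict, List, Optional


class Param:
    """Stand-in for a model parameter; `grad` mimics `parameter.grad`."""

    def __init__(self) -> None:
        self.grad: Optional[List[float]] = None


class GradStore:
    """Hypothetical holder that keeps gradients apart from `Param.grad`."""

    def __init__(self, params: Dict[str, Param]) -> None:
        self._params = params
        self._grads: Dict[str, List[float]] = {}

    def save_grads(self, grads: Dict[str, List[float]]) -> None:
        # Copy so later reductions never alias the caller's lists.
        self._grads = {name: list(g) for name, g in grads.items()}

    def allreduce(self, peer_grads: Dict[str, List[float]]) -> None:
        # Simulated two-replica average, applied to the saved copies only.
        for name, g in self._grads.items():
            peer = peer_grads[name]
            self._grads[name] = [(a + b) / 2 for a, b in zip(g, peer)]

    def set_grads(self) -> None:
        # Only this step writes to the parameter's grad.
        for name, p in self._params.items():
            p.grad = self._grads[name]


params = {"w": Param()}
store = GradStore(params)
store.save_grads({"w": [1.0, 3.0]})
store.allreduce({"w": [3.0, 5.0]})
assert params["w"].grad is None  # the reduction did not touch the grad
store.set_grads()
assert params["w"].grad == [2.0, 4.0]  # the grad reflects the reduced values
```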

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the CLA Signed label Jun 15, 2025
@tushar00jain force-pushed the pr215 branch 4 times, most recently from 2546bbe to 7aba771, June 16, 2025 07:54
@tushar00jain requested review from d4l3k and H-Huang June 16, 2025 16:13
@tushar00jain marked this pull request as ready for review June 16, 2025 16:13
else:
    p.grad = self._grads[name]

del self._grads[name]
Member

how come del was removed?

@tushar00jain merged commit 87fbc95 into pytorch:main Jun 16, 2025
14 of 15 checks passed
@tushar00jain deleted the pr215 branch June 23, 2025 04:25