New OptimizerInBackward #2719

Merged: 25 commits into pytorch:main on May 19, 2025

Conversation

@joecummings (Contributor) commented on May 12, 2025

OptimizerInBackward

Motivation

Previously, we used a process very similar to the one outlined in this blog post. Although it worked, it required a lot of if/else switching in the recipe depending on what composed well with the optimizer step fused into the backward pass and what did not.
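For context, here is a minimal sketch of the hook-based pattern that blog post describes (illustrative only, not torchtune's previous utility code): each parameter gets its own optimizer, stepped from a post-accumulate-grad hook, so anything downstream that expects a single optimizer object needs special-casing.

    import torch

    model = torch.nn.Linear(10, 10)

    # One optimizer per parameter instead of a single optimizer over all params.
    optim_dict = {p: torch.optim.AdamW([p], lr=2e-5) for p in model.parameters()}

    def step_param(param: torch.Tensor) -> None:
        # Runs as soon as this parameter's gradient has been accumulated,
        # so its gradient memory can be freed before backward() finishes.
        optim_dict[param].step()
        optim_dict[param].zero_grad()

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(step_param)

    # No explicit optimizer.step(): the updates happen inside backward().
    loss = model(torch.randn(2, 10)).sum()
    loss.backward()

    # LR schedulers, checkpointing, and gradient clipping now have to branch on
    # `optim_dict` vs. a normal optimizer -- the if/else switching this PR removes.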

Goal

Simplify our recipes (729 LOC -> 677 LOC)! This PR provides a canonical OptimizerInBackward class that can be used as a drop-in replacement for any PyTorch optimizer. It is integrated into the full_finetune_single_device.py recipe, with new tests and updated documentation.
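A rough usage sketch of the drop-in behavior; the constructor arguments below (params, an optimizer class, and its kwargs) are an assumption based on the docs snippet quoted in the review thread further down, so check torchtune/modules/optim.py in the diff for the exact signature.

    import torch
    from torchtune.modules.optim import OptimizerInBackward  # added by this PR

    model = torch.nn.Linear(10, 10)

    # Assumed interface: wrap any PyTorch optimizer class plus its kwargs.
    optimizer = OptimizerInBackward(
        params=model.parameters(),
        optimizer_cls=torch.optim.AdamW,
        lr=2e-5,
    )

    # Because it exposes the standard optimizer interface (state_dict,
    # load_state_dict, param_groups), the recipe no longer needs to branch on
    # whether the step is fused into backward; the parameter updates themselves
    # still run during backward().
    loss = model(torch.randn(2, 10)).sum()
    loss.backward()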

Testing

(Screenshot of test results, 2025-05-15)

To-do

  • Integrate into full_finetune_distributed.py
  • Integrate into rest of recipes
  • Deprecate the utility functions that were previously used to create the optim-in-backward hooks

pytorch-bot (bot) commented on May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2719

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ba8a975 with merge base e5ee1b2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label on May 12, 2025
@joecummings changed the title from "[WIP] FusedOptimizerInBackward (redesign)" to "New OptimizerInBackward" on May 14, 2025
@joecummings marked this pull request as ready for review on May 14, 2025 at 22:37
@codecov-commenter commented
Codecov Report

Attention: Patch coverage is 74.10072% with 36 lines in your changes missing coverage. Please review.

Project coverage is 60.12%. Comparing base (c8e670b) to head (5674f9a).
Report is 4 commits behind head on main.

Files with missing lines                  Patch %   Lines
recipes/full_finetune_single_device.py     0.00%    35 Missing ⚠️
torchtune/modules/optim.py                97.05%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2719      +/-   ##
==========================================
- Coverage   60.64%   60.12%   -0.52%     
==========================================
  Files         428      432       +4     
  Lines       26091    26520     +429     
==========================================
+ Hits        15823    15946     +123     
- Misses      10268    10574     +306     

☔ View full report in Codecov by Sentry.

The next review thread is on the documentation diff; quoted context:

    ...with a model with a lot of parameters, and when you don't need to use :ref:`gradient accumulation <glossary_grad_accm>`.

    .. code-block:: python

        OptimizerInBackward(
Collaborator commented:
I know this was already discussed, but looking at it here it's confusing that we have a class that's meant to be a drop-in optimizer, yet if you actually want to use it in the recipe we aren't using torchtune's regular instantiation-based configuration for this component; instead we rely on logic inside the recipe itself to construct the optimizer.

I don't have a neat solution that doesn't involve breaking all the configs, unfortunately. Maybe we don't advertise this as a separate modular component but as a core recipe feature? Power users can bear the brunt of implementing it in their own recipes.

@joecummings (author) replied:
Fair, I'll work on the wording here.
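To make the trade-off in this thread concrete, here is a hedged sketch of the two options; the YAML keys under option A follow torchtune's existing `_component_` instantiation pattern, while the `optimizer_in_bwd` flag name and the OptimizerInBackward constructor under option B are illustrative assumptions, not necessarily what this PR ships.

    from torchtune import config
    from torchtune.modules.optim import OptimizerInBackward

    # Option A: instantiation-based config (torchtune's usual pattern).
    # Every YAML config would have to swap the optimizer component, e.g.:
    #   optimizer:
    #     _component_: torchtune.modules.optim.OptimizerInBackward
    #     optimizer_cls: torch.optim.AdamW
    #     lr: 2e-5
    #
    # Option B (what the comment describes): keep the normal optimizer config
    # and construct the wrapper inside the recipe behind a flag.
    def setup_optimizer(cfg, model):
        if cfg.get("optimizer_in_bwd", False):
            # Built in the recipe rather than via config.instantiate.
            return OptimizerInBackward(
                params=model.parameters(),
                optimizer_cls=torch.optim.AdamW,  # in practice taken from cfg
                lr=cfg.optimizer.lr,
            )
        return config.instantiate(cfg.optimizer, model.parameters())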

A second review thread, on this part of the diff (quoted context):

    )
    else:
        if is_not_distributed_checkpointer:
            # This check can be removed once we fully migrate over to ``OptimizerInBackward``
Collaborator commented:
Worth raising an issue to follow up on this?

@joecummings (author) replied:
Yeah fair, I might just opt to raise a meta-Issue tracking how we integrate this into the rest of our recipes and any cleanup that needs to happen to deprecate the old material.

@SalmanMohammadi (Collaborator) left a review comment:
Really clean work

@joecummings added the "triage review" label on May 19, 2025
@krammnic (Contributor) commented:
Amazing work! (I've attempted this, but my RFC was rejected, so I'm really glad to see this merged.)

@joecummings removed the "triage review" label on May 19, 2025
@joecummings merged commit 83c8f97 into pytorch:main on May 19, 2025
14 checks passed