# mupify
Call `mupify(model, optimizer, param)` on your model and optimizer. They'll be modified in-place so that the forward/backward passes reflect the chosen parameterization. See `example.ipynb` for a tutorial.
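
A minimal sketch of the intended workflow (the architecture, hyperparameters, and the value of `param` here are illustrative assumptions; see `example.ipynb` and the documentation in `mupify.py` for the real options):

```python
import torch
import torch.nn as nn
from mupify import mupify  # assumes mupify.py is on your path

# A toy MLP using only dense linear layers (supported by mupify).
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# SGD with momentum is fine; adaptive optimizers are not supported.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Modify model and optimizer in-place. The value "mup" for the `param`
# argument is an assumed placeholder, not a documented option.
mupify(model, optimizer, "mup")

# Train as usual: forward/backward now follow the chosen parameterization.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```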
If you use this in a research project, please consider citing https://arxiv.org/abs/2404.19719!
Not intended for use with any of the following:
- Adaptive optimizers. (SGD + momentum and/or weight decay are fine.)
- Linear layers other than dense linear layers or 2d convolutions.
- Attention blocks.
Important notes:
- `nn.ReLU()` layers are mupified to evaluate $\mathrm{max}(0, x\sqrt{2})$ rather than $\mathrm{max}(0, x)$. To avoid this behavior, use `torch.nn.functional.relu` instead; see the sketch after this list.
- The user-facing functions are `mupify(model, optimizer, param)` and `rescale(model, gamma)`. See the documentation in `mupify.py`.
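
To make the `nn.ReLU()` note concrete: since $\sqrt{2} > 0$, $\mathrm{max}(0, x\sqrt{2}) = \sqrt{2}\,\mathrm{max}(0, x)$, so a mupified module ReLU is just a $\sqrt{2}$-scaled ReLU. The `ScaledReLU` below is a stand-in for that behavior (not mupify's actual implementation), and `Block` shows how calling the functional form in `forward` opts out:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a mupified ReLU: max(0, x*sqrt(2)) == sqrt(2)*relu(x).
class ScaledReLU(nn.Module):
    def forward(self, x):
        return F.relu(x * math.sqrt(2.0))

# Calling the functional relu directly (instead of registering an
# nn.ReLU() module) keeps the plain max(0, x) behavior under mupify.
class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.fc = nn.Linear(d, d)

    def forward(self, x):
        return F.relu(self.fc(x))  # not rescaled by mupify

x = torch.randn(4, 8)
assert torch.allclose(ScaledReLU()(x), math.sqrt(2.0) * F.relu(x))
```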