[docs] Memory optims #11385
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags); this should probably be mentioned.
Thanks for the initiative! I left some minor comments, let me know if they make sense.
Use the [`~DiffusionPipeline.reset_device_map`] method to reset the `device_map`. This is necessary if you want to use methods like `.to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.

```py
pipeline.reset_device_map
```
Suggested change:

```py
pipeline.reset_device_map()
```
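For additional context, a minimal sketch of the flow this section describes (the model id is only illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# load a pipeline with a device map
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)

# reset the device map before calling .to() or one of the offloading methods
pipeline.reset_device_map()
pipeline.enable_model_cpu_offload()
```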
Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, and VAE, is placed on the GPU while the other components are held on the CPU. Components like the UNet that run multiple times stay on the GPU until they're completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.
> [!WARNING] |
Do we want to add a warning after `enable_sequential_cpu_offload()` that it's terribly slow and can often be impractical?
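For reference, a rough sketch contrasting the two calls being discussed (the model id is illustrative; the point is the speed/memory tradeoff):

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# model offloading: moves whole components (text encoder, UNet, VAE) on and off the GPU,
# faster but with smaller memory savings
pipeline.enable_model_cpu_offload()

# sequential CPU offloading: moves individual submodules instead, maximizing memory savings
# at the cost of much slower inference (only one of the two should be enabled)
# pipeline.enable_sequential_cpu_offload()
```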
> [!WARNING]
> To properly offload models after they're called, it is required to run the entire pipeline and models in the expected order. Keep this in mind if models are reused outside the pipeline context after hooks have been installed (see [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more details). This is a stateful operation that installs hooks on the model.
This reads a bit incomplete:

> Keep this in mind if models are reused outside the pipeline context after hooks have been installed
Also, not sure if it would make sense to include, but users can still benefit from `pipeline.enable_model_cpu_offload()` when doing stuff like `pipeline.encode_prompt()`. See #11376.
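A rough sketch of that pattern, assuming a Stable Diffusion pipeline (the exact `encode_prompt` signature and return values vary between pipelines, so treat this as illustrative):

```py
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

# only the text encoder is moved onto the GPU for this call; the UNet and VAE stay offloaded
prompt_embeds, negative_prompt_embeds = pipeline.encode_prompt(
    "a photo of an astronaut riding a horse on mars",
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
```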
> [!WARNING]
> Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.
Providing some example models would be helpful here. Cc: @a-r-r-o-w
Not sure I recall any official model implementation in transformers/diffusers off the top of my head. Basically, if you cast inputs by peeking into the device of a particular weight layer in a model, it might fail. I'll try to find/remember an example.
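To illustrate the pattern being described, a purely hypothetical module (not from either library) that peeks at a weight's device to cast its inputs:

```py
import torch
from torch import nn

class WeightDependentCast(nn.Module):
    """Hypothetical example of weight-dependent device casting in a forward pass."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The input is cast to the device of a specific weight. Under group offloading,
        # self.proj.weight may still be on the offload (CPU) device at this point, so this
        # cast can clash with the hook that moves inputs to the onload device.
        hidden_states = hidden_states.to(self.proj.weight.device)
        return self.proj(hidden_states)
```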
```py
apply_group_offloading(pipe.vae, onload_device=onload_device, offload_type="leaf_level")
# Use the apply_group_offloading method for other model components
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
```
`vae` is a subclass of `ModelMixin`, so we should be able to use `enable_group_offload()`.
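Sketching what that suggestion might look like, continuing with the `pipeline` from the quoted snippet and assuming `enable_group_offload` mirrors the `apply_group_offloading` arguments:

```py
import torch

onload_device = torch.device("cuda")

# the VAE is a ModelMixin subclass, so the method form should work here too
pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level")
```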
PyTorch supports `torch.float8_e4m3fn` and `torch.float8_e5m2` as weight storage dtypes, but they can't be used for computation in many different tensor operations due to unimplemented kernel support. However, you can use these dtypes to store model weights in fp8 precision and upcast them on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting.
## FP8 layerwise casting |
Suggested change:

`## Layerwise casting`

As it's not specific to FP8.
Typically, inference on most models is done with `torch.float16` or `torch.bfloat16` weight/computation precision. Layerwise weight-casting cuts down the memory footprint of the model weights by approximately half.
Layerwise casting stores weights in a smaller data format (`torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality. |
Suggested change:

> Layerwise casting stores weights in a smaller data format (for example: `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision, e.g., `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
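For reference, a short sketch of enabling layerwise casting on a single model (the CogVideoX checkpoint is just an example):

```py
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# weights are stored in fp8 and upcast to bf16 on the fly during the forward pass
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```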
<Tip>
Sliced VAE saves memory by processing an image in smaller non-overlapping "slices" instead of processing the entire image at once. This reduces peak memory usage because the GPU is only processing a small slice at a time. |
VAE slicing refers to splitting a big batch of inputs into single-sample batches and processing each one separately. Typically, users generate one image at a time, and in that case this option does not save any memory. If, say, a user generates 4 images at once using multiple prompts or `num_images_per_prompt > 1`, then decoding with the VAE would increase the peak activation memory by roughly 4x. Slicing would make it so you decode 1 image at a time instead of all 4 together.

Here, "an image in smaller non-overlapping 'slices' instead of processing the entire image at once" looks incorrect.
[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU is only processing a tile at a time. Unlike sliced VAE, tiled VAE maintains some context between tiles because they overlap which can generate more coherent images. |
Sliced VAE is simply breaking a larger batch of data into batch_size=1 data and sequentially processing it. The comparison here about sliced VAE vs tiled VAE maintaining context between tiles seems incorrect.
</Tip>
Call [`~StableDiffusionPipeline.enable_vae_tiling`] to enable VAE tiling. The generated image may have some tone variation from tile-to-tile because they're decoded separately, but there shouldn't be any obvious seams between the tiles. Tiling is disabled for images that are 512x512 or smaller. |
"disabled for images that are 512x512 or smaller"
This should be true in most cases but is not always true. Maybe we could write this as "Tiling is disabled for resolutions lower than a pre-specified (but configurable) limit, for example 512x512 for the VAE used by StableDiffusionPipeline" or similar?
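For reference, a sketch of enabling tiling for a larger-resolution generation (reusing `pipeline` from above; the exact threshold is an assumption based on an attribute like `tile_sample_min_size` on `AutoencoderKL`):

```py
pipeline.enable_vae_tiling()

# tiling only kicks in above the VAE's configured minimum tile size
image = pipeline(
    "a photo of an astronaut riding a horse on mars",
    height=1024,
    width=1024,
).images[0]
```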
Refactors the memory optimization docs and combines them with the docs on working with big models (distributed setups).
Let me know if I'm missing anything!