[docs] Memory optims #11385
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags); this should probably be mentioned.
Thanks for the initiative! I left some minor comments, let me know if they make sense.
Use the [`~DiffusionPipeline.reset_device_map`] method to reset the `device_map`. This is necessary if you want to use methods like `.to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.

```py
pipeline.reset_device_map
```
Suggested change:

```py
pipeline.reset_device_map()
```
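For additional context, a minimal sketch of the flow this section describes (the model id is only illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# load a pipeline with a device map
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)

# reset the device map before calling .to() or one of the offloading methods
pipeline.reset_device_map()
pipeline.enable_model_cpu_offload()
```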
Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, and VAE, is placed on the GPU while the other components are held on the CPU. Components like the UNet that run multiple times stay on the GPU until they're completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.
> [!WARNING] |
Do we want to add a warning after `enable_sequential_cpu_offload()` that it's terribly slow and can often be impractical?
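For reference, a rough sketch contrasting the two calls being discussed (the model id is illustrative; the point is the speed/memory tradeoff):

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# model offloading: moves whole components (text encoder, UNet, VAE) on and off the GPU,
# faster but with smaller memory savings
pipeline.enable_model_cpu_offload()

# sequential CPU offloading: moves individual submodules instead, maximizing memory savings
# at the cost of much slower inference (only one of the two should be enabled)
# pipeline.enable_sequential_cpu_offload()
```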
> [!WARNING]
> To properly offload models after they're called, it is required to run the entire pipeline and models in the expected order. Keep this in mind if models are reused outside the pipeline context after hooks have been installed (see [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more details). This is a stateful operation that installs hooks on the model.
This reads a bit incomplete:

> Keep this in mind if models are reused outside the pipeline context after hooks have been installed
Also, not sure if it would make sense to include, but users can still benefit from `pipeline.enable_model_cpu_offload()` when doing stuff like `pipeline.encode_prompt()`. See #11376.
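A rough sketch of that pattern, assuming a Stable Diffusion pipeline (the exact `encode_prompt` signature and return values vary between pipelines, so treat this as illustrative):

```py
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

# only the text encoder is moved onto the GPU for this call; the UNet and VAE stay offloaded
prompt_embeds, negative_prompt_embeds = pipeline.encode_prompt(
    "a photo of an astronaut riding a horse on mars",
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
```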
> [!WARNING]
> Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.
Providing some example models would be helpful here. Cc: @a-r-r-o-w
Not sure I recall any official model implementation in transformers/diffusers off the top of my head. Basically, if you cast inputs by peeking into the device of a particular weight layer in a model, it might fail. I'll try to find/remember an example.
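To illustrate the pattern being described, a purely hypothetical module (not from either library) that peeks at a weight's device to cast its inputs:

```py
import torch
from torch import nn

class WeightDependentCast(nn.Module):
    """Hypothetical example of weight-dependent device casting in a forward pass."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The input is cast to the device of a specific weight. Under group offloading,
        # self.proj.weight may still be on the offload (CPU) device at this point, so this
        # cast can clash with the hook that moves inputs to the onload device.
        hidden_states = hidden_states.to(self.proj.weight.device)
        return self.proj(hidden_states)
```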
```py
apply_group_offloading(pipe.vae, onload_device=onload_device, offload_type="leaf_level")
# Use the apply_group_offloading method for other model components
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
```
`vae` is a subclass of `ModelMixin`, so we should be able to use `enable_group_offload()`.
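Sketching what that suggestion might look like, continuing with the `pipeline` from the quoted snippet and assuming `enable_group_offload` mirrors the `apply_group_offloading` arguments:

```py
import torch

onload_device = torch.device("cuda")

# the VAE is a ModelMixin subclass, so the method form should work here too
pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level")
```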
PyTorch supports `torch.float8_e4m3fn` and `torch.float8_e5m2` as weight storage dtypes, but they can't be used for computation in many different tensor operations due to unimplemented kernel support. However, you can use these dtypes to store model weights in fp8 precision and upcast them on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting.
## FP8 layerwise casting |
Suggested change:

`## Layerwise casting`

As it's not specific to FP8.
Typically, inference on most models is done with `torch.float16` or `torch.bfloat16` weight/computation precision. Layerwise weight-casting cuts down the memory footprint of the model weights by approximately half.
Layerwise casting stores weights in a smaller data format (`torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality. |
Suggested change:

> Layerwise casting stores weights in a smaller data format (for example: `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision, e.g., `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
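For reference, a short sketch of enabling layerwise casting on a single model (the CogVideoX checkpoint is just an example):

```py
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# weights are stored in fp8 and upcast to bf16 on the fly during the forward pass
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```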
<Tip>
Sliced VAE saves memory by processing an image in smaller non-overlapping "slices" instead of processing the entire image at once. This reduces peak memory usage because the GPU is only processing a small slice at a time. |
VAE slicing refers to splitting a big batch of inputs into single-sample batches and processing each one separately. Typically, users generate one image at a time, and in that case this option does not save any memory. If, say, a user generates 4 images at once using multiple prompts or `num_images_per_prompt > 1`, then decoding with the VAE would increase the peak activation memory by roughly 4x. Slicing would make it so you decode 1 image at a time instead of all 4 together.

Here, "an image in smaller non-overlapping 'slices' instead of processing the entire image at once" looks incorrect.
[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU is only processing a tile at a time. Unlike sliced VAE, tiled VAE maintains some context between tiles because they overlap which can generate more coherent images. |
Sliced VAE is simply breaking a larger batch of data into batch_size=1 data and sequentially processing it. The comparison here about sliced VAE vs tiled VAE maintaining context between tiles seems incorrect.
</Tip>
Call [`~StableDiffusionPipeline.enable_vae_tiling`] to enable VAE tiling. The generated image may have some tone variation from tile-to-tile because they're decoded separately, but there shouldn't be any obvious seams between the tiles. Tiling is disabled for images that are 512x512 or smaller. |
"disabled for images that are 512x512 or smaller"
This should be true in most cases but is not always true. Maybe we could write this as "Tiling is disabled for resolutions lower than a pre-specified (but configurable) limit, for example 512x512 for the VAE used by StableDiffusionPipeline" or similar?
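For reference, a sketch of enabling tiling for a larger-resolution generation (reusing `pipeline` from above; the exact threshold is an assumption based on an attribute like `tile_sample_min_size` on `AutoencoderKL`):

```py
pipeline.enable_vae_tiling()

# tiling only kicks in above the VAE's configured minimum tile size
image = pipeline(
    "a photo of an astronaut riding a horse on mars",
    height=1024,
    width=1024,
).images[0]
```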
Refactors the memory optimization docs and combines them with the docs on working with big models (distributed setups).
Let me know if I'm missing anything!