
Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more

Released by @sayakpaul on 24 Jun 15:13

📹 New video generation pipelines

Wan VACE

Wan VACE supports various generation techniques for controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of its capabilities include:

  • Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
  • Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
  • Inpainting and Outpainting
  • Subject to Video (faces, objects, characters, etc.)
  • Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets in this pull request demonstrate how videos can be generated with different control signals.

Check out the docs to learn more.
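
For a quick text-to-video start, here is a minimal sketch; the checkpoint id (the 1.3B VACE variant), resolution, frame count, and step settings are assumptions, and the control-signal workflows are covered in the PR and docs linked above.

import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.utils import export_to_video

# Checkpoint id is an assumption; pick the 1.3B or 14B VACE variant you want.
model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

output = pipe(
    prompt="A sleek cat lounging on a windowsill at golden hour",
    negative_prompt="blurry, low quality",
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=30,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "wan_vace.mp4", fps=16)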

Cosmos Predict2 Video2World

Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.

The Video2World model comes in a 2B and 14B variant. Check out the docs to learn more.

LTX 0.9.7 and Distilled

LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.

Check out the docs to learn more.

Hunyuan Video Framepack and F1

Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.

FusionX

The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():

transformer = AutoModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)

AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.
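
As a rough sketch with the Wan2.1 T2V 14B model (the LoRA repo id, weight name, step count, and guidance value below are illustrative; swap in the extracted AccVideo or CausVid LoRA you want to use):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Illustrative repo id and weight name for an extracted CausVid LoRA; adjust to the LoRA you use.
pipe.load_lora_weights(
    "Kijai/WanVideo_comfy", weight_name="Wan21_CausVid_14B_T2V_lora_rank32.safetensors"
)
# Distilled LoRAs typically allow far fewer inference steps and low guidance.
video = pipe(
    prompt="A koala bear playing the piano in a cozy living room",
    num_inference_steps=8,
    guidance_scale=1.0,
).frames[0]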

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Text-to-image models from the Cosmos-Predict2 release. The model comes in 2B and 14B variants. Check out the docs to learn more.

Chroma

Chroma is an 8.9B-parameter model based on FLUX.1-schnell. It’s fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.

Thanks to @Ednaordinary for contributing it in this PR!

VisualCloze

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is a universal image generation framework built on visual in-context learning that offers key capabilities:

  1. Support for various in-domain tasks
  2. Generalization to unseen tasks through in-context learning
  3. Unification of multiple tasks into one step, generating both the target image and intermediate results
  4. Support for reverse-engineering conditions from target images

Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!

Better torch.compile support

We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test widely used models such as Flux for recompilation and graph-break issues, which can get in the way of fully realizing the benefits of torch.compile(). Refer to the following links to learn more:

Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:

Code
import torch
from diffusers import DiffusionPipeline
# Raise the recompilation cache limit; offloading can trigger extra recompilations.
torch._dynamo.config.cache_size_limit = 10000

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

This is compatible with group offloading, too. Interested readers can check out the relevant PRs below:

You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:

Code
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel

import torch
torch._dynamo.config.recompile_limit = 1000

# Compute dtype used for both quantization and model loading.
torch_dtype = torch.bfloat16
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)

ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_quant_config,
    torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=dit_quant_config,
    torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

image = pipe(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]

Starting from bitsandbytes==0.46.0 onwards, bnb-quantized models should be fully compatible with torch.compile() without graph-breaks. This means that when compiling a bnb-quantized model, users can do: model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.

[Figure: Flux.1-Dev speed comparison with bnb 4-bit quantization and torch.compile()]

Note that for 4-bit bnb models, you currently need to install a PyTorch nightly build if fullgraph=True is specified during compilation.
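
A minimal sketch of what that looks like for the Flux transformer (the checkpoint and dtype mirror the example above; with 4-bit models this additionally assumes a PyTorch nightly as noted):

import torch
from diffusers import AutoModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
# With bitsandbytes>=0.46.0, the bnb-quantized model can be compiled without graph breaks.
transformer.compile(fullgraph=True)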

Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.

PipelineQuantizationConfig

Users can now provide a quantization config while initializing a pipeline:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

This lowers the barrier to entry for users who want to use quantization without having to write much code. Refer to the documentation to learn more about the different configurations allowed through PipelineQuantizationConfig.

Group offloading with disk

In the previous release, we shipped “group offloading” which lets you offload blocks/nodes within a model, optimizing its memory consumption. It also lets you overlap this offloading with computation, providing a good speed-memory trade-off, especially in low VRAM environments.

However, you still need a considerable amount of system RAM for offloading to work effectively, so environments with both low VRAM and low RAM were still left out.

Starting with this release, users additionally have the option to offload to disk instead of RAM, further lowering memory consumption. Set offload_to_disk_path to enable this feature:

pipeline.transformer.enable_group_offload(
    onload_device="cuda", 
    offload_device="cpu", 
    offload_type="leaf_level", 
    offload_to_disk_path="path/to/disk"
)

Refer to these two tables to compare the speed and memory trade-offs.

LoRA metadata parsing

It is beneficial to include the LoraConfig that was used to train a LoRA in its state dict. In its absence, users are restricted to using a LoRA alpha equal to the LoRA rank. We have modified the most popular training scripts to allow passing a custom lora_alpha through the CLI. Refer to this thread for more updates and to this comment for some extended clarifications.
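
For example, with the Flux DreamBooth LoRA script, the alpha can now be decoupled from the rank on the command line (the dataset, prompt, and other flags below are illustrative):

accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="dog" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="flux-lora" \
  --rank=16 \
  --lora_alpha=32 \
  --mixed_precision="bf16"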

New training scripts

  • We now have a capable training script for training robust timestep-distilled models through the SANA Sprint framework. Check out this resource for more details. Thanks to @scxue and @lawrence-cj for contributing it in this PR.
  • HiDream LoRA DreamBooth training script (docs). The script supports training with quantization. HiDream is an MIT-licensed model. So, make it yours with this training script.

Updates on educational materials on quantization

We have worked on a two-part series discussing quantization support in Diffusers. Check them out:

All commits

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @yao-matrix
    • fix test_vanilla_funetuning failure on XPU and A100 (#11263)
    • make test_stable_diffusion_inpaint_fp16 pass on XPU (#11264)
    • make test_dict_tuple_outputs_equivalent pass on XPU (#11265)
    • make test_instant_style_multiple_masks pass on XPU (#11266)
    • make KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference pass on XPU (#11308)
    • make test_stable_diffusion_karras_sigmas pass on XPU (#11310)
    • fix CPU offloading related fail cases on XPU (#11288)
    • enable 2 test cases on XPU (#11332)
    • enable group_offload cases and quanto cases on XPU (#11405)
    • enable test_layerwise_casting_memory cases on XPU (#11406)
    • enable 28 GGUF test cases on XPU (#11404)
    • enable marigold_intrinsics cases on XPU (#11445)
    • enable consistency test cases on XPU, all passed (#11446)
    • enable unidiffuser test cases on xpu (#11444)
    • make safe diffusion test cases pass on XPU and A100 (#11458)
    • make autoencoders. controlnet_flux and wan_transformer3d_single_file pass on xpu (#11461)
    • enable semantic diffusion and stable diffusion panorama cases on XPU (#11459)
    • enable lora cases on XPU (#11506)
    • enable 7 cases on XPU (#11503)
    • enable dit integration cases on xpu (#11523)
    • enable print_env on xpu (#11507)
    • enable several pipeline integration tests on XPU (#11526)
    • enhance value guard of _device_agnostic_dispatch (#11553)
    • enable pipeline test cases on xpu (#11527)
    • enable group_offloading and PipelineDeviceAndDtypeStabilityTests on XPU, all passed (#11620)
    • enable torchao test cases on XPU and switch to device agnostic APIs for test cases (#11654)
    • enable cpu offloading of new pipelines on XPU & use device agnostic empty to make pipelines work on XPU (#11671)
  • @hlky
    • Fix LTX 0.9.5 single file (#11271)
    • HiDream Image (#11231)
    • Use float32 on mps or npu in transformer_hidream_image's rope (#11316)
    • Fix vae.Decoder prev_output_channel (#11280)
  • @quickjkee
    • flow matching lcm scheduler (#11170)
  • @ishan-modi
    • [ControlNet] Adds controlnet for SanaTransformer (#11040)
    • [BUG] fixed _toctree.yml alphabetical ordering (#11277)
    • [BUG] fixes in kadinsky pipeline (#11080)
    • [Refactor] Minor Improvement for import utils (#11161)
    • [Feature] Added Xlab Controlnet support (#11249)
    • [BUG] fixed WAN docstring (#11226)
    • [Feature] AutoModel can load components using model_index.json (#11401)
  • @linoytsaban
    • [HiDream] code example (#11317)
    • [Flux LoRAs] fix lr scheduler bug in distributed scenarios (#11242)
    • [LoRA] add LoRA support to HiDream and fine-tuning script (#11281)
    • [HiDream LoRA] optimizations + small updates (#11381)
    • [Hi-Dream LoRA] fix bug in validation (#11439)
    • [LoRA] make lora alpha and dropout configurable (#11467)
    • [LoRA] small change to support Hunyuan LoRA Loading for FramePack (#11546)
    • [LoRA] support non-diffusers LTX-Video loras (#11572)
    • [LoRA] kijai wan lora support for I2V (#11588)
    • [training docs] smol update to README files (#11616)
    • [Sana Sprint] add image-to-image pipeline (#11602)
    • [LoRA training] update metadata use for lora alpha + README (#11723)
  • @hameerabbasi
    • [LoRA] Add LoRA support to AuraFlow (#10216)
  • @DN6
    • Fix Hunyuan I2V for transformers>4.47.1 (#11293)
    • Hunyuan I2V fast tests fix (#11341)
    • [Single File] GGUF/Single File Support for HiDream (#11550)
    • [Single File] Fix loading for LTX 0.9.7 transformer (#11578)
    • Type annotation fix (#11597)
    • Fix mixed variant downloading (#11611)
    • [CI] Some improvements to Nightly reports summaries (#11166)
    • Introduce DeprecatedPipelineMixin to simplify pipeline deprecation process (#11596)
    • Chroma Follow Up (#11725)
    • [CI] Fix WAN VACE tests (#11757)
    • [CI] Fix SANA tests (#11756)
    • Fix HiDream pipeline test module (#11754)
    • Update Chroma Docs (#11753)
    • Fix failing cpu offload test for LTX Latent Upscale (#11755)
    • [CI] Skip ONNX Upscale tests (#11774)
  • @yiyixuxu
    • [Hi Dream] follow-up (#11296)
    • support Wan-FLF2V (#11353)
    • update output for Hidream transformer (#11366)
    • [Wan2.1-FLF2V] update conversion script (#11365)
    • [HiDream] move deprecation to 0.35.0 (#11384)
    • clean up the Init for stable_diffusion (#11500)
    • [lora] only remove hooks that we add back (#11768)
  • @Teriks
    • Kolors additional pipelines, community contrib (#11372)
  • @co63oc
    • Fix typos in strings and comments (#11407)
    • Fix typos in docs and comments (#11416)
    • Fix typos in strings and comments (#11476)
  • @xduzhangjiayu
    • Add StableDiffusion3InstructPix2PixPipeline (#11378)
  • @scxue
    • Add cross attention type for Sana-Sprint training in diffusers. (#11514)
  • @lzyhha
  • @b-sai
    • RegionalPrompting: Inherit from Stable Diffusion (#11525)
  • @Ednaordinary