[WIP] Pipeline level quant config #1
base: main
Conversation
Added a few comments!
@SunMarc WDYT about the latest updates?
Thanks for iterating! Much better. I think we can open a PR on the diffusers library now, WDYT?
@SunMarc could you do another pass, please? Just want to ensure I got the basics right. Maybe there's a lightweight test you wanna run?

from diffusers.quantizers import PipelineQuantizationConfig
from diffusers import DiffusionPipeline
import argparse
import torch


def get_global_config():
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
        modules_to_quantize=["transformer", "text_encoder_2"],
    )
    return quant_config


def get_granular_config(use_quanto=False):
    from diffusers import BitsAndBytesConfig as DiffBitsAndBytesConfig, QuantoConfig
    from transformers import BitsAndBytesConfig as TranBitsAndBytesConfig

    transformer_config = (
        QuantoConfig(weights_dtype="float8")
        if use_quanto
        else DiffBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
        )
    )
    quant_config = {
        "transformer": transformer_config,
        "text_encoder_2": TranBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
    return quant_config


def load_pipeline(quant_config):
    pipe = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", quantization_config=quant_config, torch_dtype=torch.bfloat16
    ).to("cuda")
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_global_config", action="store_true")
    parser.add_argument("--use_quanto", action="store_true")
    args = parser.parse_args()

    quant_config = get_global_config() if args.use_global_config else get_granular_config(args.use_quanto)
    pipe = load_pipeline(quant_config)

    pipe_kwargs = {
        "prompt": "A cat holding a sign that says hello world",
        "height": 1024,
        "width": 1024,
        "guidance_scale": 3.5,
        "num_inference_steps": 50,
        "max_sequence_length": 512,
    }
    image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
    image.save(f"quant_global@{args.use_global_config}_quanto@{args.use_quanto}.png")
LGTM overall, just a few nits
self.quant_backend = quant_backend
# Initialize kwargs to be {} to set to the defaults.
self.quant_kwargs = quant_kwargs or {}
self.modules_to_quantize = modules_to_quantize
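For context, here is a rough reconstruction of the __init__ these lines live in; only the assignments above are from the PR, the signature and defaults are my assumptions.

from typing import Dict, List, Optional


class PipelineQuantizationConfig:
    # Hypothetical sketch of the surrounding class, for reading the diff in context.
    def __init__(
        self,
        quant_backend: Optional[str] = None,
        quant_kwargs: Optional[Dict] = None,
        modules_to_quantize: Optional[List[str]] = None,
    ):
        self.quant_backend = quant_backend
        # Initialize kwargs to be {} to set to the defaults.
        self.quant_kwargs = quant_kwargs or {}
        self.modules_to_quantize = modules_to_quantize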
One thing I'm asking myself is that we should try to provide the best experience to the user. Is it truly the best default to quantize every component in the pipeline, or should we maybe pre-define which components get quantized? WDYT?
I'm asking because we would potentially have to add modules_to_quantize in every snippet, as you did in your example. The average user also might not know the name of each component, nor which components should or shouldn't be quantized.
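To make the tradeoff concrete, a rough sketch of the two call styles being discussed; the fallback behavior when modules_to_quantize is omitted is exactly the open question, not something that exists yet.

from diffusers.quantizers import PipelineQuantizationConfig

# Explicit: the user has to know the component names for every pipeline.
explicit_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True},
    modules_to_quantize=["transformer", "text_encoder_2"],
)

# Implicit: modules_to_quantize omitted. Whether this should mean
# "quantize every component" or "use a per-pipeline predefined list"
# is the question raised here.
implicit_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True},
)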
> maybe we should pre-define which components should be quantized. WDYT?

We cannot determine that, I guess. For example, users may want to quantize CLIP, but that is not typical.
Users can quantize the VAE, but that is only supported by backends that can quantize Conv2D (or Conv3D) layers. So I am not sure what the best course of action is here.
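For illustration, this is the kind of granular config I mean; the VAE entry only makes sense under the assumption that the chosen backend can actually quantize Conv layers, which is precisely the caveat.

import torch
from diffusers import BitsAndBytesConfig, QuantoConfig

# Hypothetical granular config that also opts the VAE in. Whether the VAE
# entry works depends on the backend supporting Conv2D/Conv3D quantization.
quant_config = {
    "transformer": BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    "vae": QuantoConfig(weights_dtype="float8"),
}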
Or maybe we could discuss this in the main PR we'd open to diffusers?
> We cannot determine that, I guess. For example, users may want to quantize CLIP, but that is not typical.

We'd still let them do that if they want, but I just want to provide a good default. For example, we could put ["transformer"] by default for some pipelines.
You mean in the docs? Or default to this value in the PipelineQuantizationConfig class itself?
Specific pipeline doc pages, or what are you referring to?
I guess for the docs we will have a guide in the general overview section of Quantization in diffusers: https://huggingface.co/docs/diffusers/main/en/quantization/overview. There we can make a general recommendation.
Or are you referring to something entirely different?
I was thinking about potentially adding a variable to the FluxPipeline class, for example.
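Something along these lines, purely as a sketch; the attribute name and the fallback mechanism are made up, nothing like this exists in diffusers yet.

from diffusers import DiffusionPipeline


class FluxPipeline(DiffusionPipeline):
    # Hypothetical class-level default: components of this pipeline that
    # usually benefit from quantization. PipelineQuantizationConfig could
    # fall back to this list when modules_to_quantize is not passed.
    _default_modules_to_quantize = ["transformer", "text_encoder_2"]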
Sorry, I am not following this at all.
What would the rough changes look like in that case? In my mind, users would just want to supply a pipeline-level quant config when calling DiffusionPipeline.from_pretrained() and be done with it.
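I.e., essentially a condensed version of the script above, using the same model ID and component names as in that example.

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# One pipeline-level config passed to from_pretrained(); the pipeline applies
# it to the listed components and loads everything else normally.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    modules_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", quantization_config=quant_config, torch_dtype=torch.bfloat16
)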
If we need to add a variable like modules_to_quantize to the pipeline classes, that is a bigger discussion we should have with the rest of the team. But IMO we are likely ready to take this PR and show how the changes for allowing pipeline-level quant configs would look.
WDYT?
Yeah, let's open the PR on the diffusers repo.
Co-authored-by: SunMarc <[email protected]>

Branch commits: condition better; support mapping; improvements; improve more; add placeholders for docstrings; formatting; smol fix; solidify validation and annotation.

(The remaining timeline entries are commit messages pulled in by merging upstream diffusers main — Quanto backend, SANA single-file loading, LoRA no-op warnings, CogView4, the AnyText research project, TorchAO loading with torch>=2.6, Wan and Flux fixes, hybrid inference VAE encode, group offloading improvements, and many other unrelated changes.)
Force-pushed from 067b2f6 to 316ff46.
This reverts commit 316ff46.
Just a PoC to see things.