[single file] Cosmos #11801


Open · wants to merge 5 commits into main


Conversation

a-r-r-o-w (Member) commented Jun 24, 2025

Possibly fixes #11798

We can run inference with the 7B Text-to-World model with the following code:

import torch
from diffusers import CosmosTextToWorldPipeline, CosmosTransformer3DModel
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
transformer_single_file = "https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World/blob/main/model.pt"

transformer = CosmosTransformer3DModel.from_single_file(transformer_single_file, torch_dtype=torch.bfloat16).to("cuda")
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)

@DN6 I'm not sure I remember how to support different versions of the same model. With the current implementation, loading the 14B model would fail with a weight shape mismatch, most likely due to config-related issues. Could you share some insights?

For the Cosmos 1.0 text-to-world and video-to-world models (7B and 14B), I'll have to make a cosmos-1.0 entry, and another entry, cosmos-2.0, for the Cosmos Predict2 models. But what's the normal process for models of the same family with different parameter sizes?

a-r-r-o-w requested a review from DN6 on June 24, 2025 20:25
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Vargol commented Jun 26, 2025

While I'm not an expert on the diffusers code base, as far as I can see, the different parameter counts are simply treated as different model types, based on Wan, which also ships in multiple sizes. For example, in src/diffusers/loaders/single_file_utils.py:

        if checkpoint[target_key].shape[0] == 1536:
            model_type = "wan-t2v-1.3B"
        elif checkpoint[target_key].shape[0] == 5120 and checkpoint[target_key].shape[1] == 16:
            model_type = "wan-t2v-14B"
        else:
            model_type = "wan-i2v-14B"

DN6 (Collaborator) commented Jun 27, 2025

@a-r-r-o-w I think you can just run a shape check on the params to determine which config to use. That should be sufficient to differentiate the variants, I think?
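For reference, a minimal sketch of what such a shape check could look like, modeled on the Wan snippet above. The checkpoint key and the hidden sizes below are illustrative assumptions, not the actual Cosmos checkpoint layout:

```python
def infer_cosmos_model_type(shapes: dict) -> str:
    # `shapes` maps checkpoint keys to shape tuples. Both the key name
    # and the hidden sizes (4096 for 7B, 5120 for 14B) are hypothetical,
    # chosen only to illustrate the shape-check approach.
    hidden_dim = shapes["net.blocks.0.attn.to_q.weight"][0]
    if hidden_dim == 4096:
        return "cosmos-1.0-t2w-7B"
    return "cosmos-1.0-t2w-14B"

# A fake 7B-shaped checkpoint resolves to the 7B config.
print(infer_cosmos_model_type({"net.blocks.0.attn.to_q.weight": (4096, 4096)}))
```

The same dispatch would then pick the matching entry in the config mapping, just as the Wan branch does.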

a-r-r-o-w marked this pull request as ready for review on June 27, 2025 20:04
a-r-r-o-w (Member, Author)

@Vargol Could you verify if the latest changes work for you?

Vargol commented Jun 27, 2025

The Cosmos 2B single file at https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/resolve/main/model.pt loaded, ran successfully, and generated the expected image.

I tried a GGUF file for the 14B version and that didn't work. I'm not sure if that was in scope, though. If it was, the error is:

$ python cosmos_gguf_prmpts.py 
Multiple distributions found for package optimum. Picked distribution: optimum-quanto
WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work
W0627 23:30:48.574000 85696 lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
The config attributes {'input_types': ['text'], 'model_size': '14b'} were passed to CosmosTransformer3DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/cosmos_gguf_prmpts.py", line 12, in <module>
    transformer = CosmosTransformer3DModel.from_single_file(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/loaders/single_file_model.py", line 420, in from_single_file
    load_model_dict_into_meta(
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/models/model_loading_utils.py", line 285, in load_model_dict_into_meta
    hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name], param)
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/quantizers/gguf/gguf_quantizer.py", line 84, in check_quantized_param_shape
    raise ValueError(
ValueError: patch_embed.proj.weight has an expected quantized shape of: (5120, 68), but received shape: torch.Size([5120, 136])
$ 
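One thing worth noting about that traceback: the received width (136) is exactly twice the expected one (68), which looks more like the 14B GGUF being matched against the wrong config variant than a quantization bug. That interpretation is a guess on my part; the arithmetic itself is just:

```python
def shape_mismatch_ratio(expected, received):
    # Per-axis ratio of the received quantized shape to the expected one.
    return tuple(r / e for e, r in zip(expected, received))

# Shapes taken directly from the ValueError above.
print(shape_mismatch_ratio((5120, 68), (5120, 136)))  # (1.0, 2.0)
```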


Successfully merging this pull request may close these issues.

Single File and GGUF support of Cosmos-Predict2
4 participants