🤗 HuggingFace Space | 🤖 ModelScope Space | 🛠️ ZhipuAI MaaS (Faster)
👋 WeChat Community | 📚 CogView3 Paper
- 🔥🔥 2025/03/04: We've adapted and open-sourced the diffusers version of the CogView4 model, which has 6B parameters, supports native Chinese input, and supports Chinese text-to-image generation. You can try it online.
- 2024/10/13: We've adapted and open-sourced the diffusers version of the CogView-3Plus-3B model. You can try it online.
- 2024/09/29: We've open-sourced CogView3 and CogView-3Plus-3B. CogView3 is a text-to-image system based on cascaded diffusion, using a relay-diffusion framework. CogView-3Plus is a series of newly developed text-to-image models based on the Diffusion Transformer.
- Diffusers workflow adaptation
- Cog series fine-tuning kits (coming soon)
- ControlNet models and training code
Model Name | CogView4 | CogView3-Plus-3B |
---|---|---|
Resolution | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 |
Inference Precision | BF16, FP32 only | BF16, FP32 only |
Encoder | GLM-4-9B | T5-XXL |
Prompt Language | Chinese, English | English |
Prompt Length Limit | 1024 tokens | 224 tokens |
Download Links | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel |
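The resolution constraints above are easy to check programmatically. Below is a minimal sketch (the helper `is_valid_resolution` is our own illustration, not part of this repo) that validates a width/height pair before calling the pipeline:

```python
def is_valid_resolution(height: int, width: int) -> bool:
    """Check (height, width) against the CogView4 constraints:
    512 <= H, W <= 2048, H * W <= 2**21, and both divisible by 32."""
    return (
        512 <= height <= 2048
        and 512 <= width <= 2048
        and height * width <= 2**21
        and height % 32 == 0
        and width % 32 == 0
    )

print(is_valid_resolution(1024, 1024))  # True
print(is_valid_resolution(2048, 1024))  # True: 2048 * 1024 == 2**21 exactly
print(is_valid_resolution(1000, 1000))  # False: not multiples of 32
```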
The DiT models are tested with BF16 precision and a batch size of 4, with results shown in the table below:
Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON + Text Encoder 4-bit |
---|---|---|---|
512 × 512 | 33GB | 20GB | 13GB |
1280 × 720 | 35GB | 20GB | 13GB |
1024 × 1024 | 35GB | 20GB | 13GB |
1920 × 1280 | 39GB | 20GB | 14GB |
2048 × 2048 | 43GB | 21GB | 14GB |
Additionally, we recommend that your device has at least 32GB of RAM to prevent the process from being killed.
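If you want to reproduce these measurements on your own hardware, PyTorch's peak-memory counters are enough. The harness below is our own sketch, not a script from this repo:

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # corresponds to the "ON" column above

torch.cuda.reset_peak_memory_stats()
pipe(prompt="A red sports car", height=1024, width=1024, num_images_per_prompt=4)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```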
We've evaluated CogView4-6B on multiple benchmarks and achieved the following scores:
DPG-Bench:

Model | Overall | Global | Entity | Attribute | Relation | Other |
---|---|---|---|---|---|---|
SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |
GenEval:

Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color attribution |
---|---|---|---|---|---|---|---|
SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |
T2I-CompBench:

Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial Clip | Complex 3-in-1 |
---|---|---|---|---|---|---|---|---|
SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |
Chinese text rendering (compared with Kolors):

Model | Precision | Recall | F1 Score | Pick@4 |
---|---|---|---|---|
Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
Although the CogView4 series models are trained on lengthy synthetic image descriptions, we strongly recommend rewriting your prompt with a large language model before text-to-image generation; this greatly improves generation quality.
We provide an example script for refining prompts. Note that CogView4 and CogView3 use different few-shot examples for prompt optimization, so be sure to pass the matching --cogview_version:
```shell
cd inference
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus" --cogview_version "cogview4"
```
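If you'd rather call the rewriting step from your own code, the base_url above is an OpenAI-compatible endpoint, so a sketch like the following should work (the system prompt here is our own illustration; the script uses model-specific few-shot examples instead):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key="Zhipu AI API Key",
    base_url="https://open.bigmodel.cn/api/paas/v4",
)

response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
        {"role": "system", "content": "Rewrite the user's input into a long, detailed English image description for a text-to-image model."},
        {"role": "user", "content": "a red sports car by the sea"},
    ],
)
print(response.choices[0].message.content)
```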
Run the model with BF16 precision:

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Either move the whole pipeline to the GPU:
# pipe.to("cuda")
# ...or enable the following to reduce GPU memory usage
# (enable_model_cpu_offload manages device placement itself,
# so don't combine it with .to("cuda")):
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,        # classifier-free guidance strength
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```
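For reproducible results, you can pass a seeded generator; `generator` is a standard diffusers pipeline argument:

```python
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt=prompt, width=1024, height=1024, generator=generator).images[0]
```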
For more inference code, please check:

- For using BNB int4 to load the text encoder, with fully annotated inference code, check here.
- For using TorchAO int8 or int4 to load the text encoder & transformer, with fully annotated inference code, check here.
- For setting up a gradio GUI demo, check here.
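As a rough illustration of the BNB int4 option (a sketch under our assumptions, not the annotated script linked above; in particular, we assume the GLM text encoder loads via transformers' AutoModel with a subfolder argument):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from diffusers import CogView4Pipeline

# Assumption: load the GLM-4-9B text encoder in 4-bit to cut VRAM
# (see the "Text Encoder 4-bit" column in the memory table above).
text_encoder = AutoModel.from_pretrained(
    "THUDM/CogView4-6B",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```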
```shell
git clone https://github.com/THUDM/CogView4
cd CogView4
git clone https://huggingface.co/THUDM/CogView4-6B
pip install -r inference/requirements.txt
```
```shell
# 12GB VRAM
MODE=1 python inference/gradio_web_demo.py

# 24GB VRAM, 32GB RAM
MODE=2 python inference/gradio_web_demo.py

# 24GB VRAM, 64GB RAM
MODE=3 python inference/gradio_web_demo.py

# 48GB VRAM, 64GB RAM
MODE=4 python inference/gradio_web_demo.py
```
The code in this repository and the CogView3 models are licensed under Apache 2.0.
We welcome and appreciate your code contributions. You can view the contribution guidelines here.