🤗 HuggingFace Space | 🤖 ModelScope Space | 🛠️ ZhipuAI MaaS (Faster)
👋 WeChat Community | 📚 CogView3 Paper
- 🔥🔥 2025/03/04: We've adapted and open-sourced the diffusers version of the CogView4 model, which has 6B parameters, supports native Chinese input, and supports Chinese text-to-image generation. You can try it online.
- 2024/10/13: We've adapted and open-sourced the diffusers version of the CogView-3Plus-3B model. You can try it online.
- 2024/09/29: We've open-sourced CogView3 and CogView-3Plus-3B. CogView3 is a text-to-image system based on cascaded diffusion, using a relay-diffusion framework. CogView-3Plus is a series of newly developed text-to-image models based on the Diffusion Transformer.
- Diffusers workflow adaptation
- Cog series fine-tuning kits (coming soon)
- ControlNet models and training code
Model Name | CogView4 | CogView3-Plus-3B |
---|---|---|
Resolution | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 |
Inference Precision | BF16, FP32 only | BF16, FP32 only |
Encoder | GLM-4-9B | T5-XXL |
Prompt Language | Chinese, English | English |
Prompt Length Limit | 1024 tokens | 224 tokens |
Download Links | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel |
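The resolution constraints above are easy to check programmatically. Below is a minimal sketch (the helper `is_valid_resolution` is our own illustration, not part of this repo) that validates a width/height pair before calling the pipeline:

```python
def is_valid_resolution(height: int, width: int) -> bool:
    """Check (height, width) against the CogView4 constraints:
    512 <= H, W <= 2048, H * W <= 2**21, and both divisible by 32."""
    return (
        512 <= height <= 2048
        and 512 <= width <= 2048
        and height * width <= 2**21
        and height % 32 == 0
        and width % 32 == 0
    )

print(is_valid_resolution(1024, 1024))  # True
print(is_valid_resolution(2048, 1024))  # True: 2048 * 1024 == 2**21 exactly
print(is_valid_resolution(1000, 1000))  # False: not multiples of 32
```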
The DiT models are tested with BF16 precision and a batch size of 4, with results shown in the table below:
Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON + Text Encoder 4-bit |
---|---|---|---|
512 × 512 | 33GB | 20GB | 13GB |
1280 × 720 | 35GB | 20GB | 13GB |
1024 × 1024 | 35GB | 20GB | 13GB |
1920 × 1280 | 39GB | 20GB | 14GB |
2048 × 2048 | 43GB | 21GB | 14GB |
Additionally, we recommend that your device has at least 32GB of RAM to prevent the process from being killed.
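If you want to reproduce these measurements on your own hardware, PyTorch's peak-memory counters are enough. The harness below is our own sketch, not a script from this repo:

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # corresponds to the "ON" column above

torch.cuda.reset_peak_memory_stats()
pipe(prompt="A red sports car", height=1024, width=1024, num_images_per_prompt=4)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```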
We've evaluated CogView4-6B on multiple benchmarks and achieved the following scores:
DPG-Bench:

Model | Overall | Global | Entity | Attribute | Relation | Other |
---|---|---|---|---|---|---|
SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |
GenEval:

Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color attribution |
---|---|---|---|---|---|---|---|
SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |
T2I-CompBench:

Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial Clip | Complex 3-in-1 |
---|---|---|---|---|---|---|---|---|
SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |
Chinese text rendering (compared with Kolors):

Model | Precision | Recall | F1 Score | Pick@4 |
---|---|---|---|---|
Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
Although the CogView4 series models are trained on lengthy synthetic image descriptions, we strongly recommend rewriting your prompt with a large language model before text-to-image generation; this greatly improves generation quality.
We provide an example script for refining prompts. Note that CogView4 and CogView3 use different few-shot examples for prompt optimization, so be sure to pass the matching --cogview_version:
```shell
cd inference
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus" --cogview_version "cogview4"
```
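If you'd rather call the rewriting step from your own code, the base_url above is an OpenAI-compatible endpoint, so a sketch like the following should work (the system prompt here is our own illustration; the script uses model-specific few-shot examples instead):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key="Zhipu AI API Key",
    base_url="https://open.bigmodel.cn/api/paas/v4",
)

response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
        {"role": "system", "content": "Rewrite the user's input into a long, detailed English image description for a text-to-image model."},
        {"role": "user", "content": "a red sports car by the sea"},
    ],
)
print(response.choices[0].message.content)
```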
Run the model with BF16 precision:

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Either move the whole pipeline to the GPU:
# pipe.to("cuda")
# ...or enable the following to reduce GPU memory usage
# (enable_model_cpu_offload manages device placement itself,
# so don't combine it with .to("cuda")):
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,        # classifier-free guidance strength
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```
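For reproducible results, you can pass a seeded generator; `generator` is a standard diffusers pipeline argument:

```python
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt=prompt, width=1024, height=1024, generator=generator).images[0]
```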
For more inference code, please check:

- For using BNB int4 to load the text encoder, with fully annotated inference code, check here.
- For using TorchAO int8 or int4 to load the text encoder & transformer, with fully annotated inference code, check here.
- For setting up a gradio GUI demo, check here.
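As a rough illustration of the BNB int4 option (a sketch under our assumptions, not the annotated script linked above; in particular, we assume the GLM text encoder loads via transformers' AutoModel with a subfolder argument):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from diffusers import CogView4Pipeline

# Assumption: load the GLM-4-9B text encoder in 4-bit to cut VRAM
# (see the "Text Encoder 4-bit" column in the memory table above).
text_encoder = AutoModel.from_pretrained(
    "THUDM/CogView4-6B",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```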
```shell
git clone https://github.com/THUDM/CogView4
cd CogView4
git clone https://huggingface.co/THUDM/CogView4-6B
pip install -r inference/requirements.txt
```
```shell
# 12GB VRAM
MODE=1 python inference/gradio_web_demo.py

# 24GB VRAM, 32GB RAM
MODE=2 python inference/gradio_web_demo.py

# 24GB VRAM, 64GB RAM
MODE=3 python inference/gradio_web_demo.py

# 48GB VRAM, 64GB RAM
MODE=4 python inference/gradio_web_demo.py
```
The code in this repository and the CogView3 models are licensed under Apache 2.0.
We welcome and appreciate your code contributions. You can view the contribution guidelines here.