CogView4 & CogView3 & CogView-3Plus


🤗 HuggingFace Space 🤖 ModelScope Space 🛠️ Zhipu MaaS Platform (faster)
👋 WeChat Community 📚 CogView3 Paper

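Project Updates

🔥🔥 2025/03/04: We have adapted and open-sourced the diffusers version of the CogView-4 model, which has 6B parameters, natively supports Chinese input, and can render Chinese text in images. You can try it online.
2024/10/13: We have adapted and open-sourced the diffusers version of the CogView-3Plus-3B model. You can try it online.
2024/9/29: We have open-sourced CogView3 and CogView-3Plus-3B. CogView3 is a text-to-image system based on cascaded diffusion that uses a relay diffusion framework. CogView-3Plus is a series of newly developed text-to-image models based on Diffusion Transformer.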

Project Plan

diffusers workflow adaptation
Cog-series fine-tuning kits (coming soon)
ControlNet models and training code

Model Introduction

Model Comparison

Model Name	CogView4	CogView3-Plus-3B
Resolution	512 <= H, W <= 2048; H * W <= 2^{21}; H, W mod 32 = 0
Inference Precision	BF16 and FP32 only
Encoder	GLM-4-9B	T5-XXL
Prompt Language	Chinese, English	English
Prompt Length Limit	1024 Tokens	224 Tokens
Download Links	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel
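The resolution row packs three constraints into one cell, so a small checker can make them easier to apply before calling the pipeline. The following is a minimal sketch that encodes the constraints exactly as written in the table; the function and constant names are illustrative and not part of the repository.

```python
# Constraints copied from the resolution row of the comparison table.
MIN_SIDE = 512         # 512 <= H, W
MAX_SIDE = 2048        # H, W <= 2048
MAX_PIXELS = 2 ** 21   # H * W <= 2^21
SIDE_MULTIPLE = 32     # H, W mod 32 = 0


def is_valid_resolution(height: int, width: int) -> bool:
    """Return True if (height, width) satisfies all three stated constraints."""
    for side in (height, width):
        if not MIN_SIDE <= side <= MAX_SIDE:
            return False
        if side % SIDE_MULTIPLE != 0:
            return False
    return height * width <= MAX_PIXELS


if __name__ == "__main__":
    for h, w in [(1024, 1024), (1024, 2048), (1000, 1024)]:
        print(h, w, is_valid_resolution(h, w))
```

Note that some resolutions listed in the VRAM table (for example 1280 * 720, where 720 is not a multiple of 32, and 2048 * 2048, which exceeds 2^21 pixels) do not pass this check. The source does not explain the discrepancy, so treat this checker as a reading of the stated range only.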

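VRAM Usage

The DiT models were tested at BF16 precision with batchsize=4. The results are shown in the table below:

Resolution	enable_model_cpu_offload OFF	enable_model_cpu_offload ON	enable_model_cpu_offload ON, Text Encoder 4bit
512 * 512	33GB	20GB	13GB
1280 * 720	35GB	20GB	13GB
1024 * 1024	35GB	20GB	13GB
1920 * 1280	39GB	20GB	14GB
2048 * 2048	43GB	21GB	14GB

In addition, we recommend that your device have at least 32GB of RAM to prevent the process from being killed.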
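To pick a configuration for a given GPU, the table can be turned into a lookup. This is a rough sketch: the numbers are copied from the table above (measured at BF16, batchsize=4), while the configuration names and the `configs_that_fit` helper are hypothetical and leave no headroom beyond the reported figures.

```python
# VRAM in GB, copied from the table above. Keys are the resolutions as listed.
VRAM_GB = {
    (512, 512): {"offload_off": 33, "offload_on": 20, "offload_on_4bit_text_encoder": 13},
    (1280, 720): {"offload_off": 35, "offload_on": 20, "offload_on_4bit_text_encoder": 13},
    (1024, 1024): {"offload_off": 35, "offload_on": 20, "offload_on_4bit_text_encoder": 13},
    (1920, 1280): {"offload_off": 39, "offload_on": 20, "offload_on_4bit_text_encoder": 14},
    (2048, 2048): {"offload_off": 43, "offload_on": 21, "offload_on_4bit_text_encoder": 14},
}


def configs_that_fit(resolution, available_vram_gb):
    """List the tested configurations whose reported VRAM fits the given budget."""
    row = VRAM_GB[resolution]
    return [name for name, needed in row.items() if needed <= available_vram_gb]


if __name__ == "__main__":
    # Example: which tested configurations fit a 24GB card at 1024 * 1024?
    print(configs_that_fit((1024, 1024), 24))
```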

Model Metrics

We evaluated the model on several benchmarks and obtained the following results:

DPG-Bench

Model	Overall	Global	Entity	Attribute	Relation	Other
SDXL	74.65	83.27	82.43	80.91	86.76	80.41
PixArt-alpha	71.11	74.97	79.32	78.60	82.57	76.96
SD3-Medium	84.08	87.90	91.01	88.83	80.70	88.68
DALL-E 3	83.50	90.97	89.61	88.39	90.58	89.83
Flux.1-dev	83.79	85.80	86.79	89.98	90.04	89.90
Janus-Pro-7B	84.19	86.90	88.90	89.40	89.32	89.48
CogView4-6B	85.13	83.85	90.35	91.17	91.14	87.29

GenEval

Model	Overall	Single Obj.	Two Obj.	Counting	Colors	Position	Color attribution
SDXL	0.55	0.98	0.74	0.39	0.85	0.15	0.23
PixArt-alpha	0.48	0.98	0.50	0.44	0.80	0.08	0.07
SD3-Medium	0.74	0.99	0.94	0.72	0.89	0.33	0.60
DALL-E 3	0.67	0.96	0.87	0.47	0.83	0.43	0.45
Flux.1-dev	0.66	0.98	0.79	0.73	0.77	0.22	0.45
Janus-Pro-7B	0.80	0.99	0.89	0.59	0.90	0.79	0.66
CogView4-6B	0.73	0.99	0.86	0.66	0.79	0.48	0.58

T2I-CompBench

Model	Color	Shape	Texture	2D-Spatial	3D-Spatial	Numeracy	Non-spatial Clip	Complex 3-in-1
SDXL	0.5879	0.4687	0.5299	0.2133	0.3566	0.4988	0.3119	0.3237
PixArt-alpha	0.6690	0.4927	0.6477	0.2064	0.3901	0.5058	0.3197	0.3433
SD3-Medium	0.8132	0.5885	0.7334	0.3200	0.4084	0.6174	0.3140	0.3771
DALL-E 3	0.7785	0.6205	0.7036	0.2865	0.3744	0.5880	0.3003	0.3773
Flux.1-dev	0.7572	0.5066	0.6300	0.2700	0.3992	0.6165	0.3065	0.3628
Janus-Pro-7B	0.5145	0.3323	0.4069	0.1566	0.2753	0.4406	0.3137	0.3806
CogView4-6B	0.7786	0.5880	0.6983	0.3075	0.3708	0.6626	0.3056	0.3869

Chinese Text Accuracy Evaluation

Model	Precision	Recall	F1 Score	Pick@4
Kolors	0.6094	0.1886	0.2880	0.1633
CogView4-6B	0.6969	0.5532	0.6168	0.3265
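The source does not say how the F1 Score column is computed. Assuming it is the usual harmonic mean of precision and recall, the reported values can be reproduced to within rounding with the short sketch below (the function name is illustrative).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall copied from the table above.
    for name, p, r in [("Kolors", 0.6094, 0.1886), ("CogView4-6B", 0.6969, 0.5532)]:
        print(f"{name}: F1 = {f1_score(p, r):.4f}")
```

Small differences in the last digit are expected, since the table's precision and recall are themselves rounded.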

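Inference Model

Prompt Optimization

Although the CogView4 series models are all trained on long synthetic image descriptions, we strongly recommend rewriting the prompt with a large language model before text-to-image generation, which greatly improves generation quality.

We provide an example script. We recommend running this script to polish your prompt. Note that CogView4 and CogView3 use different few-shot examples for prompt optimization, so make sure to select the matching version.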

cd inference
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus" --cogview_version "cogview4"
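Prompts usually contain spaces and quotes, so it can be safer to build the argument list in Python than to paste the prompt into a shell line. The sketch below only assembles the command shown above; `build_prompt_optimize_command` is a hypothetical helper, and the flags are taken verbatim from that command with no others assumed.

```python
import shlex


def build_prompt_optimize_command(
    prompt: str,
    api_key: str,
    cogview_version: str = "cogview4",
    model: str = "glm-4-plus",
    base_url: str = "https://open.bigmodel.cn/api/paas/v4",
) -> list:
    """Return an argv list for prompt_optimize.py (run it from the inference directory)."""
    return [
        "python", "prompt_optimize.py",
        "--api_key", api_key,
        "--prompt", prompt,
        "--base_url", base_url,
        "--model", model,
        "--cogview_version", cogview_version,
    ]


if __name__ == "__main__":
    argv = build_prompt_optimize_command("a cat wearing a wizard's hat", "YOUR_API_KEY")
    # shlex.join produces a correctly quoted line for copy-pasting into a shell.
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, cwd="inference")`, which avoids shell quoting altogether.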

Inference Model

Run the model at BF16 precision:

from diffusers import CogView4Pipeline
import torch

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")

# Enable the options below to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")

For more inference code, see:

To load the text encoder with BNB int4, see here.
To load the text encoder & transformer with TorchAO int8 or int4, see here.
For a gradio interface example, see here.

Installation

git clone https://github.com/THUDM/CogView4
cd CogView4
git clone https://huggingface.co/THUDM/CogView4-6B
pip install -r inference/requirements.txt

Run

12G VRAM

MODE=1 python inference/gradio_web_demo.py

24G VRAM 32G RAM

MODE=2 python inference/gradio_web_demo.py

24G VRAM 64G RAM

MODE=3 python inference/gradio_web_demo.py

48G VRAM 64G RAM

MODE=4 python inference/gradio_web_demo.py
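The four modes are listed only by hardware requirement, so choosing one amounts to a threshold lookup. The helper below is an illustrative sketch, not part of the repository: it returns the highest listed mode whose VRAM and RAM figures are both met, and it assumes no RAM requirement for MODE=1 because the source states none.

```python
# (mode, minimum VRAM in GB, minimum RAM in GB), copied from the list above.
# MODE=1 has no RAM figure in the source; 0 here is an assumption.
MODE_REQUIREMENTS = [
    (4, 48, 64),
    (3, 24, 64),
    (2, 24, 32),
    (1, 12, 0),
]


def pick_mode(vram_gb: float, ram_gb: float):
    """Return the highest MODE whose listed requirements are met, or None."""
    for mode, min_vram, min_ram in MODE_REQUIREMENTS:
        if vram_gb >= min_vram and ram_gb >= min_ram:
            return mode
    return None


if __name__ == "__main__":
    mode = pick_mode(vram_gb=24, ram_gb=32)
    print(f"MODE={mode} python inference/gradio_web_demo.py")
```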

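License

The code in this repository and the CogView3 models are released under the Apache 2.0 license.

We welcome and appreciate your code contributions. You can view the contribution guidelines here.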
