
CogView4 & CogView3 & CogView-3Plus

Read in Chinese | Read in Japanese

🤗 HuggingFace Space 🤖ModelScope Space 🛠️ZhipuAI MaaS(Faster)
👋 WeChat Community 📚 CogView3 Paper

showcase.png

Project Updates

  • 🔥🔥 2025/03/04: We've adapted and open-sourced the diffusers version of the CogView4 model, which has 6B parameters and supports native Chinese input as well as Chinese text-to-image generation. You can try it online.
  • 2024/10/13: We've adapted and open-sourced the diffusers version of CogView-3Plus-3B model. You can try it online.
  • 2024/9/29: We've open-sourced CogView3 and CogView-3Plus-3B. CogView3 is a text-to-image system based on cascading diffusion, using a relay diffusion framework. CogView-3Plus is a series of newly developed text-to-image models based on Diffusion Transformer.

Project Plan

  • Diffusers workflow adaptation
  • Cog series fine-tuning kits (coming soon)
  • ControlNet models and training code

Model Introduction

Model Comparison

| Model Name | CogView4 | CogView3-Plus-3B |
|---|---|---|
| Resolution | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 | 512 ≤ H, W ≤ 2048; H × W ≤ 2^21; H, W divisible by 32 |
| Inference Precision | Only supports BF16, FP32 | Only supports BF16, FP32 |
| Encoder | GLM-4-9B | T5-XXL |
| Prompt Language | Chinese, English | English |
| Prompt Length Limit | 1024 tokens | 224 tokens |
| Download Links | 🤗 HuggingFace / 🤖 ModelScope / 🟣 WiseModel | 🤗 HuggingFace / 🤖 ModelScope / 🟣 WiseModel |
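The resolution constraints above (512 ≤ H, W ≤ 2048, at most 2^21 total pixels, both sides divisible by 32) can be checked with a small helper. `valid_cogview4_resolution` is a hypothetical function written here for illustration; it is not part of the repository:

```python
def valid_cogview4_resolution(width: int, height: int) -> bool:
    """Check a requested resolution against the CogView4 constraints above."""
    # Each side must lie in [512, 2048]
    if not (512 <= width <= 2048 and 512 <= height <= 2048):
        return False
    # Total pixel count capped at 2^21 = 2,097,152
    if width * height > 2 ** 21:
        return False
    # Both sides must be divisible by 32
    return width % 32 == 0 and height % 32 == 0

print(valid_cogview4_resolution(1024, 1024))  # True: 2^20 pixels, divisible by 32
print(valid_cogview4_resolution(2048, 2048))  # False: 2^22 pixels exceeds the cap
print(valid_cogview4_resolution(1000, 1000))  # False: not divisible by 32
```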

Memory Usage

The DiT model was tested at BF16 precision with batch size 4; results are shown in the table below:

| Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON + Text Encoder 4-bit |
|---|---|---|---|
| 512 × 512 | 33GB | 20GB | 13GB |
| 1280 × 720 | 35GB | 20GB | 13GB |
| 1024 × 1024 | 35GB | 20GB | 13GB |
| 1920 × 1280 | 39GB | 20GB | 14GB |
| 2048 × 2048 | 43GB | 21GB | 14GB |

Additionally, we recommend that your device has at least 32GB of RAM to prevent the process from being killed.

Model Metrics

We've tested on multiple benchmarks and achieved the following scores:

DPG-Bench

| Model | Overall | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|
| SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
| Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |

GenEval

| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attribution |
|---|---|---|---|---|---|---|---|
| SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
| SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |

T2I-CompBench

| Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial (CLIP) | Complex (3-in-1) |
|---|---|---|---|---|---|---|---|---|
| SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
| PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
| SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
| DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
| Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
| Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
| CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |

Chinese Text Accuracy Evaluation

| Model | Precision | Recall | F1 Score | Pick@4 |
|---|---|---|---|---|
| Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
| CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
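As a quick sanity check, the F1 column is the harmonic mean of the precision and recall columns:

```python
def f1_score(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# CogView4-6B values from the table above
print(round(f1_score(0.6969, 0.5532), 4))  # 0.6168, matching the table
```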

Inference

Prompt Optimization

Although the CogView4 series models are trained on lengthy synthetic image descriptions, we strongly recommend rewriting prompts with a large language model before text-to-image generation; this greatly improves generation quality.

We provide an example script and recommend running it to refine your prompts. Note that CogView4 and CogView3 use different few-shot examples for prompt optimization, so be sure to pass the matching `--cogview_version`.

```shell
cd inference
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus" --cogview_version "cogview4"
```
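Under the hood, such a rewriting step amounts to one chat-completion call against the OpenAI-compatible endpoint. The sketch below only builds the request messages and is illustrative, not the script's actual code; the system instruction here is made up, and the real script ships its own few-shot examples for CogView4 vs. CogView3:

```python
def build_rewrite_messages(user_prompt: str) -> list:
    """Build chat messages asking an LLM to expand a short image prompt.

    Hypothetical helper for illustration only; prompt_optimize.py uses its
    own (version-specific) instructions and few-shot examples.
    """
    system = (
        "You are an assistant that rewrites short text-to-image prompts "
        "into detailed, vivid image descriptions."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# These messages would then be sent with an OpenAI-compatible client pointed
# at base_url="https://open.bigmodel.cn/api/paas/v4", model="glm-4-plus".
messages = build_rewrite_messages("a red sports car by the sea")
print(messages[1]["content"])
```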

Inference Model

Run the model with BF16 precision:

```python
from diffusers import CogView4Pipeline
import torch

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")

# Enable these options to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```

For more inference code, please check:

  1. For using BNB int4 to load text encoder and complete inference code annotations, check here.
  2. For using TorchAO int8 or int4 to load text encoder & transformer and complete inference code annotations, check here.
  3. For setting up a gradio GUI DEMO, check here.
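For option 1, loading the GLM-4 text encoder in 4-bit roughly follows the standard `transformers` + `bitsandbytes` pattern. The sketch below is an assumption-laden outline (it presumes the text encoder sits in the `text_encoder` subfolder of the checkpoint and that `bitsandbytes` is installed), not the repository's exact script; see the linked code for the authoritative version:

```python
def load_cogview4_with_4bit_text_encoder(model_id: str = "THUDM/CogView4-6B"):
    """Sketch: load CogView4 with its GLM-4 text encoder quantized to 4-bit.

    Imports are deferred so the function can be defined without a GPU;
    calling it downloads the full checkpoint.
    """
    import torch
    from diffusers import CogView4Pipeline
    from transformers import AutoModel, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # Load only the text encoder in 4-bit, then hand it to the pipeline
    text_encoder = AutoModel.from_pretrained(
        model_id,
        subfolder="text_encoder",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )
    pipe = CogView4Pipeline.from_pretrained(
        model_id,
        text_encoder=text_encoder,
        torch_dtype=torch.bfloat16,
    )
    # Offload idle submodules to CPU, matching the memory table above
    pipe.enable_model_cpu_offload()
    return pipe
```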

Installation

```shell
git clone https://github.com/THUDM/CogView4
cd CogView4
git clone https://huggingface.co/THUDM/CogView4-6B
pip install -r inference/requirements.txt
```

Quickstart

12GB VRAM

```shell
MODE=1 python inference/gradio_web_demo.py
```

24GB VRAM, 32GB RAM

```shell
MODE=2 python inference/gradio_web_demo.py
```

24GB VRAM, 64GB RAM

```shell
MODE=3 python inference/gradio_web_demo.py
```

48GB VRAM, 64GB RAM

```shell
MODE=4 python inference/gradio_web_demo.py
```

License

The code in this repository and the CogView3 models are licensed under Apache 2.0.

We welcome and appreciate your code contributions. You can view the contribution guidelines here.
