[arXiv] [Project Page] [Code]
Minglei Shi1*, Ziyang Yuan1*, Haotian Yang2, Xintao Wang2†, Mingwu Zheng2, Xin Tao2, Wenliang Zhao1, Wenzhao Zheng1, Jie Zhou1, Jiwen Lu1†, Pengfei Wan2, Di Zhang2, Kun Gai2
(*equal contribution, †corresponding author.)
1Tsinghua University, 2Kuaishou Technology.
- [2025.3.20]: Released the code of DiffMoE
- [2025.3.19]: Released the project page of DiffMoE
TL;DR: DiffMoE is a dynamic MoE Transformer that outperforms 3× larger dense models on diffusion tasks, using a batch-level global token pool and adaptive routing while keeping 1× parameter activation.
This repo contains:
- 🛸 A DiffMoE training and evaluation script using PyTorch DDP
- training / inference scripts
- Hugging Face checkpoints
Token Accessibility and Dynamic Computation. (a) Token accessibility levels, from token isolation to cross-sample interaction; colors denote tokens from different samples and t_i indicates noise levels. (b) Performance-accessibility analysis across architectures. (c) Computational dynamics during diffusion sampling, showing adaptive computation from noise to image. (d) Class-wise computation allocation, from hard (technical diagrams) to easy (natural photos) tasks. Results from DiffMoE-L-E16-Flow (700K).
DiffMoE Architecture Overview. DiffMoE flattens tokens into a batch-level global token pool, where each expert maintains a fixed training capacity.
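To make the batch-level token pool idea concrete, here is a minimal, self-contained PyTorch sketch of expert routing over a flattened, cross-sample token pool. It is an illustration only, not the layer implemented in this repo; the class name `GlobalPoolMoE` and the capacity handling are hypothetical and the real DiffMoE module differs in detail.

```python
import torch
import torch.nn as nn

class GlobalPoolMoE(nn.Module):
    """Toy batch-level MoE layer: tokens from all samples share one routing pool.

    Hypothetical sketch for illustration; not the DiffMoE layer from this repo.
    """
    def __init__(self, dim: int, num_experts: int = 16, capacity: float = 1.0):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity = capacity  # fraction of the pool each expert may process on average

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        pool = x.reshape(b * n, d)                 # flatten the batch into a global token pool
        scores = self.router(pool).softmax(dim=-1) # (b*n, num_experts) routing scores
        out = torch.zeros_like(pool)
        # Each expert picks its top-scoring tokens from the whole pool,
        # up to a fixed per-expert capacity (cross-sample token selection).
        cap = int(self.capacity * pool.size(0) / len(self.experts))
        for e, expert in enumerate(self.experts):
            idx = scores[:, e].topk(cap).indices
            out[idx] += scores[idx, e, None] * expert(pool[idx])
        return out.reshape(b, n, d)

# usage: route 4 samples of 256 tokens each through the shared pool
layer = GlobalPoolMoE(dim=64, num_experts=16)
print(layer(torch.randn(4, 256, 64)).shape)  # torch.Size([4, 256, 64])
```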
Download the ImageNet dataset and place it at your IMAGENET_PATH.
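The training code presumably expects the standard ImageFolder layout (class-named subdirectories under `train/`), as in most DiT-style repos; this is an assumption, not something stated in this README. A quick sanity check with torchvision:

```python
# Hedged sanity check: assumes IMAGENET_PATH/train/<class_name>/<image>.JPEG layout.
import os
from torchvision.datasets import ImageFolder

IMAGENET_PATH = os.environ.get("IMAGENET_PATH", "/path/to/imagenet")  # hypothetical path
dataset = ImageFolder(os.path.join(IMAGENET_PATH, "train"))
print(f"{len(dataset)} images across {len(dataset.classes)} classes")  # expect 1,281,167 / 1000
```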
Download the code:
git clone https://github.com/KwaiVGI/DiffMoE.git
cd DiffMoE
A suitable conda environment named diffmoe can be created and activated with:
conda env create -f environment.yaml
conda activate diffmoe
Pre-trained models are coming soon; stay tuned!
Model Name | # Avg. Act. Params | Training Steps | CFG | FID-50K↓ | Inception Score↑ |
---|---|---|---|---|---|
TC-DiT-L-E16-Flow | 458M | 700K | 1.0 | 19.06 | 73.49 |
EC-DiT-L-E16-Flow | 458M | 700K | 1.0 | 16.12 | 82.37 |
Dense-DiT-L-Flow | 458M | 700K | 1.0 | 17.01 | 78.17 |
Dense-DiT-XL-Flow | 675M | 700K | 1.0 | 14.77 | 86.82 |
DiffMoE-L-E16-Flow | 454M | 700K | 1.0 | 14.41 | 88.19 |
Dense-DiT-XL-Flow | 458M | 7000K | 1.0 | 9.47 | 115.58 |
DiffMoE-L-E8-Flow | 458M | 7000K | 1.0 | 9.60 | 131.45 |
Dense-DiT-XL-DDPM | 458M | 7000K | 1.0 | 9.62 | 123.19 |
DiffMoE-L-E8-DDPM | 458M | 7000K | 1.0 | 9.17 | 131.10 |
Script for the default setting:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 \
train.py --config ./config/000_DiffMoE_S_E16_Flow.yaml
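The launcher above uses torchrun's standard single-node DDP pattern. For readers unfamiliar with it, the following is a minimal, generic sketch of what a torchrun-launched entry point does; it is not `train.py` from this repo, and the model and training loop shown are purely illustrative stand-ins.

```python
# Generic torchrun + DDP skeleton (illustrative only; not this repo's train.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")             # torchrun sets RANK / WORLD_SIZE / MASTER_*
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the DiffMoE model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                     # stand-in training loop
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                         # gradients are all-reduced across ranks by DDP
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```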
Evaluate DiffMoE:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nnodes=1 --nproc_per_node=8 \
sample_ddp_feature.py --image-size 256 \
--per-proc-batch-size 125 --num-fid-samples 50000 --cfg-scale 1.0 --num-sampling-steps 250 --sample-dir samples \
--ckpt exps/EXPNAME/checkpoints/xxxxxx.pt
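The command above generates 50,000 samples under `--sample-dir`; FID-50K in the table is then computed against ImageNet reference statistics. The repo's exact evaluation pipeline is not shown here (the script name suggests it may dump features in the DiT/ADM style), so purely as an assumption, if the samples are saved as images, any standard FID tool can score them, for example the third-party pytorch-fid package:

```python
# Assumption: samples were saved as image files under samples/; pytorch-fid is a
# third-party package (pip install pytorch-fid), not necessarily what the paper used.
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["samples/EXPNAME", "/path/to/imagenet_reference_images"],  # hypothetical paths
    batch_size=50,
    device="cuda",
    dims=2048,
)
print(f"FID: {fid:.2f}")
```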
We thank Zihan Qiu for helpful discussions. A large portion of the code in this repo is based on MAR, DiT, and DeepSeekMoE.
@misc{shi2025diffmoedynamictokenselection,
title={DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers},
author={Minglei Shi and Ziyang Yuan and Haotian Yang and Xintao Wang and Mingwu Zheng and Xin Tao and Wenliang Zhao and Wenzhao Zheng and Jie Zhou and Jiwen Lu and Pengfei Wan and Di Zhang and Kun Gai},
year={2025},
eprint={2503.14487},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.14487},
}