[arXiv] [Project Page] [Code]
Minglei Shi1*, Ziyang Yuan1*, Haotian Yang2, Xintao Wang2†, Mingwu Zheng2, Xin Tao2, Wenliang Zhao1, Wenzhao Zheng1, Jie Zhou1, Jiwen Lu1†, Pengfei Wan2, Di Zhang2, Kun Gai2
(*equal contribution, †corresponding author.)
1Tsinghua University, 2Kuaishou Technology.
- [2025.3.20]: Released the code of DiffMoE
- [2025.3.19]: Released the project page of DiffMoE
TL;DR: DiffMoE is a dynamic MoE Transformer that outperforms 3× larger dense models on diffusion tasks, using a batch-level global token pool and adaptive routing while keeping 1× parameter activation.
This repo contains:
- 🛸 A DiffMoE training and evaluation script using PyTorch DDP
- training / inference scripts
- Hugging Face checkpoints
Token Accessibility and Dynamic Computation. (a) Token accessibility levels, from token isolation to cross-sample interaction; colors denote tokens from different samples and t_i indicates noise levels. (b) Performance-accessibility analysis across architectures. (c) Computational dynamics during diffusion sampling, showing adaptive computation from noise to image. (d) Class-wise computation allocation, from hard (technical diagrams) to easy (natural photos) tasks. Results from DiffMoE-L-E16-Flow (700K).
DiffMoE Architecture Overview. DiffMoE flattens tokens into a batch-level global token pool, where each expert maintains a fixed training capacity.
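To make the batch-level token pool idea concrete, here is a minimal, self-contained PyTorch sketch of expert routing over a flattened, cross-sample token pool. It is an illustration only, not the layer implemented in this repo; the class name `GlobalPoolMoE` and the capacity handling are hypothetical and the real DiffMoE module differs in detail.

```python
import torch
import torch.nn as nn

class GlobalPoolMoE(nn.Module):
    """Toy batch-level MoE layer: tokens from all samples share one routing pool.

    Hypothetical sketch for illustration; not the DiffMoE layer from this repo.
    """
    def __init__(self, dim: int, num_experts: int = 16, capacity: float = 1.0):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity = capacity  # fraction of the pool each expert may process on average

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        pool = x.reshape(b * n, d)                 # flatten the batch into a global token pool
        scores = self.router(pool).softmax(dim=-1) # (b*n, num_experts) routing scores
        out = torch.zeros_like(pool)
        # Each expert picks its top-scoring tokens from the whole pool,
        # up to a fixed per-expert capacity (cross-sample token selection).
        cap = int(self.capacity * pool.size(0) / len(self.experts))
        for e, expert in enumerate(self.experts):
            idx = scores[:, e].topk(cap).indices
            out[idx] += scores[idx, e, None] * expert(pool[idx])
        return out.reshape(b, n, d)

# usage: route 4 samples of 256 tokens each through the shared pool
layer = GlobalPoolMoE(dim=64, num_experts=16)
print(layer(torch.randn(4, 256, 64)).shape)  # torch.Size([4, 256, 64])
```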
Download the ImageNet dataset and place it at your IMAGENET_PATH.
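The training code presumably expects the standard ImageFolder layout (class-named subdirectories under `train/`), as in most DiT-style repos; this is an assumption, not something stated in this README. A quick sanity check with torchvision:

```python
# Hedged sanity check: assumes IMAGENET_PATH/train/<class_name>/<image>.JPEG layout.
import os
from torchvision.datasets import ImageFolder

IMAGENET_PATH = os.environ.get("IMAGENET_PATH", "/path/to/imagenet")  # hypothetical path
dataset = ImageFolder(os.path.join(IMAGENET_PATH, "train"))
print(f"{len(dataset)} images across {len(dataset.classes)} classes")  # expect 1,281,167 / 1000
```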
Download the code:
git clone https://github.com/KwaiVGI/DiffMoE.git
cd DiffMoE
A suitable conda environment named diffmoe can be created and activated with:
conda env create -f environment.yaml
conda activate diffmoe
Pre-trained models are coming soon; stay tuned!
Model Name | # Avg. Act. Params | Training Steps | CFG | FID-50K↓ | Inception Score↑ |
---|---|---|---|---|---|
TC-DiT-L-E16-Flow | 458M | 700K | 1.0 | 19.06 | 73.49 |
EC-DiT-L-E16-Flow | 458M | 700K | 1.0 | 16.12 | 82.37 |
Dense-DiT-L-Flow | 458M | 700K | 1.0 | 17.01 | 78.17 |
Dense-DiT-XL-Flow | 675M | 700K | 1.0 | 14.77 | 86.82 |
DiffMoE-L-E16-Flow | 454M | 700K | 1.0 | 14.41 | 88.19 |
Dense-DiT-XL-Flow | 458M | 7000K | 1.0 | 9.47 | 115.58 |
DiffMoE-L-E8-Flow | 458M | 7000K | 1.0 | 9.60 | 131.45 |
Dense-DiT-XL-DDPM | 458M | 7000K | 1.0 | 9.62 | 123.19 |
DiffMoE-L-E8-DDPM | 458M | 7000K | 1.0 | 9.17 | 131.10 |
Script for the default setting:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 \
train.py --config ./config/000_DiffMoE_S_E16_Flow.yaml
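The launcher above uses torchrun's standard single-node DDP pattern. For readers unfamiliar with it, the following is a minimal, generic sketch of what a torchrun-launched entry point does; it is not `train.py` from this repo, and the model and training loop shown are purely illustrative stand-ins.

```python
# Generic torchrun + DDP skeleton (illustrative only; not this repo's train.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")             # torchrun sets RANK / WORLD_SIZE / MASTER_*
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the DiffMoE model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                     # stand-in training loop
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                         # gradients are all-reduced across ranks by DDP
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```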
Evaluate DiffMoE:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nnodes=1 --nproc_per_node=8 \
sample_ddp_feature.py --image-size 256 \
--per-proc-batch-size 125 --num-fid-samples 50000 --cfg-scale 1.0 --num-sampling-steps 250 --sample-dir samples \
--ckpt exps/EXPNAME/checkpoints/xxxxxx.pt
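The command above generates 50,000 samples under `--sample-dir`; FID-50K in the table is then computed against ImageNet reference statistics. The repo's exact evaluation pipeline is not shown here (the script name suggests it may dump features in the DiT/ADM style), so purely as an assumption, if the samples are saved as images, any standard FID tool can score them, for example the third-party pytorch-fid package:

```python
# Assumption: samples were saved as image files under samples/; pytorch-fid is a
# third-party package (pip install pytorch-fid), not necessarily what the paper used.
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["samples/EXPNAME", "/path/to/imagenet_reference_images"],  # hypothetical paths
    batch_size=50,
    device="cuda",
    dims=2048,
)
print(f"FID: {fid:.2f}")
```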
We thank Zihan Qiu for helpful discussions. A large portion of the code in this repo is based on MAR, DiT, and DeepSeekMoE.
@misc{shi2025diffmoedynamictokenselection,
title={DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers},
author={Minglei Shi and Ziyang Yuan and Haotian Yang and Xintao Wang and Mingwu Zheng and Xin Tao and Wenliang Zhao and Wenzhao Zheng and Jie Zhou and Jiwen Lu and Pengfei Wan and Di Zhang and Kun Gai},
year={2025},
eprint={2503.14487},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.14487},
}