
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Minglei Shi1*, Ziyang Yuan1*, Haotian Yang2, Xintao Wang2†, Mingwu Zheng2, Xin Tao2, Wenliang Zhao1, Wenzhao Zheng1, Jie Zhou1, Jiwen Lu1†, Pengfei Wan2, Di Zhang2, Kun Gai2
(*equal contribution, †corresponding author.)

1Tsinghua University, 2Kuaishou Technology.

🔥 Updates

  • [2025.3.20]: Released the code of DiffMoE.
  • [2025.3.19]: Released the project page of DiffMoE.

📖 Introduction

TL;DR: DiffMoE is a dynamic MoE Transformer that outperforms dense models 3× its size on diffusion tasks, combining a batch-level global token pool with adaptive routing while activating only 1× parameters.

This repo contains the PyTorch implementation of DiffMoE, TC-DiT, EC-DiT, and Dense DiT.

To-do list

  • training / inference scripts
  • Hugging Face checkpoints

✨ Key Points

Token Accessibility and Dynamic Computation. (a) Token accessibility levels, from token isolation to cross-sample interaction; colors denote tokens from different samples, and $t_i$ indicates noise levels. (b) Performance–accessibility analysis across architectures. (c) Computational dynamics during diffusion sampling, showing adaptive computation from noise to image. (d) Class-wise computation allocation, from hard (technical diagrams) to easy (natural photos) tasks. Results from DiffMoE-L-E16-Flow (700K).

DiffMoE Architecture Overview. DiffMoE flattens tokens into a batch-level global token pool, where each expert maintains a fixed training capacity of $C^{E_i}_{train} = 1$. During inference, a dynamic capacity predictor adaptively routes tokens across different sampling steps and conditions. Different colors denote tokens from distinct samples, while $t_i$ represents the corresponding noise levels.
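To make the routing concrete, here is a minimal PyTorch sketch (module and argument names are illustrative, not the repo's actual code): tokens from the whole batch are flattened into one global pool, each expert selects its top-k pool tokens by router score, and k is chosen so the average training capacity is 1. At inference, the dynamic capacity predictor would supply capacity_factor per sampling step and condition instead of the fixed value used here.

import torch
import torch.nn as nn

class GlobalPoolMoE(nn.Module):
    """Illustrative batch-level global-token-pool MoE layer (not the repo's implementation)."""
    def __init__(self, dim, num_experts=16, mlp_ratio=4):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                          nn.Linear(mlp_ratio * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x, capacity_factor=1.0):
        # Flatten (batch, seq, dim) into a single batch-level token pool so
        # experts can draw tokens across samples and noise levels.
        B, N, D = x.shape
        pool = x.reshape(B * N, D)
        scores = self.router(pool).softmax(dim=-1)  # (B*N, num_experts)
        # Fixed training capacity C^{E_i}_train = 1: on average each expert
        # processes pool_size / num_experts tokens; a capacity predictor
        # would vary capacity_factor at inference time.
        k = int(capacity_factor * pool.size(0) / self.num_experts)
        out = torch.zeros_like(pool)
        for e, expert in enumerate(self.experts):
            # Each expert takes its top-k tokens from the global pool,
            # weighted by the routing score.
            w, idx = scores[:, e].topk(k)
            out[idx] += w.unsqueeze(-1) * expert(pool[idx])
        return out.reshape(B, N, D)

The adaptive behavior in panels (c) and (d) above corresponds to varying this per-expert capacity across sampling steps and classes.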

Preparation

Dataset

Download the ImageNet dataset and place it at your IMAGENET_PATH.
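A quick sanity check of the layout (assuming the standard class-per-folder structure that torchvision's ImageFolder expects, e.g. IMAGENET_PATH/train/n01440764/*.JPEG):

from torchvision import datasets

# Expect 1000 classes (and ~1.28M images) for the ImageNet-1K training split.
train_set = datasets.ImageFolder("IMAGENET_PATH/train")
print(len(train_set.classes), "classes,", len(train_set), "images")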

Installation

Download the code:

git clone https://github.com/KwaiVGI/DiffMoE.git 
cd DiffMoE

A suitable conda environment named diffmoe can be created and activated with:

conda env create -f environment.yaml
conda activate diffmoe
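A one-off check that the environment resolved correctly (assuming CUDA-enabled PyTorch is pinned in environment.yaml):

import torch

# Both the version and CUDA availability should print without errors.
print(torch.__version__, "CUDA available:", torch.cuda.is_available())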

Pre-trained models are coming soon, stay tuned!

Model Name           Avg. Act. Params   Training Steps   CFG   FID-50K ↓   Inception Score ↑
TC-DiT-L-E16-Flow    458M               700K             1.0   19.06       73.49
EC-DiT-L-E16-Flow    458M               700K             1.0   16.12       82.37
Dense-DiT-L-Flow     458M               700K             1.0   17.01       78.17
Dense-DiT-XL-Flow    675M               700K             1.0   14.77       86.82
DiffMoE-L-E16-Flow   454M               700K             1.0   14.41       88.19
Dense-DiT-XL-Flow    458M               7000K            1.0   9.47        115.58
DiffMoE-L-E8-Flow    458M               7000K            1.0   9.60        131.45
Dense-DiT-XL-DDPM    458M               7000K            1.0   9.62        123.19
DiffMoE-L-E8-DDPM    458M               7000K            1.0   9.17        131.10

Usage

Training

Script for the default setting:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 \
train.py --config ./config/000_DiffMoE_S_E16_Flow.yaml
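The --config flag selects a YAML file under ./config. To inspect which settings a run will use before launching (a sketch; the actual keys are defined by the repo's config files, not listed here):

import yaml

with open("./config/000_DiffMoE_S_E16_Flow.yaml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg))  # top-level option names, before editing them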

Evaluation (ImageNet 256x256)

Evaluate DiffMoE:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nnodes=1 --nproc_per_node=8 \
sample_ddp_feature.py --image-size 256 \
    --per-proc-batch-size 125 --num-fid-samples 50000 --cfg-scale 1.0 --num-sampling-steps 250 --sample-dir samples \
    --ckpt exps/EXPNAME/checkpoints/xxxxxx.pt
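sample_ddp_feature.py writes the 50K samples under --sample-dir; the FID-50K and Inception Score numbers in the table above are computed from such sample sets against ImageNet reference statistics. As a rough local cross-check (an alternative scorer, not this repo's evaluation pipeline), torchmetrics can compare batches of real and generated images:

import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # needs torchmetrics[image]

fid = FrechetInceptionDistance(feature=2048)
# Stand-in uint8 batches; replace with real ImageNet images and your samples.
real = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())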

Acknowledgements

We thank Zihan Qiu for helpful discussions. A large portion of the code in this repo is based on MAR, DiT, and DeepSeekMoE.

🌟 Citation

@misc{shi2025diffmoedynamictokenselection,
      title={DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers}, 
      author={Minglei Shi and Ziyang Yuan and Haotian Yang and Xintao Wang and Mingwu Zheng and Xin Tao and Wenliang Zhao and Wenzhao Zheng and Jie Zhou and Jiwen Lu and Pengfei Wan and Di Zhang and Kun Gai},
      year={2025},
      eprint={2503.14487},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14487}, 
}

About

PyTorch implementation of DiffMoE, TC-DiT, EC-DiT and Dense DiT
