Commit d178946

Added many features: DiP, BERT, EMA (the rest is detailed in the README)

GuyTevet committed Feb 12, 2025
1 parent 395f382 commit d178946

Showing 36 changed files with 1,574 additions and 309 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -130,3 +130,11 @@ dmypy.json

save/
wandb/
t2m/
body_models/
glove/
slurm**/
*.out
*.err
*.slurm
.vscode/
191 changes: 191 additions & 0 deletions DiP.md
@@ -0,0 +1,191 @@
# DiP


Diffusion Planner (DiP) is an ultra-fast text-to-motion diffusion model. It is our newest version of [MDM](README.md)! It was published in the CLoSD paper [ICLR 2025 Spotlight]. To read more about it, check out the [CLoSD project page](https://guytevet.github.io/CLoSD-page/) and [code](https://github.com/GuyTevet/CLoSD).


![DiP](https://github.com/GuyTevet/mdm-page/raw/main/static/figures/dip_vis_caption_small.gif)




## Performance


![dip_spec](assets/dip_spec.png)


### Why is DiP so fast? Here's the TL;DR:


- DiP is autoregressive: each call predicts only the next 2 seconds of motion (sketched below).
- DiP uses only 10 diffusion steps (it performs well even with 5).
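
For intuition, here is a minimal sketch of that autoregressive loop. It is illustrative only, with hypothetical names and a dummy denoiser; the real sampling code is invoked via `python -m sample.generate` (see below).

```python
import numpy as np

def rollout(denoise, text_emb, n_segments=4, context_len=20, pred_len=40, n_feats=263):
    """Toy DiP-style rollout: each 2-second segment is denoised in only 10 steps."""
    motion = np.zeros((context_len, n_feats))         # stand-in for a clean data prefix
    for _ in range(n_segments):
        x = np.random.randn(pred_len, n_feats)        # each segment starts from noise
        for t in reversed(range(10)):                 # 10 diffusion steps per segment
            x = denoise(motion[-context_len:], x, t, text_emb)
        motion = np.concatenate([motion, x], axis=0)  # append the new 40 frames (~2s)
    return motion

dummy = lambda prefix, x_t, t, txt: 0.9 * x_t         # placeholder denoiser
print(rollout(dummy, text_emb=None).shape)            # (20 + 4 * 40, 263)
```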


## Results


The official results of MDM and DiP to cite in your paper:


![fixed_results](assets/fixed_results.png)


- Blue marks entries from the original paper that have been corrected.
- You can use [this](assets/fixed_results.tex) `.tex` file.




## Bibtex




```
MDM:
@inproceedings{
tevet2023human,
title={Human Motion Diffusion Model},
author={Guy Tevet and Sigal Raab and Brian Gordon and Yoni Shafir and Daniel Cohen-or and Amit Haim Bermano},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=SJ1kSyO2jwu}
}
DiP and CLoSD:
@article{tevet2024closd,
title={CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control},
author={Tevet, Guy and Raab, Sigal and Cohan, Setareh and Reda, Daniele and Luo, Zhengyi and Peng, Xue Bin and Bermano, Amit H and van de Panne, Michiel},
journal={arXiv preprint arXiv:2410.03441},
year={2024}
}
```






## Architecture


- DiP is a transformer decoder.
- It encodes the text with a frozen [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert).
- It supports an additional target-location condition, which [CLoSD](https://guytevet.github.io/CLoSD-page/) uses for object interaction.
- At each diffusion step, it receives the clean prefix together with the prediction noised to step `t`.
- For the full implementation details, check out [the paper](https://arxiv.org/abs/2410.03441). A minimal code sketch follows the figure below.


![dip_spec](assets/dip_arch_small.png)
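
To make these bullets concrete, here is a hedged PyTorch sketch of a DiP-style forward pass. All names, dimensions, and layer counts are illustrative assumptions rather than the repo's actual API, and masking is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer

class DiPSketch(nn.Module):
    """Transformer decoder that cross-attends to frozen DistilBERT text features.
    The input sequence is [clean prefix | prediction noised to step t]."""
    def __init__(self, n_feats=263, d_model=512, context_len=20, pred_len=40):
        super().__init__()
        self.context_len = context_len
        self.in_proj = nn.Linear(n_feats, d_model)
        self.out_proj = nn.Linear(d_model, n_feats)
        self.t_embed = nn.Embedding(10, d_model)   # one embedding per diffusion step
        self.text_enc = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.text_enc.requires_grad_(False)        # the text encoder stays frozen
        self.text_proj = nn.Linear(self.text_enc.config.dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=8)

    def forward(self, prefix, noised_pred, t, text_tokens):
        # prefix: [B, context_len, n_feats] clean frames
        # noised_pred: [B, pred_len, n_feats] prediction noised to step t
        x = self.in_proj(torch.cat([prefix, noised_pred], dim=1))
        x = x + self.t_embed(t)[:, None, :]                    # timestep embedding
        text = self.text_enc(**text_tokens).last_hidden_state  # [B, L, 768]
        h = self.decoder(tgt=x, memory=self.text_proj(text))   # cross-attend to text
        return self.out_proj(h)[:, self.context_len:]          # denoised prediction

# Hypothetical usage:
# tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# model = DiPSketch()
# out = model(torch.zeros(1, 20, 263), torch.randn(1, 40, 263),
#             torch.tensor([9]), tok(["a person walks"], return_tensors="pt"))
```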






## Setup


Follow the setup instructions of [MDM](README.md), then download the checkpoints and place them in `save/`.


### Model Checkpoints


[DiP]() (For text-to-motion)


[DiP with target conditioning]() (For the CLoSD applications)


- **Note:** The DiP code is also included in the [CLoSD code base](https://github.com/GuyTevet/CLoSD). If you would like to run the full CLoSD system, use that code base instead.




## Demo


A demo would be awesome (if we had one 😬). If you want to create it and earn eternal glory, the original [MDM Demo](https://replicate.com/daanelson/motion_diffusion_model) might be a good starting point.


## Generate


```shell
python -m sample.generate \
--model_path save/target_10steps_context20_predict40/model000200000.pt \
--autoregressive --guidance_param 7.5
```


- This will use prompts from the dataset.
- To use your own prompt, add `--text_prompt "A person throws a ball."` (see the example below).
- To change the prompt on the fly, add `--dynamic_text_path assets/example_dynamic_text_prompts.txt`. Each line in that file corresponds to a single prediction, i.e., two seconds of motion.
- **Note:** The initial prefix is still sampled from the data. For example, if you ask the person to throw a ball but happen to sample a sitting prefix, they will first need to get up before throwing.
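
Putting the flags together, a run with a custom prompt looks like this (same checkpoint as above):

```shell
python -m sample.generate \
    --model_path save/target_10steps_context20_predict40/model000200000.pt \
    --autoregressive --guidance_param 7.5 \
    --text_prompt "A person throws a ball."
```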


**You may also define:**


* `--num_samples` (default is 10) / `--num_repetitions` (default is 3).
* `--device` id.
* `--seed` to sample different prompts.
* `--motion_length` (text-to-motion only) in seconds.


**Running those will get you:**


* `results.npy` - a file with the text prompts and xyz positions of the generated animations (see the loading snippet below).
* `sample##_rep##.mp4` - a stick-figure animation for each generated motion.
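
To post-process the outputs, a minimal loading snippet might look like this. It assumes `results.npy` stores a pickled dict with `motion` and `text` entries, as in previous MDM releases; check the saved file for the exact keys.

```python
import numpy as np

results = np.load("results.npy", allow_pickle=True).item()
print(results["text"][0])        # prompt of the first generated sample
print(results["motion"].shape)   # xyz joint positions of all generated motions
```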


## Evaluate


The evaluation results can be found in the `.log` file of each checkpoint directory.
To reproduce them, run:


```shell
python -m eval.eval_humanml --model_path save/DiP_no-target_10steps_context20_predict40/model000600343.pt --autoregressive --guidance_param 7.5
```


**You may also define:**
* `--train_platform_type WandBPlatform` to log the results to WandB.
* `--eval_mode mm_short` to compute the multimodality metric (a combined example follows).
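
For example, a single run that logs to WandB and also computes multimodality:

```shell
python -m eval.eval_humanml \
    --model_path save/DiP_no-target_10steps_context20_predict40/model000600343.pt \
    --autoregressive --guidance_param 7.5 \
    --train_platform_type WandBPlatform --eval_mode mm_short
```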




## Train your own DiP


To reproduce DiP, run:


```shell
python -m train.train_mdm \
--save_dir save/my_humanml_DiP \
--dataset humanml --arch trans_dec --text_encoder_type bert \
--diffusion_steps 10 --context_len 20 --pred_len 40 \
--mask_frames --use_ema --autoregressive --gen_guidance_param 7.5
```


* **Recommended:** Add `--eval_during_training` and `--gen_during_training` to evaluate and generate motions for each saved checkpoint.
This will slow down training but will give you better monitoring.
* **Note:** `--use_ema` (Exponential Moving Average weight averaging) and `--mask_frames` (a masking bug fix) are already included in the command above; both improve performance.
* To add target conditioning for the CLoSD applications, add `--lambda_target_loc 1.` (a combined example follows this list).
* Use `--device` to define the GPU id.
* Use `--arch` to choose one of the architectures reported in the paper `{trans_enc, trans_dec, gru}` (`trans_enc` is the default).
* Use `--text_encoder_type` to choose the text encoder `{clip, bert}` (`clip` is the default).
* Add `--train_platform_type {WandBPlatform, TensorboardPlatform}` to track results with either [WandB](https://wandb.ai/site/) or [Tensorboard](https://www.tensorflow.org/tensorboard).
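
Putting the options together, training DiP with target conditioning and WandB logging might look like this (the `--save_dir` name is just an example):

```shell
python -m train.train_mdm \
    --save_dir save/my_humanml_DiP_target \
    --dataset humanml --arch trans_dec --text_encoder_type bert \
    --diffusion_steps 10 --context_len 20 --pred_len 40 \
    --mask_frames --use_ema --autoregressive --gen_guidance_param 7.5 \
    --lambda_target_loc 1. --train_platform_type WandBPlatform
```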
58 changes: 55 additions & 3 deletions README.md
@@ -26,12 +26,25 @@ Performance improvement is due to an evaluation bug fix. BLUE marks fixed entries
- You can use [this](assets/fixed_results.tex) `.tex` file.
- The fixed **KIT** results are available [here](https://github.com/GuyTevet/motion-diffusion-model/issues/211#issue-2369160290).


## [NEW] DiP: Ultra-fast Text-to-motion

### DiP is now part of the MDM code base!

### [Here's how to use it](DiP.md)

![DiP](https://github.com/GuyTevet/mdm-page/raw/main/static/figures/dip_vis_caption_small.gif)



## Bibtex
🔴🔴🔴**NOTE: MDM and MotionDiffuse are NOT the same paper!** For some reason, Google Scholar merged the two papers. The right way to cite MDM is:

<!-- If you find this code useful in your research, please cite: -->

```
MDM:
@inproceedings{
tevet2023human,
title={Human Motion Diffusion Model},
@@ -40,10 +53,28 @@ booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=SJ1kSyO2jwu}
}
DiP and CLoSD:
@article{tevet2024closd,
title={CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control},
author={Tevet, Guy and Raab, Sigal and Cohan, Setareh and Reda, Daniele and Luo, Zhengyi and Peng, Xue Bin and Bermano, Amit H and van de Panne, Michiel},
journal={arXiv preprint arXiv:2410.03441},
year={2024}
}
```

## News

📢 **12/Feb/25** - Added many things:
* [The DiP model](DiP.md)
* MDM with a DistilBERT text encoder (add `--text_encoder_type bert`).
* A `--gen_during_training` feature.
* A `--mask_frames` bug fix.
* `--use_ema` - weight averaging with an Exponential Moving Average (sketched below).
* Dataset caching for faster loading (on by default).
* The `eval_humanml` script can now log to WandB.
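
For reference, a minimal sketch of an EMA update (a generic illustration with a hypothetical decay value, not the repo's exact implementation):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 4)            # stand-in for the diffusion model
ema_model = copy.deepcopy(model)   # averaged copy, used for eval/generation

@torch.no_grad()
def update_ema(ema, live, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * live, called once per optimizer step
    for e, p in zip(ema.parameters(), live.parameters()):
        e.mul_(decay).add_(p, alpha=1 - decay)

update_ema(ema_model, model)
```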

📢 **29/Jan/25** - Added WandB support with `--train_platform_type WandBPlatform`.

📢 **15/Apr/24** - Released a [50 diffusion steps model](https://drive.google.com/file/d/1cfadR1eZ116TIdXK7qDX1RugAerEiJXr/view?usp=sharing) (instead of 1000 steps) which runs 20X faster 🤩🤩🤩 with comparable results.
@@ -388,10 +419,27 @@ The output will look like this (blue joints are from the input motion; orange we
<summary><b>Text to Motion</b></summary>

**HumanML3D**

To reproduce the original paper model, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512 --dataset humanml
```

To reproduce MDM-50 steps, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512_50steps --dataset humanml --diffusion_steps 50 --mask_frames --use_ema
```

To reproduce MDM+DistilBERT, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_dec_bert_512 --dataset humanml --diffusion_steps 50 --arch trans_dec --text_encoder_type bert --mask_frames --use_ema
```

To train with full monitoring (WandB logging, plus evaluation and generation during training), run:

```shell
python -m train.train_mdm --save_dir save/humanml_trans_dec_bert_512_3 --dataset humanml --train_platform_type WandBPlatform --overwrite --eval_during_training --gen_during_training --diffusion_steps 50 --use_ema --arch trans_dec --text_encoder_type bert --mask_frames
```

**KIT**
```shell
python -m train.train_mdm --save_dir save/my_kit_trans_enc_512 --dataset kit
@@ -413,19 +461,23 @@ python -m train.train_mdm --save_dir save/my_name --dataset humanact12 --cond_ma
```
</details>


* **Recommended:** Add `--eval_during_training` and `--gen_during_training` to evaluate and generate motions for each saved checkpoint (each evaluation takes about 90 minutes).
This will slow down training but will give you better monitoring.
* **Recommended:** Add `--use_ema` for Exponential Moving Average, and `--mask_frames` to fix a masking bug. Both improve performance.
* Use `--diffusion_steps 50` to train the faster model with fewer diffusion steps.
* Use `--device` to define the GPU id.
* Use `--arch` to choose one of the architectures reported in the paper `{trans_enc, trans_dec, gru}` (`trans_enc` is the default).
* Use `--text_encoder_type` to choose the text encoder `{clip, bert}` (`clip` is the default).
* Add `--train_platform_type {WandBPlatform, TensorboardPlatform}` to track results with either [WandB](https://wandb.ai/site/) or [Tensorboard](https://www.tensorflow.org/tensorboard).


## Evaluate

<details>
<summary><b>Text to Motion</b></summary>

<!-- * Takes about 20 hours (on a single GPU) -->
* The output of this script for the pre-trained models (as was reported in the paper) is provided in the checkpoints zip file.

**HumanML3D**
Binary file added assets/dip_arch_small.png
Binary file added assets/dip_spec.png
12 changes: 12 additions & 0 deletions assets/example_dynamic_text_prompts.txt
@@ -0,0 +1,12 @@
A person is walking.
A person is walking.
A person walks forward, bends down to pick something up off the ground.
A person walks forward, bends down to pick something up off the ground.
A person is getting up and perform jumping jacks.
A person is getting up and perform jumping jacks.
A person is running.
A person is running.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
Binary file modified assets/fixed_results.png
10 changes: 7 additions & 3 deletions assets/fixed_results.tex
@@ -1,4 +1,3 @@

% add the following to main:
% \usepackage{xcolor}
% \usepackage{soul}
@@ -18,8 +17,13 @@
\midrule
Real & $0.512^{\pm.002}$ & $0.702^{\pm.002}$ & $0.797^{\pm.002}$ & $0.002^{\pm.000}$ & $2.97^{\pm.008}$ & $9.50^{\pm.065}$ & -\\

MDM (paper model) & $0.418^{\pm.005}$ & $0.604^{\pm.005}$ & \hlfancy{beaublue}{${{0.707^{\pm.004}}}$} & $0.489^{\pm.025}$ & \hlfancy{beaublue}{${3.63^{\pm.023}}$} & $9.45^{\pm.066}$ & ${2.87^{\pm.111}}$ \\
MDM-50steps (20X faster) & $0.455^{\pm.006}$ & $0.645^{\pm.007}$ & {${{0.749^{\pm.006}}}$} & $0.489^{\pm.047}$ & {${3.33^{\pm.025}}$} & $9.92^{\pm.083}$ & ${2.29^{\pm.07}}$ \\
\;\;\; with DistilBERT & $0.491^{\pm.006}$ & $0.709^{\pm.006}$ & {${{0.815^{\pm.005}}}$} & $0.495^{\pm.041}$ & {${3.04^{\pm.016}}$} & $9.88^{\pm.098}$ & ${1.67^{\pm.12}}$ \\
\midrule

DiP & $0.458^{\pm.006}$ & $0.664^{\pm.005}$ & {${{0.768^{\pm.004}}}$} & $0.228^{\pm.027}$ & {${3.23^{\pm.019}}$} & $9.41^{\pm.067}$ & ${1.04^{\pm.08}}$ \\
\;\;\; with target cond & $0.452^{\pm.006}$ & $0.661^{\pm.006}$ & {${{0.772^{\pm.006}}}$} & $0.232^{\pm.029}$ & {${3.22^{\pm.019}}$} & $9.47^{\pm.108}$ & ${1.14^{\pm.05}}$ \\

\bottomrule
\end{tabular}
