Commit d178946

Added many features: DiP, BERT, EMA (the rest is detailed in the README)

GuyTevet committed Feb 12, 2025
1 parent 395f382 commit d178946

Showing 36 changed files with 1,574 additions and 309 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -130,3 +130,11 @@ dmypy.json

save/
wandb/
t2m/
body_models/
glove/
slurm**/
*.out
*.err
*.slurm
.vscode/
191 changes: 191 additions & 0 deletions DiP.md
@@ -0,0 +1,191 @@
# DiP


Diffusion Planner (DiP) is an ultra-fast text-to-motion diffusion model. It is our newest version of [MDM](README.md)! It was published in the CLoSD paper [ICLR 2025 Spotlight]. To read more about it, check out the [CLoSD project page](https://guytevet.github.io/CLoSD-page/) and [code](https://github.com/GuyTevet/CLoSD).


![DiP](https://github.com/GuyTevet/mdm-page/raw/main/static/figures/dip_vis_caption_small.gif)




## Performance


![dip_spec](assets/dip_spec.png)


### Why is DiP so fast? Here's the TL;DR:


- DiP is autoregressive: each call predicts only the next 2 seconds of motion (sketched below).
- DiP uses only 10 diffusion steps (it performs well even with 5).
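
For intuition, here is a minimal sketch of that autoregressive loop. It is illustrative only, with hypothetical names and a dummy denoiser; the real sampling code is invoked via `python -m sample.generate` (see below).

```python
import numpy as np

def rollout(denoise, text_emb, n_segments=4, context_len=20, pred_len=40, n_feats=263):
    """Toy DiP-style rollout: each 2-second segment is denoised in only 10 steps."""
    motion = np.zeros((context_len, n_feats))         # stand-in for a clean data prefix
    for _ in range(n_segments):
        x = np.random.randn(pred_len, n_feats)        # each segment starts from noise
        for t in reversed(range(10)):                 # 10 diffusion steps per segment
            x = denoise(motion[-context_len:], x, t, text_emb)
        motion = np.concatenate([motion, x], axis=0)  # append the new 40 frames (~2s)
    return motion

dummy = lambda prefix, x_t, t, txt: 0.9 * x_t         # placeholder denoiser
print(rollout(dummy, text_emb=None).shape)            # (20 + 4 * 40, 263)
```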


## Results


The official results of MDM and DiP to cite in your paper:


![fixed_results](assets/fixed_results.png)


- Blue marks entries from the original paper that have been corrected.
- You can use [this](assets/fixed_results.tex) `.tex` file.




## Bibtex




```
MDM:
@inproceedings{
tevet2023human,
title={Human Motion Diffusion Model},
author={Guy Tevet and Sigal Raab and Brian Gordon and Yoni Shafir and Daniel Cohen-or and Amit Haim Bermano},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=SJ1kSyO2jwu}
}
DiP and CLoSD:
@article{tevet2024closd,
title={CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control},
author={Tevet, Guy and Raab, Sigal and Cohan, Setareh and Reda, Daniele and Luo, Zhengyi and Peng, Xue Bin and Bermano, Amit H and van de Panne, Michiel},
journal={arXiv preprint arXiv:2410.03441},
year={2024}
}
```






## Architecture


- DiP is a transformer decoder.
- It encodes the text with a frozen [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert).
- It supports an additional target-location condition, which [CLoSD](https://guytevet.github.io/CLoSD-page/) uses for object interaction.
- At each diffusion step, it receives the clean prefix together with the prediction noised to step `t`.
- For the full implementation details, check out [the paper](https://arxiv.org/abs/2410.03441). A minimal code sketch follows the figure below.


![dip_spec](assets/dip_arch_small.png)
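
To make these bullets concrete, here is a hedged PyTorch sketch of a DiP-style forward pass. All names, dimensions, and layer counts are illustrative assumptions rather than the repo's actual API, and masking is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer

class DiPSketch(nn.Module):
    """Transformer decoder that cross-attends to frozen DistilBERT text features.
    The input sequence is [clean prefix | prediction noised to step t]."""
    def __init__(self, n_feats=263, d_model=512, context_len=20, pred_len=40):
        super().__init__()
        self.context_len = context_len
        self.in_proj = nn.Linear(n_feats, d_model)
        self.out_proj = nn.Linear(d_model, n_feats)
        self.t_embed = nn.Embedding(10, d_model)   # one embedding per diffusion step
        self.text_enc = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.text_enc.requires_grad_(False)        # the text encoder stays frozen
        self.text_proj = nn.Linear(self.text_enc.config.dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=8)

    def forward(self, prefix, noised_pred, t, text_tokens):
        # prefix: [B, context_len, n_feats] clean frames
        # noised_pred: [B, pred_len, n_feats] prediction noised to step t
        x = self.in_proj(torch.cat([prefix, noised_pred], dim=1))
        x = x + self.t_embed(t)[:, None, :]                    # timestep embedding
        text = self.text_enc(**text_tokens).last_hidden_state  # [B, L, 768]
        h = self.decoder(tgt=x, memory=self.text_proj(text))   # cross-attend to text
        return self.out_proj(h)[:, self.context_len:]          # denoised prediction

# Hypothetical usage:
# tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# model = DiPSketch()
# out = model(torch.zeros(1, 20, 263), torch.randn(1, 40, 263),
#             torch.tensor([9]), tok(["a person walks"], return_tensors="pt"))
```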






## Setup


Follow the setup instructions of [MDM](README.md), then download the checkpoints and place them in `save/`.


### Model Checkpoints


[DiP]() (For text-to-motion)


[DiP with target conditioning]() (For the CLoSD applications)


- **Note:** The DiP code is also included in the [CLoSD code base](https://github.com/GuyTevet/CLoSD). If you would like to run the full CLoSD system, use that code base instead.




## Demo


A demo would be awesome (if we had one 😬). If you want to create it and earn eternal glory, the original [MDM Demo](https://replicate.com/daanelson/motion_diffusion_model) might be a good starting point.


## Generate


```shell
python -m sample.generate \
--model_path save/target_10steps_context20_predict40/model000200000.pt \
--autoregressive --guidance_param 7.5
```


- This will use prompts from the dataset.
- To use your own prompt, add `--text_prompt "A person throws a ball."` (see the example below).
- To change the prompt on the fly, add `--dynamic_text_path assets/example_dynamic_text_prompts.txt`. Each line in that file corresponds to a single prediction, i.e., two seconds of motion.
- **Note:** The initial prefix is still sampled from the data. For example, if you ask the person to throw a ball but happen to sample a sitting prefix, they will first need to get up before throwing.
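
Putting the flags together, a run with a custom prompt looks like this (same checkpoint as above):

```shell
python -m sample.generate \
    --model_path save/target_10steps_context20_predict40/model000200000.pt \
    --autoregressive --guidance_param 7.5 \
    --text_prompt "A person throws a ball."
```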


**You may also define:**


* `--num_samples` (default is 10) / `--num_repetitions` (default is 3).
* `--device` id.
* `--seed` to sample different prompts.
* `--motion_length` (text-to-motion only) in seconds.


**Running those will get you:**


* `results.npy` - a file with the text prompts and xyz positions of the generated animations (see the loading snippet below).
* `sample##_rep##.mp4` - a stick-figure animation for each generated motion.
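
To post-process the outputs, a minimal loading snippet might look like this. It assumes `results.npy` stores a pickled dict with `motion` and `text` entries, as in previous MDM releases; check the saved file for the exact keys.

```python
import numpy as np

results = np.load("results.npy", allow_pickle=True).item()
print(results["text"][0])        # prompt of the first generated sample
print(results["motion"].shape)   # xyz joint positions of all generated motions
```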


## Evaluate


The evaluation results can be found in the `.log` file of each checkpoint directory.
To reproduce them, run:


```shell
python -m eval.eval_humanml --model_path save/DiP_no-target_10steps_context20_predict40/model000600343.pt --autoregressive --guidance_param 7.5
```


**You may also define:**
* `--train_platform_type WandBPlatform` to log the results to WandB.
* `--eval_mode mm_short` to compute the multimodality metric (a combined example follows).
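
For example, a single run that logs to WandB and also computes multimodality:

```shell
python -m eval.eval_humanml \
    --model_path save/DiP_no-target_10steps_context20_predict40/model000600343.pt \
    --autoregressive --guidance_param 7.5 \
    --train_platform_type WandBPlatform --eval_mode mm_short
```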




## Train your own DiP


To reproduce DiP, run:


```shell
python -m train.train_mdm \
--save_dir save/my_humanml_DiP \
--dataset humanml --arch trans_dec --text_encoder_type bert \
--diffusion_steps 10 --context_len 20 --pred_len 40 \
--mask_frames --use_ema --autoregressive --gen_guidance_param 7.5
```


* **Recommended:** Add `--eval_during_training` and `--gen_during_training` to evaluate and generate motions for each saved checkpoint.
This will slow down training but will give you better monitoring.
* **Note:** `--use_ema` (Exponential Moving Average weight averaging) and `--mask_frames` (a masking bug fix) are already included in the command above; both improve performance.
* To add target conditioning for the CLoSD applications, add `--lambda_target_loc 1.` (a combined example follows this list).
* Use `--device` to define the GPU id.
* Use `--arch` to choose one of the architectures reported in the paper `{trans_enc, trans_dec, gru}` (`trans_enc` is the default).
* Use `--text_encoder_type` to choose the text encoder `{clip, bert}` (`clip` is the default).
* Add `--train_platform_type {WandBPlatform, TensorboardPlatform}` to track results with either [WandB](https://wandb.ai/site/) or [Tensorboard](https://www.tensorflow.org/tensorboard).
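
Putting the options together, training DiP with target conditioning and WandB logging might look like this (the `--save_dir` name is just an example):

```shell
python -m train.train_mdm \
    --save_dir save/my_humanml_DiP_target \
    --dataset humanml --arch trans_dec --text_encoder_type bert \
    --diffusion_steps 10 --context_len 20 --pred_len 40 \
    --mask_frames --use_ema --autoregressive --gen_guidance_param 7.5 \
    --lambda_target_loc 1. --train_platform_type WandBPlatform
```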
58 changes: 55 additions & 3 deletions README.md
@@ -26,12 +26,25 @@ Performance improvement is due to an evaluation bug fix. BLUE marks fixed entries
- You can use [this](assets/fixed_results.tex) `.tex` file.
- The fixed **KIT** results are available [here](https://github.com/GuyTevet/motion-diffusion-model/issues/211#issue-2369160290).


## [NEW] DiP: Ultra-fast Text-to-motion

### DiP is now part of the MDM code base!

### [Here's how to use it](DiP.md)

![DiP](https://github.com/GuyTevet/mdm-page/raw/main/static/figures/dip_vis_caption_small.gif)



## Bibtex
🔴🔴🔴**NOTE: MDM and MotionDiffuse are NOT the same paper!** For some reason, Google Scholar merged the two papers. The right way to cite MDM is:

<!-- If you find this code useful in your research, please cite: -->

```
MDM:
@inproceedings{
tevet2023human,
title={Human Motion Diffusion Model},
@@ -40,10 +53,28 @@ booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=SJ1kSyO2jwu}
}
DiP and CLoSD:
@article{tevet2024closd,
title={CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control},
author={Tevet, Guy and Raab, Sigal and Cohan, Setareh and Reda, Daniele and Luo, Zhengyi and Peng, Xue Bin and Bermano, Amit H and van de Panne, Michiel},
journal={arXiv preprint arXiv:2410.03441},
year={2024}
}
```

## News

📢 **12/Feb/25** - Added many things:
* [The DiP model](DiP.md)
* MDM with a DistilBERT text encoder (add `--text_encoder_type bert`).
* A `--gen_during_training` feature.
* A `--mask_frames` bug fix.
* `--use_ema` - weight averaging with an Exponential Moving Average (sketched below).
* Dataset caching for faster loading (on by default).
* The `eval_humanml` script can now log to WandB.
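
For reference, a minimal sketch of an EMA update (a generic illustration with a hypothetical decay value, not the repo's exact implementation):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 4)            # stand-in for the diffusion model
ema_model = copy.deepcopy(model)   # averaged copy, used for eval/generation

@torch.no_grad()
def update_ema(ema, live, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * live, called once per optimizer step
    for e, p in zip(ema.parameters(), live.parameters()):
        e.mul_(decay).add_(p, alpha=1 - decay)

update_ema(ema_model, model)
```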

📢 **29/Jan/25** - Added WandB support with `--train_platform_type WandBPlatform`.

📢 **15/Apr/24** - Released a [50 diffusion steps model](https://drive.google.com/file/d/1cfadR1eZ116TIdXK7qDX1RugAerEiJXr/view?usp=sharing) (instead of 1000 steps) which runs 20X faster 🤩🤩🤩 with comparable results.
@@ -388,10 +419,27 @@ The output will look like this (blue joints are from the input motion; orange we
<summary><b>Text to Motion</b></summary>

**HumanML3D**

To reproduce the original paper model, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512 --dataset humanml
```

To reproduce MDM-50 steps, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512_50steps --dataset humanml --diffusion_steps 50 --mask_frames --use_ema
```

To reproduce MDM+DistilBERT, run:

```shell
python -m train.train_mdm --save_dir save/my_humanml_trans_dec_bert_512 --dataset humanml --diffusion_steps 50 --arch trans_dec --text_encoder_type bert --mask_frames --use_ema
```

To train with full monitoring (WandB logging, plus evaluation and generation during training), run:

```shell
python -m train.train_mdm --save_dir save/humanml_trans_dec_bert_512_3 --dataset humanml --train_platform_type WandBPlatform --overwrite --eval_during_training --gen_during_training --diffusion_steps 50 --use_ema --arch trans_dec --text_encoder_type bert --mask_frames
```

**KIT**
```shell
python -m train.train_mdm --save_dir save/my_kit_trans_enc_512 --dataset kit
@@ -413,19 +461,23 @@ python -m train.train_mdm --save_dir save/my_name --dataset humanact12 --cond_ma
```
</details>


* **Recommended:** Add `--eval_during_training` and `--gen_during_training` to evaluate and generate motions for each saved checkpoint (each evaluation takes about 90 minutes).
This will slow down training but will give you better monitoring.
* **Recommended:** Add `--use_ema` for Exponential Moving Average, and `--mask_frames` to fix a masking bug. Both improve performance.
* Use `--diffusion_steps 50` to train the faster model with fewer diffusion steps.
* Use `--device` to define the GPU id.
* Use `--arch` to choose one of the architectures reported in the paper `{trans_enc, trans_dec, gru}` (`trans_enc` is the default).
* Use `--text_encoder_type` to choose the text encoder `{clip, bert}` (`clip` is the default).
* Add `--train_platform_type {WandBPlatform, TensorboardPlatform}` to track results with either [WandB](https://wandb.ai/site/) or [Tensorboard](https://www.tensorflow.org/tensorboard).


## Evaluate

<details>
<summary><b>Text to Motion</b></summary>

<!-- * Takes about 20 hours (on a single GPU) -->
* The output of this script for the pre-trained models (as was reported in the paper) is provided in the checkpoints zip file.

**HumanML3D**
Binary file added assets/dip_arch_small.png
Binary file added assets/dip_spec.png
12 changes: 12 additions & 0 deletions assets/example_dynamic_text_prompts.txt
@@ -0,0 +1,12 @@
A person is walking.
A person is walking.
A person walks forward, bends down to pick something up off the ground.
A person walks forward, bends down to pick something up off the ground.
A person is getting up and perform jumping jacks.
A person is getting up and perform jumping jacks.
A person is running.
A person is running.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
A person punches in a manner consistent with martial arts.
Binary file modified assets/fixed_results.png
10 changes: 7 additions & 3 deletions assets/fixed_results.tex
@@ -1,4 +1,3 @@

% add the following to main:
% \usepackage{xcolor}
% \usepackage{soul}
@@ -18,8 +17,13 @@
\midrule
Real & $0.512^{\pm.002}$ & $0.702^{\pm.002}$ & $0.797^{\pm.002}$ & $0.002^{\pm.000}$ & $2.97^{\pm.008}$ & $9.50^{\pm.065}$ & -\\

MDM (paper model) & $0.418^{\pm.005}$ & $0.604^{\pm.005}$ & \hlfancy{beaublue}{${{0.707^{\pm.004}}}$} & $0.489^{\pm.025}$ & \hlfancy{beaublue}{${3.63^{\pm.023}}$} & $9.45^{\pm.066}$ & ${2.87^{\pm.111}}$ \\
MDM-50steps (20X faster) & $0.455^{\pm.006}$ & $0.645^{\pm.007}$ & {${{0.749^{\pm.006}}}$} & $0.489^{\pm.047}$ & {${3.33^{\pm.025}}$} & $9.92^{\pm.083}$ & ${2.29^{\pm.07}}$ \\
\;\;\; with DistilBERT & $0.491^{\pm.006}$ & $0.709^{\pm.006}$ & {${{0.815^{\pm.005}}}$} & $0.495^{\pm.041}$ & {${3.04^{\pm.016}}$} & $9.88^{\pm.098}$ & ${1.67^{\pm.12}}$ \\
\midrule

DiP & $0.458^{\pm.006}$ & $0.664^{\pm.005}$ & {${{0.768^{\pm.004}}}$} & $0.228^{\pm.027}$ & {${3.23^{\pm.019}}$} & $9.41^{\pm.067}$ & ${1.04^{\pm.08}}$ \\
\;\;\; with target cond & $0.452^{\pm.006}$ & $0.661^{\pm.006}$ & {${{0.772^{\pm.006}}}$} & $0.232^{\pm.029}$ & {${3.22^{\pm.019}}$} & $9.47^{\pm.108}$ & ${1.14^{\pm.05}}$ \\

\bottomrule
\end{tabular}
