(CVPR 2025) MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
Bizhu Wu · Jinheng Xie · Keming Shen · Zhe Kong
Jianfeng Ren* · Ruibin Bai · Rong Qu · Linlin Shen*
*Corresponding Authors
MG-MotionLLM addresses diverse motion-relevant tasks at multiple granularities in a unified manner, simply by issuing different instructions.
- coarse-grained: e.g., text-to-motion and motion captioning (upper block);
- fine-grained: e.g., motion-to-detailed text and motion localization (bottom block).
To achieve this, we propose a multi-granularity training scheme whose novel auxiliary tasks capture motion-related features at different levels, improving understanding across a wide range of tasks. Specifically, we pretrain the model on a total of 28 distinct motion-relevant tasks: 12 existing classical coarse-grained tasks and 16 newly proposed fine-grained ones. Here, we display example prompt templates for a subset of the tasks used during training.
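As a concrete illustration, below is a minimal sketch of how such tasks can be phrased as unified text instructions. The task names, template wording, and helper function are our own illustrative assumptions, not the exact prompts or code from the paper.

```python
# A minimal sketch of phrasing multi-granularity tasks as unified text
# instructions. Task names and template wording are illustrative assumptions,
# not the paper's exact prompts.
PROMPT_TEMPLATES = {
    # coarse-grained tasks
    "text_to_motion": "Generate a motion that matches the description: {caption}",
    "motion_captioning": "Describe the motion in one sentence: {motion}",
    # fine-grained tasks
    "motion_to_detailed_text": "Describe each body part's movement in detail: {motion}",
    "motion_localization": "Return the time span where this happens: {detail} Motion: {motion}",
}

def build_prompt(task: str, **fields: str) -> str:
    """Fill the chosen task template, yielding a single unified text prompt."""
    return PROMPT_TEMPLATES[task].format(**fields)

# Example:
# build_prompt("text_to_motion", caption="a person waves with the right hand")
```

The point of the sketch is that coarse- and fine-grained tasks differ only in their instruction text, so one model can serve all of them.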
We showcase several novel applications of MG-MotionLLM.
- text-driven fine-grained motion editing: Temporal Editing (left), Spatial Editing (middle), and Spatial-Temporal Editing (right); see the sketch after this list.
- fine-grained captioning of both whole (top) and partial (bottom) motion sequences, and motion localization via fine-grained textual description (middle).
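For the editing applications above, here is a minimal sketch of how an edit request might be packed into a single prompt. The helper name and the instruction wording are hypothetical illustrations, not the paper's API.

```python
# A hypothetical helper showing how a fine-grained edit request could be
# expressed as one unified prompt (names and wording are assumptions,
# not the paper's exact interface).
def build_edit_prompt(motion_tokens: str, edit_instruction: str) -> str:
    """Combine source motion tokens with a textual edit instruction."""
    return f"Edit the following motion. Instruction: {edit_instruction}. Motion: {motion_tokens}"

# Temporal editing: change *when* something happens.
print(build_edit_prompt("<motion_tokens>", "make the jump happen one second earlier"))
# Spatial editing: change *which body part* does what.
print(build_edit_prompt("<motion_tokens>", "keep the left arm raised throughout"))
# Spatial-temporal editing: change both at once.
print(build_edit_prompt("<motion_tokens>", "wave the right hand only during the last two steps"))
```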
For code, model weights, etc., please see here.
If you use our code in your research, please cite our work:
@article{wu2025mg,
  title={MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities},
  author={Wu, Bizhu and Xie, Jinheng and Shen, Keming and Kong, Zhe and Ren, Jianfeng and Bai, Ruibin and Qu, Rong and Shen, Linlin},
  journal={arXiv preprint arXiv:2504.02478},
  year={2025}
}