[Project Page] [Paper]
Jinliang Zheng*, Jianxiong Li*, Dongxiu Liu*, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
We introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. Our learned universal actions capture the generic atomic behaviors shared across diverse robots by exploiting their common structural features, and they enable enhanced cross-domain data utilization and cross-embodiment generalization by eliminating the notorious action heterogeneity. Moreover, universal actions can be efficiently translated back into heterogeneous actionable commands by simply adding embodiment-specific details, making fast adaptation to new robots simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14x larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robotic environments, showcasing exceptional cross-embodiment control and adaptation capability and highlighting the crucial benefit of adopting universal actions.
- If you find this repo useful, please kindly cite us:
@misc{zheng2025universalactionsenhancedembodied,
title={Universal Actions for Enhanced Embodied Foundation Models},
author={Jinliang Zheng and Jianxiong Li and Dongxiu Liu and Yinan Zheng and Zhihao Wang and Zhonghong Ou and Yu Liu and Jingjing Liu and Ya-Qin Zhang and Xianyuan Zhan},
year={2025},
eprint={2501.10105},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.10105},
}
- If you have any questions about the code, feel free to raise an issue or contact the authors directly: Jinliang Zheng, Jianxiong Li
Please note that the following guidance is only for model deployment; for training, kindly refer to the Train section below.
git clone https://github.com/2toinf/UniAct-Preview.git
cd UniAct-Preview
pip install -r requirements.txt
First, download the pretrained UniAct checkpoints from the Model Zoo below.
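If the Model Zoo checkpoints are hosted on the Hugging Face Hub (the hf_link entries in the table below), they can also be fetched programmatically. The following is only a minimal sketch: the repo ID and filenames are placeholders, not the actual release names.
from huggingface_hub import hf_hub_download

# Placeholder repo_id and filenames -- substitute the actual hf_link entries
# from the Model Zoo table below.
base_ckpt_path = hf_hub_download(repo_id="your-org/UniAct", filename="basemodel.pt")
head_ckpt_path = hf_hub_download(repo_id="your-org/UniAct", filename="libero_mlp.pt")
# The returned local cache paths can be passed to torch.load in the snippet below.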
import torch
from timm.models import create_model

import models.UniAct_V1  # registers the UniAct models with timm's create_model

uniact_model = create_model("UniAct_05B_CodeBook_256_V1")
### First, load the pretrained universal action extractor / vision backbone / codebook
uniact_model.load_state_dict(torch.load("model path here"), strict=False)
### Then load the embodiment-specific decoder
uniact_model.load_state_dict(torch.load("decoder path here"), strict=False)
import numpy as np
import torch
from PIL import Image

from datasets.utils import LLAVAOV_PREPROCESSOR, R18_PREPROCESSOR

proprios = None  # disabled by default; provide it for the ACT decoder (must be normalized)
language_instruction = "your language instruction here"
image_list = ["image-view1 path here", "image-view2 path here"]

# Load all camera views as both raw numpy frames and R18-preprocessed tensors
img_np = []
img_tensor = []
for image in image_list:
    with Image.open(image) as img:
        img = img.convert('RGB')
        img_np.append(np.asarray(img))
        img_tensor.append(R18_PREPROCESSOR(img))
img_np = np.stack(img_np)
img_tensor = torch.stack(img_tensor)

text = [LLAVAOV_PREPROCESSOR.apply_chat_template([
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": language_instruction},
        ]
    }], add_generation_prompt=True)]
video = [np.expand_dims(img_np[0], axis=0)]  # only use the primary view for the universal action extractor!
inputs = LLAVAOV_PREPROCESSOR(videos=video, text=text, return_tensors="pt", padding=True)
inputs = {'inputs': inputs.to('cuda', torch.bfloat16),
          'images': img_tensor.unsqueeze(0).to('cuda', torch.bfloat16)}
if proprios is not None:
    inputs['proprios'] = proprios.to('cuda', torch.bfloat16)

pred_action = uniact_model.infer(
    domain_name="libero-1-rgb",  # check model_config.py for available domain names
    **inputs
)
Note: Please remember to denormalize pred_action; kindly check the action statistics for AIRData and OXE.
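The released decoders use mean-std normalization (see the Model Zoo table below). Here is a minimal denormalization sketch, assuming pred_action is returned as a torch tensor; the per-dimension statistics below are illustrative placeholders, so take the real mean/std values from the AIRData / OXE action statistics for your embodiment.
import numpy as np
import torch

# Hypothetical per-dimension statistics; replace with the real values from the
# AIRData / OXE action statistics for your embodiment and action dimension.
action_mean = np.zeros(7, dtype=np.float32)
action_std = np.ones(7, dtype=np.float32)

def denormalize(pred_action: torch.Tensor) -> np.ndarray:
    """Undo mean-std normalization: a_env = a_pred * std + mean."""
    a = pred_action.detach().float().cpu().numpy()
    return a * action_std + action_mean

executable_action = denormalize(pred_action)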
| Models | Description | ckpt | Action normalization method | Observation type | Avg Succ Rate |
|---|---|---|---|---|---|
| Basemodel | Params for Universal Action Extractor / Vision Backbone / Universal Action Codebook | hf_link | - | Static view | - |
| Libero-MLP-Decoder | Params for MLP decoder on Libero | hf_link | mean-std | Static view | 61.3% |
| Bridge-MLP-Decoder | Params for MLP decoder on Bridge | hf_link | mean-std | Static view | 63.3% |
As we have not assessed the performance of the other decoder heads, we will not release them. If you have any questions about this, please feel free to contact us.
Please follow the guide in the official repo to install the LIBERO simulation.
LIBERO (MLP Head)
You can directly run the following command, replacing YOUR_BASEMODEL_CKPT_PATH and YOUR_HEAD_CKPT_PATH with your base model and head checkpoint paths, e.g., /data/UniAct/basemodel.pt and /data/UniAct/libero_mlp.pt:
python eval/libero/run_uniact_libero_eval.py \
--base_path YOUR_BASEMODEL_CKPT_PATH \
--head_path YOUR_HEAD_CKPT_PATH \
--num_episodes 20
First, install the required packages for training; kindly refer to train/requirements.txt.
- Download the OXE dataset (TFDS files) from the official repo.
- Fill in the file paths in dataset.py:
# set this if you store the files in S3/Ceph
S3Path = ''
# set this if you store the files on your local machine
LOCAL_OXE = ''
Note that we have refined the data, so there may be conflicts with your own data. Please carefully fill in the model settings in model_config.py. Currently supported decoders:
- ACT decoder (refer to ACT_decoder.py for the specific name)
- MLP decoder (refer to MLP_decoder.py for the specific name)
srun -p mozi-S1 -n8 --gres=gpu:8 --ntasks-per-node=8 \
python -u train/slurm_deepspeed_train.py \
--model UniAct_05B_CodeBook_256_V1_Pretrain \
--recipe oxe_magic_soup \
--iters 1000000 \
--start_iters 0 \
--initial_t 2.0 \
--final_t 0.5 \
--batch-size 32 \
--lr 1e-5 \
--grad_accumulation_steps 1 \
--output_dir exp/pretrain \
--save_interval 10000 \
--precision bf16
We recommend using 'AIRData' as the data engine to train UniAct on your own embodiment; this is what we do when training UniAct on Libero! It may require some restructuring of your own dataset. We provide the data processing script for Libero as an example; kindly refer to Libero_hdf52jpg.py (an illustrative conversion sketch also follows the layout below). The image file structure should be as follows:
|-- traj-1
| `-- frame-0.jpg
| `-- frame-1.jpg
| `-- frame-2.jpg
| `-- ...
| `-- frame-41.jpg
|-- traj-2
| `-- frame-0.jpg
| `-- frame-1.jpg
| `-- frame-2.jpg
| `-- ...
| `-- frame-46.jpg
...
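The repo's Libero_hdf52jpg.py is the reference converter for Libero; the sketch below only illustrates producing the traj-N/frame-K.jpg layout above with generic h5py/PIL code, and the HDF5 keys used here ("data", "obs", "agentview_rgb") are assumptions that will differ for other datasets.
import os
import h5py
from PIL import Image

def dump_trajectories(hdf5_path: str, out_root: str) -> None:
    # Write each demo in the HDF5 file as a traj-N directory of frame-K.jpg files.
    with h5py.File(hdf5_path, "r") as f:
        for traj_idx, demo_key in enumerate(f["data"].keys(), start=1):
            frames = f["data"][demo_key]["obs"]["agentview_rgb"][:]  # assumed (T, H, W, 3) uint8
            traj_dir = os.path.join(out_root, f"traj-{traj_idx}")
            os.makedirs(traj_dir, exist_ok=True)
            for t, frame in enumerate(frames):
                Image.fromarray(frame).save(os.path.join(traj_dir, f"frame-{t}.jpg"))

dump_trajectories("path to your hdf5 file", "path to your output dir")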
After you convert the data into JPG format, follow these instructions to adapt it to the codebase:
- Construct a meta file (.pkl) with the following structure (a minimal construction sketch follows this list). Here is an example: Libero.pkl
|--
| `-- path: 'traj-1'
| `-- length: 41
| `-- instruction: 'pick up the red cup'
| `-- action: np.ndarray with shape(41, dim_action)
| `-- proprios: np.ndarray with shape(41, dim_proprio)
|--
| `-- path: 'traj-2'
| `-- length: 46
| `-- instruction: '...'
| `-- action: np.ndarray with shape(46, dim_action)
| `-- proprios: np.ndarray with shape(46, dim_proprio)
|...
- Modify config.py: add your data meta info following the data structures in the file.
- Modify mixture.py.
- Modify model_config.py: choose one decoder head for your embodiment and revise the file accordingly. Currently supported: ACT decoder and MLP decoder, as listed above.
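As referenced in the first step above, here is a minimal sketch of constructing the meta .pkl, assuming it is a list of per-trajectory dicts with the keys shown above; how the actions and proprios are loaded depends entirely on your own dataset, so the dummy arrays and dimensions here are placeholders only.
import pickle
import numpy as np

def build_meta(trajectories):
    # `trajectories` is an iterable of (path, instruction, actions, proprios) tuples,
    # where actions/proprios are np.ndarrays of shape (length, dim).
    meta = []
    for path, instruction, actions, proprios in trajectories:
        meta.append({
            "path": path,                # e.g. 'traj-1', relative to the jpg root
            "length": actions.shape[0],  # number of frames in the trajectory
            "instruction": instruction,  # language instruction for this trajectory
            "action": actions.astype(np.float32),
            "proprios": proprios.astype(np.float32),
        })
    return meta

# Dummy example, only to show the expected structure (41 frames, 7-dim action, 9-dim proprio assumed):
dummy = [("traj-1", "pick up the red cup", np.zeros((41, 7)), np.zeros((41, 9)))]
with open("Libero_meta.pkl", "wb") as f:
    pickle.dump(build_meta(dummy), f)
After these steps, launch adaptation training with the command below.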
srun -p mozi-S1 -n8 --gres=gpu:8 --ntasks-per-node=8 \
python -u slurm_deepspeed_train.py \
--model UniAct_05B_CodeBook_256_V1_For_Fast_Adaptation \
--recipe ACT-Libero \
--iters 1000000 \
--start_iters 0 \
--batch-size 32 \
--lr 1e-4 \
--grad_accumulation_steps 1 \
--output_dir exp/Libero \
--save_interval 10000 \
--port 12345 \
--seed 178945 \
--precision bf16 \
--resume "path of pretrained basemodel"
- Prepare the OXE dataset and AIRData following the instructions above.
- Set UniAct-1.0 as the data recipe in the script.
- Set UniAct_05B_CodeBook_256_V1_Pretrain as the model in the script.
📆 TODO
- Release training codebase.
- Release code for deployment.
- Release model checkpoints.
- Release training guidance.
This work is built upon Hugging Face and LLaVA-OneVision.