[Project Page] [Paper]
Jinliang Zheng*, Jianxiong Li*, Dongxiu Liu*, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
We introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. Our learned universal actions capture the generic atomic behaviors shared across diverse robots by exploiting their common structural features, and they enable enhanced cross-domain data utilization and cross-embodiment generalization by eliminating the notorious action heterogeneity. Moreover, universal actions can be efficiently translated back into heterogeneous actionable commands by simply adding embodiment-specific details, making fast adaptation to new robots simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14x larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robotic environments, showcasing exceptional cross-embodiment control and adaptation capability and highlighting the crucial benefit of adopting universal actions.
- If you find this repo useful, please kindly cite us:
@misc{zheng2025universalactionsenhancedembodied,
title={Universal Actions for Enhanced Embodied Foundation Models},
author={Jinliang Zheng and Jianxiong Li and Dongxiu Liu and Yinan Zheng and Zhihao Wang and Zhonghong Ou and Yu Liu and Jingjing Liu and Ya-Qin Zhang and Xianyuan Zhan},
year={2025},
eprint={2501.10105},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.10105},
}
- If you have any questions about the code, feel free to raise an issue or contact the authors directly: Jinliang Zheng, Jianxiong Li
Please note that the following guidance is only for model deployment; for training, kindly refer to the Train section below.
git clone https://github.com/2toinf/UniAct-Preview.git
cd UniAct-Preview
pip install -r requirements.txt
First, download the pretrained UniAct checkpoints from the Model Zoo below.
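If the Model Zoo checkpoints are hosted on the Hugging Face Hub (the hf_link entries in the table below), they can also be fetched programmatically. The following is only a minimal sketch: the repo ID and filenames are placeholders, not the actual release names.
from huggingface_hub import hf_hub_download

# Placeholder repo_id and filenames -- substitute the actual hf_link entries
# from the Model Zoo table below.
base_ckpt_path = hf_hub_download(repo_id="your-org/UniAct", filename="basemodel.pt")
head_ckpt_path = hf_hub_download(repo_id="your-org/UniAct", filename="libero_mlp.pt")
# The returned local cache paths can be passed to torch.load in the snippet below.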
import torch
from timm.models import create_model

import models.UniAct_V1  # registers the UniAct models with timm's create_model

uniact_model = create_model("UniAct_05B_CodeBook_256_V1")
### First, load the pretrained universal action extractor / vision backbone / codebook
uniact_model.load_state_dict(torch.load("model path here"), strict=False)
### Then load the embodiment-specific decoder
uniact_model.load_state_dict(torch.load("decoder path here"), strict=False)
import numpy as np
import torch
from PIL import Image

from datasets.utils import LLAVAOV_PREPROCESSOR, R18_PREPROCESSOR

proprios = None  # disabled by default; provide it for the ACT decoder (must be normalized)
language_instruction = "your language instruction here"
image_list = ["image-view1 path here", "image-view2 path here"]

# Load all camera views as both raw numpy frames and R18-preprocessed tensors
img_np = []
img_tensor = []
for image in image_list:
    with Image.open(image) as img:
        img = img.convert('RGB')
        img_np.append(np.asarray(img))
        img_tensor.append(R18_PREPROCESSOR(img))
img_np = np.stack(img_np)
img_tensor = torch.stack(img_tensor)

text = [LLAVAOV_PREPROCESSOR.apply_chat_template([
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": language_instruction},
        ]
    }], add_generation_prompt=True)]
video = [np.expand_dims(img_np[0], axis=0)]  # only use the primary view for the universal action extractor!
inputs = LLAVAOV_PREPROCESSOR(videos=video, text=text, return_tensors="pt", padding=True)
inputs = {'inputs': inputs.to('cuda', torch.bfloat16),
          'images': img_tensor.unsqueeze(0).to('cuda', torch.bfloat16)}
if proprios is not None:
    inputs['proprios'] = proprios.to('cuda', torch.bfloat16)

pred_action = uniact_model.infer(
    domain_name="libero-1-rgb",  # check model_config.py for available domain names
    **inputs
)
Note: Please remember to denormalize pred_action; kindly check the action statistics for AIRData and OXE.
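The released decoders use mean-std normalization (see the Model Zoo table below). Here is a minimal denormalization sketch, assuming pred_action is returned as a torch tensor; the per-dimension statistics below are illustrative placeholders, so take the real mean/std values from the AIRData / OXE action statistics for your embodiment.
import numpy as np
import torch

# Hypothetical per-dimension statistics; replace with the real values from the
# AIRData / OXE action statistics for your embodiment and action dimension.
action_mean = np.zeros(7, dtype=np.float32)
action_std = np.ones(7, dtype=np.float32)

def denormalize(pred_action: torch.Tensor) -> np.ndarray:
    """Undo mean-std normalization: a_env = a_pred * std + mean."""
    a = pred_action.detach().float().cpu().numpy()
    return a * action_std + action_mean

executable_action = denormalize(pred_action)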
| Models | Description | ckpt | Action normalization method | Observation type | Avg Succ Rate |
|---|---|---|---|---|---|
| Basemodel | Params for Universal Action Extractor / Vision Backbone / Universal Action Codebook | hf_link | - | Static view | - |
| Libero-MLP-Decoder | Params for MLP decoder on Libero | hf_link | mean-std | Static view | 61.3% |
| Bridge-MLP-Decoder | Params for MLP decoder on Bridge | hf_link | mean-std | Static view | 63.3% |
As we have not assessed the performance of the other decoder heads, we will not release them. If you have any questions about this, please feel free to contact us.
Please follow the guide in the official repo to install the LIBERO simulation.
LIBERO (MLP Head)
You can directly run the following command, replacing YOUR_BASEMODEL_CKPT_PATH and YOUR_HEAD_CKPT_PATH with your base model and head checkpoint paths, e.g., /data/UniAct/basemodel.pt and /data/UniAct/libero_mlp.pt:
python eval/libero/run_uniact_libero_eval.py \
--base_path YOUR_BASEMODEL_CKPT_PATH \
--head_path YOUR_HEAD_CKPT_PATH \
--num_episodes 20
First, install the required packages for training; kindly refer to train/requirements.txt.
- Download the OXE dataset (TFDS files) from the official repo.
- Fill in the file paths in dataset.py:
# set this if you store the files in S3/Ceph
S3Path = ''
# set this if you store the files on your local machine
LOCAL_OXE = ''
Note that we have refined the data, so there may be conflicts with your own data. Please carefully fill in the model settings in model_config.py. Currently supported decoders:
- ACT decoder (refer to ACT_decoder.py for the specific name)
- MLP decoder (refer to MLP_decoder.py for the specific name)
srun -p mozi-S1 -n8 --gres=gpu:8 --ntasks-per-node=8 \
python -u train/slurm_deepspeed_train.py \
--model UniAct_05B_CodeBook_256_V1_Pretrain \
--recipe oxe_magic_soup \
--iters 1000000 \
--start_iters 0 \
--initial_t 2.0 \
--final_t 0.5 \
--batch-size 32 \
--lr 1e-5 \
--grad_accumulation_steps 1 \
--output_dir exp/pretrain \
--save_interval 10000 \
--precision bf16
We recommend using 'AIRData' as the data engine to train UniAct on your own embodiment; this is what we do when training UniAct on Libero! It may require some restructuring of your own dataset. We provide the data processing script for Libero as an example; kindly refer to Libero_hdf52jpg.py (an illustrative conversion sketch also follows the layout below). The image file structure should be as follows:
|-- traj-1
| `-- frame-0.jpg
| `-- frame-1.jpg
| `-- frame-2.jpg
| `-- ...
| `-- frame-41.jpg
|-- traj-2
| `-- frame-0.jpg
| `-- frame-1.jpg
| `-- frame-2.jpg
| `-- ...
| `-- frame-46.jpg
...
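The repo's Libero_hdf52jpg.py is the reference converter for Libero; the sketch below only illustrates producing the traj-N/frame-K.jpg layout above with generic h5py/PIL code, and the HDF5 keys used here ("data", "obs", "agentview_rgb") are assumptions that will differ for other datasets.
import os
import h5py
from PIL import Image

def dump_trajectories(hdf5_path: str, out_root: str) -> None:
    # Write each demo in the HDF5 file as a traj-N directory of frame-K.jpg files.
    with h5py.File(hdf5_path, "r") as f:
        for traj_idx, demo_key in enumerate(f["data"].keys(), start=1):
            frames = f["data"][demo_key]["obs"]["agentview_rgb"][:]  # assumed (T, H, W, 3) uint8
            traj_dir = os.path.join(out_root, f"traj-{traj_idx}")
            os.makedirs(traj_dir, exist_ok=True)
            for t, frame in enumerate(frames):
                Image.fromarray(frame).save(os.path.join(traj_dir, f"frame-{t}.jpg"))

dump_trajectories("path to your hdf5 file", "path to your output dir")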
After you convert the data into JPG format, follow these instructions to adapt it to the codebase:
- Construct a meta file (.pkl) with the following structure (a minimal construction sketch follows this list). Here is an example: Libero.pkl
|--
| `-- path: 'traj-1'
| `-- length: 41
| `-- instruction: 'pick up the red cup'
| `-- action: np.ndarray with shape(41, dim_action)
| `-- proprios: np.ndarray with shape(41, dim_proprio)
|--
| `-- path: 'traj-2'
| `-- length: 46
| `-- instruction: '...'
| `-- action: np.ndarray with shape(46, dim_action)
| `-- proprios: np.ndarray with shape(46, dim_proprio)
|...
- Modify config.py: add your data meta info following the data structures in the file.
- Modify mixture.py.
- Modify model_config.py: choose one decoder head for your embodiment and revise the file accordingly. Currently supported: ACT decoder and MLP decoder, as listed above.
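As referenced in the first step above, here is a minimal sketch of constructing the meta .pkl, assuming it is a list of per-trajectory dicts with the keys shown above; how the actions and proprios are loaded depends entirely on your own dataset, so the dummy arrays and dimensions here are placeholders only.
import pickle
import numpy as np

def build_meta(trajectories):
    # `trajectories` is an iterable of (path, instruction, actions, proprios) tuples,
    # where actions/proprios are np.ndarrays of shape (length, dim).
    meta = []
    for path, instruction, actions, proprios in trajectories:
        meta.append({
            "path": path,                # e.g. 'traj-1', relative to the jpg root
            "length": actions.shape[0],  # number of frames in the trajectory
            "instruction": instruction,  # language instruction for this trajectory
            "action": actions.astype(np.float32),
            "proprios": proprios.astype(np.float32),
        })
    return meta

# Dummy example, only to show the expected structure (41 frames, 7-dim action, 9-dim proprio assumed):
dummy = [("traj-1", "pick up the red cup", np.zeros((41, 7)), np.zeros((41, 9)))]
with open("Libero_meta.pkl", "wb") as f:
    pickle.dump(build_meta(dummy), f)
After these steps, launch adaptation training with the command below.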
srun -p mozi-S1 -n8 --gres=gpu:8 --ntasks-per-node=8 \
python -u slurm_deepspeed_train.py \
--model UniAct_05B_CodeBook_256_V1_For_Fast_Adaptation \
--recipe ACT-Libero \
--iters 1000000 \
--start_iters 0 \
--batch-size 32 \
--lr 1e-4 \
--grad_accumulation_steps 1 \
--output_dir exp/Libero \
--save_interval 10000 \
--port 12345 \
--seed 178945 \
--precision bf16 \
--resume "path of pretrained basemodel"
- Prepare the OXE dataset and AIRData following the instructions above.
- Set UniAct-1.0 as the data recipe in the script.
- Set UniAct_05B_CodeBook_256_V1_Pretrain as the model in the script.
📆 TODO
- Release training codebase.
- Release code for deployment.
- Release model checkpoints.
- Release training guidance.
This work is built upon Hugging Face and LLaVA-OneVision.