
Support accelerate multi-GPU training #558

Draft
wants to merge 29 commits into
base: user/rcadene/2024_10_07_vla
from



@mshukor mshukor commented Dec 8, 2024

What this does

Based on this PR. It includes:

  • The ability to keep training without accelerate
  • Updated to the latest main
  • Some minor fixes
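
To illustrate the first point, here is a minimal sketch of how a training loop can stay usable with or without accelerate. It is not the code from this PR: the helper names `make_backward_fn` and `is_main_process` are hypothetical, and the only accelerate features assumed are `Accelerator.backward` and the `Accelerator.is_main_process` property.

```python
from typing import Any, Callable, Optional


def make_backward_fn(accelerator: Optional[Any] = None) -> Callable[[Any], None]:
    """Return a backward function that works with or without accelerate.

    `accelerator` is expected to be an `accelerate.Accelerator`, or None
    when the script is launched with plain `python` (hypothetical helper).
    """
    if accelerator is not None:
        # Accelerator.backward takes care of loss scaling under mixed precision.
        return accelerator.backward
    # Plain PyTorch path: call backward on the loss tensor directly.
    return lambda loss: loss.backward()


def is_main_process(accelerator: Optional[Any] = None) -> bool:
    """Tell whether this process should log, evaluate, or save checkpoints."""
    # Without accelerate there is a single process, so it is always the main one.
    return True if accelerator is None else accelerator.is_main_process
```

With this shape, the rest of the loop calls `backward_fn(loss)` and guards logging with `is_main_process(accelerator)`, so the same script runs under `accelerate launch` or as a single process.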

Note: this branch still needs to be merged with the vla branch before this PR can be merged.

How it was tested

ENV=aloha
ENV_TASK=AlohaTransferCube-v0
dataset_repo_id=lerobot/aloha_sim_transfer_cube_human
policy=act
LR=1e-5
LR_SCHEDULER=
USE_AMP=false
ASYNC_ENV=false

GPUS=2
EVAL_FREQ=10000 #51000 #10000 51000
OFFLINE_STEPS=100000 #25000 17000 12500 50000
TRAIN_BATCH_SIZE=4 # per-GPU batch size (global batch size / number of GPUs)
EVAL_BATCH_SIZE=50

TASK_NAME=lerobot_${ENV}_transfer_cube_${policy}_2gpus

python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 lerobot/scripts/train.py \
 hydra.job.name=base_distributed_aloha_transfer_cube \
 hydra.run.dir=/data/mshukor/logs/lerobot/${TASK_NAME} \
 dataset_repo_id=$dataset_repo_id \
 policy=$policy \
 env=$ENV env.task=$ENV_TASK \
 training.offline_steps=$OFFLINE_STEPS training.batch_size=$TRAIN_BATCH_SIZE \
 training.eval_freq=$EVAL_FREQ eval.n_episodes=50 eval.use_async_envs=$ASYNC_ENV eval.batch_size=$EVAL_BATCH_SIZE \
 training.lr_scheduler=$LR_SCHEDULER training.lr=$LR \
 wandb.enable=true 
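
For reference, the batch-size arithmetic implied by the script above can be sketched as follows. This assumes that `training.batch_size` is applied per process, which matches the comment on `TRAIN_BATCH_SIZE` and accelerate's default data-loader behaviour; the function name is hypothetical.

```python
def global_batch_size(per_process_batch_size: int, num_processes: int) -> int:
    """Effective batch size per optimizer step across all processes."""
    if per_process_batch_size <= 0 or num_processes <= 0:
        raise ValueError("batch size and process count must be positive")
    return per_process_batch_size * num_processes


# Values taken from the test script: TRAIN_BATCH_SIZE=4, GPUS=2.
print(global_batch_size(4, 2))  # 8
```

When comparing against a single-GPU baseline, the baseline would need a batch size of 8 to see the same number of samples per step.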

Cadene and others added 29 commits October 3, 2024 17:05
Co-authored-by: jess-moss
Co-authored-by: Simon Alibert