
Support multi-GPU training with accelerate #778

Status: Open. mshukor wants to merge 11 commits into base: main.

Conversation

@mshukor (Collaborator) commented Feb 26, 2025

What this does

This PR adds support for training on multiple GPUs using the accelerate library.

How it was tested

Launched training on aloha sim with multiple GPUs and obtained scores similar to single-GPU training.

Examples:
This requires installing accelerate:

pip install accelerate

POLICY=act
ENV=aloha
TASK=AlohaTransferCube-v0
REPO_ID=lerobot/aloha_sim_transfer_cube_human
DATASET_NAME=aloha_sim_transfer_cube_human

GPUS=2
PORT=29502
OFFLINE_STEPS=100000
EVAL_FREQ=1000
BATCH_SIZE=8
EVAL_BATCH_SIZE=10
SAVE_FREQ=10000

# GPUS must be set before it is interpolated into TASK_NAME.
TASK_NAME=lerobot_${DATASET_NAME}_${POLICY}_gpus${GPUS}
TRAIN_DIR=$WORK/logs/lerobot/$TASK_NAME
echo $TRAIN_DIR

export MUJOCO_GL=egl

python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 --main_process_port=$PORT lerobot/scripts/train.py \
    --policy.type=$POLICY \
    --dataset.repo_id=$REPO_ID \
    --env.type=$ENV \
    --env.task=$TASK \
    --output_dir=$TRAIN_DIR \
    --batch_size=$BATCH_SIZE \
    --steps=$OFFLINE_STEPS \
    --eval_freq=$EVAL_FREQ \
    --save_freq=$SAVE_FREQ \
    --eval.batch_size=$EVAL_BATCH_SIZE \
    --eval.n_episodes=$EVAL_BATCH_SIZE
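Note: python -m accelerate.commands.launch is the module form of the accelerate launch CLI entry point; with accelerate installed, running accelerate launch with the same arguments should behave identically.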

@mshukor mshukor marked this pull request as ready for review February 27, 2025 08:28
@mshukor mshukor requested review from aliberts and Cadene and removed request for aliberts February 27, 2025 08:28
@qgallouedec (Member) commented:

@bot /style

Bot reply: Style fixes have been applied. View the workflow run here.

@huggingface huggingface deleted a comment from qgallouedec Feb 27, 2025
@aliberts (Collaborator) left a comment:

First round with comments, nice addition ;)

It'd be nice to add a quick tutorial in examples/ on how to do multi-GPU training.
Also, could you add accelerate as an extra to pyproject.toml?

# pyproject.toml

[project.optional-dependencies]
...
+ accelerate = [
+     "accelerate>=1.4.0",
+ ]
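With that extra in place, the dependency could then be pulled in via pip install "lerobot[accelerate]" (assuming the distribution keeps the name lerobot).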

        optimizer.step()
    else:
        grad_scaler.scale(loss).backward()
        # Unscale the graident of the optimzer's assigned params in-place **prior to gradient clipping**.
Suggested change:
-        # Unscale the graident of the optimzer's assigned params in-place **prior to gradient clipping**.
+        # Unscale the gradient of the optimizer's assigned params in-place **prior to gradient clipping**.
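For context, this comment documents the standard torch.amp recipe of unscaling gradients before clipping. A minimal sketch, where dataloader, policy, optimizer, and grad_clip_norm are assumed to exist and this is not the PR's exact code:

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = policy.forward(batch)["loss"]  # assumed loss interface
    scaler.scale(loss).backward()
    # Unscale in-place so that clip_grad_norm_ sees the true gradient magnitudes.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=grad_clip_norm)
    # scaler.step() skips the optimizer update if gradients contain inf/NaN.
    scaler.step(optimizer)
    scaler.update()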

Comment on lines +109 to +115:

    if accelerator:
        if has_method(accelerator.unwrap_model(policy, keep_fp32_wrapper=True), "update"):
            accelerator.unwrap_model(policy, keep_fp32_wrapper=True).update()
    else:
        if has_method(policy, "update"):
            # To possibly update an internal buffer (for instance an Exponential Moving Average like in TDMPC).
            policy.update()
Nit

Suggested change:
-    if accelerator:
-        if has_method(accelerator.unwrap_model(policy, keep_fp32_wrapper=True), "update"):
-            accelerator.unwrap_model(policy, keep_fp32_wrapper=True).update()
-    else:
-        if has_method(policy, "update"):
-            # To possibly update an internal buffer (for instance an Exponential Moving Average like in TDMPC).
-            policy.update()
+    if accelerator and has_method(accelerator.unwrap_model(policy, keep_fp32_wrapper=True), "update"):
+        accelerator.unwrap_model(policy, keep_fp32_wrapper=True).update()
+    elif has_method(policy, "update"):
+        # To possibly update an internal buffer (for instance an Exponential Moving Average like in TDMPC).
+        policy.update()

Comment on lines +129 to +131:

    if accelerator and not accelerator.is_main_process:
        # Disable logging on non-main processes.
        cfg.wandb.enable = False

I'm not sure this works as intended; are the reported metrics correct?
We should probably integrate accelerate's WandBTracker into our WandBLogger instead.
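For reference, accelerate's built-in tracker wiring looks roughly like this (a sketch of the public accelerate API; the project name and logged values are made up):

from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers(project_name="lerobot", config={"lr": 1e-4})
# accelerate logs only from the main process by default.
accelerator.log({"train/loss": 0.42}, step=100)
accelerator.end_training()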

Comment on lines +296 to +297:

    if accelerator:
        accelerator.wait_for_everyone()
@aliberts commented on Feb 27, 2025:

I think this should be added before checkpointing as well. Also it should probably go inside the if is_saving_step or if is_eval_step statements, otherwise this will be blocking at each step.
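Concretely, the suggested placement would look something like this (a sketch built from the surrounding train.py names; the exact position is the point under discussion):

if cfg.save_checkpoint and is_saving_step:
    if accelerator:
        # Synchronize all processes before the main process writes the checkpoint.
        accelerator.wait_for_everyone()
    save_checkpoint(checkpoint_dir, step, cfg, policy, optimizer, lr_scheduler)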

Comment on lines 281 to +291:

  if cfg.save_checkpoint and is_saving_step:
      logging.info(f"Checkpoint policy after step {step}")
      checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
-     save_checkpoint(checkpoint_dir, step, cfg, policy, optimizer, lr_scheduler)
+     save_checkpoint(
+         checkpoint_dir,
+         step,
+         cfg,
+         policy if not accelerator else accelerator.unwrap_model(policy),
+         optimizer,
+         lr_scheduler,
+     )
I don't know if this is fully equivalent to Accelerator.save_state. In particular, I don't think this still works with the training state (optimizer, scheduler, RNG, etc.).

Pointer: https://huggingface.co/docs/accelerate/v1.4.0/en/usage_guides/checkpoint
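From the linked guide, the accelerate-native way to handle the full training state looks roughly like this (a sketch; policy, optimizer, lr_scheduler, and dataloader are assumed to already exist):

from accelerate import Accelerator

accelerator = Accelerator()
policy, optimizer, lr_scheduler, dataloader = accelerator.prepare(
    policy, optimizer, lr_scheduler, dataloader
)

# save_state captures model, optimizer, scheduler, and RNG states together.
accelerator.save_state("outputs/checkpoints/step_001000")

# load_state restores everything that save_state wrote.
accelerator.load_state("outputs/checkpoints/step_001000")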
