Port HIL-SERL #565

Open · wants to merge 26 commits into main from user/michel-aractingi/2024-11-27-port-hil-serl

Conversation

@michel-aractingi (Collaborator) commented Dec 9, 2024

What this does

Adds HIL-SERL to the policies of LeRobot in lerobot/common/policies/hilserl/.

What this PR contains so far

  1. The ability to assign binary rewards while recording datasets, in the record function of lerobot/scripts/control_robot.py (a minimal illustrative sketch follows this list).
  2. Reward classifier:
    • Code to define and train a reward classifier model to detect successful tasks, in lerobot/common/policies/hilserl/classifier.
    • Script to train the reward classifier: lerobot/scripts/train_hilserl_classifier.py.
  3. Rollouts on the real robot and human intervention: in lerobot/scripts/eval_on_robot.py we added the ability to run policy rollouts on the real robot. Moreover, if you have a leader arm, you can stop the policy actions being rolled out and take over.
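For illustration, here is a minimal sketch of what per-frame binary reward annotation could look like (item 1 above). The "next.reward" key and the success-index convention are assumptions made for this sketch, not necessarily what the record function in control_robot.py does.

```python
# Hypothetical sketch: tag each recorded frame with a binary success reward.
# The "next.reward" key and the success-index convention are assumptions for
# illustration only.
from typing import Any


def annotate_episode(frames: list[dict[str, Any]], success_step: int | None) -> list[dict[str, Any]]:
    """Mark every frame from success_step onward with reward 1, everything before with 0."""
    for idx, frame in enumerate(frames):
        success = success_step is not None and idx >= success_step
        frame["next.reward"] = 1 if success else 0
    return frames
```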

How to test:

  1. Annotate episodes with reward during recordings:
```bash
python lerobot/scripts/control_robot.py record \
    --robot-path lerobot/configs/robot/moss.yaml \
    --fps 30 \
    --root data \
    --repo-id ${HF_USER}/moss_test \
    --tags moss tutorial \
    --warmup-time-s 5 \
    --episode-time-s 40 \
    --reset-time-s 10 \
    --num-episodes 2 \
    --push-to-hub 1 \
    --assign_rewards 1
```
  2. Train the reward classifier (a minimal classifier sketch follows this list):
```bash
python lerobot/scripts/train_hilserl_classifier.py --config-name policy/reward_classifier.yaml
```
  3. Run an example of evaluation on the robot and test human interventions (a minimal intervention sketch follows this list):
```bash
python lerobot/scripts/eval_on_robot.py --robot-path lerobot/configs/robot/koch.yaml
```
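For illustration, here is a minimal sketch of what a binary success (reward) classifier could look like: a pretrained vision backbone with a small head trained with binary cross-entropy. The resnet18 backbone, head sizes, and training loop below are assumptions for this sketch; the actual model lives in lerobot/common/policies/hilserl/classifier.

```python
# Hypothetical sketch of a binary reward classifier: a pretrained vision
# backbone followed by a small MLP head trained with binary cross-entropy.
# Backbone choice, head sizes, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torchvision


class RewardClassifier(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()  # keep the 512-dim pooled features
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one logit: P(task success)
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(images)).squeeze(-1)


# One training step on a dummy batch: images (B, 3, H, W), labels in {0, 1}.
model = RewardClassifier()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,)).float()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```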

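And a minimal sketch of the take-over mechanism during evaluation: the policy acts by default, and if the leader arm moves beyond a small threshold its positions are sent instead. The robot/policy method names and the threshold-based detection are assumptions for this sketch; the real logic is in lerobot/scripts/eval_on_robot.py.

```python
# Hypothetical sketch of the take-over logic: run the policy, but if the leader
# arm moves beyond a small threshold, use the leader's joint positions as the
# action instead. The robot/policy interfaces below are assumptions.
import torch


def rollout_with_intervention(robot, policy, num_steps: int, takeover_threshold: float = 0.05):
    prev_leader = robot.read_leader_positions()  # hypothetical helper
    for _ in range(num_steps):
        observation = robot.capture_observation()
        policy_action = policy.select_action(observation)

        leader = robot.read_leader_positions()
        human_is_moving = torch.linalg.norm(leader - prev_leader) > takeover_threshold
        prev_leader = leader

        # Human action overrides the policy action whenever the leader arm moves.
        action = leader if human_is_moving else policy_action
        robot.send_action(action)
```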
References:

  1. HIL-SERL reference implementation: https://github.com/rail-berkeley/hil-serl/tree/main
  2. Reward assignment PR: Reward assignment during recording (#518) and reward classifier PR: Reward classifier and training (#528), by @ChorntonYoel
  3. Human interventions: Add human intervention mechanism and eval_robot script to evaluate policy on the robot (#541)

@Cadene requested review from aliberts and Cadene on December 9, 2024 20:48
@michel-aractingi force-pushed the user/michel-aractingi/2024-11-27-port-hil-serl branch from 3d7e74d to def42ff on December 17, 2024 15:22
KeWang1017 and others added 2 commits December 17, 2024 17:58
…ing logic

- Added `num_subsample_critics`, `critic_target_update_weight`, and `utd_ratio` to SACConfig.
- Implemented target entropy calculation in SACPolicy if not provided.
- Introduced subsampling of critics to prevent overfitting during updates.
- Updated temperature loss calculation to use the new target entropy.
- Added comments for future UTD update implementation.

These changes improve the flexibility and performance of the SAC implementation.
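For illustration, a minimal sketch of two of the mechanisms mentioned above, under the assumption that num_subsample_critics selects a random subset of the target critic ensemble and that the default target entropy is derived from the action dimension; the exact formulas used in this PR are not stated here.

```python
# Hypothetical sketch of two mechanisms named in the commit message:
#  * a default target entropy derived from the action dimension when none is given
#  * subsampling a random subset of target critics (REDQ-style) for the target Q
import random
import torch


def default_target_entropy(action_dim: int) -> float:
    # A common SAC heuristic; whether the PR uses -dim or -dim/2 is an assumption here.
    return -float(action_dim) / 2


def subsampled_target_q(target_critics, observations, actions, num_subsample_critics: int) -> torch.Tensor:
    """Min Q over a random subset of the target critic ensemble."""
    chosen = random.sample(list(target_critics), k=num_subsample_critics)
    q_values = torch.stack([critic(observations, actions) for critic in chosen], dim=0)
    return q_values.min(dim=0).values
```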
@mydhui commented Dec 18, 2024

@michel-aractingi Hi, can you elaborate more on how to test HIL-SERL?

During step 1, for instance on a cube-grasping task, should we record failure samples on purpose, or is a reward transition from 0 to 1 after a successful grasp enough?

What is the expected behavior in step 3 ("eval on the robot and test human interventions")? And would this algorithm perform better than ACT?

Thanks.

helper2424 and others added 10 commits December 23, 2024 10:43
…n handling

- Updated action selection to use distribution sampling and log probabilities for better stochastic behavior.
- Enhanced standard deviation clamping to prevent extreme values, ensuring stability in policy outputs.
- Cleaned up code by removing unnecessary comments and improving readability.

These changes aim to refine the SAC implementation, enhancing its robustness and performance during training and inference.
- Updated standard deviation parameterization in SACConfig to 'softplus' with defined min and max values for improved stability.
- Modified action sampling in SACPolicy to use reparameterized sampling, ensuring better gradient flow and log probability calculations.
- Cleaned up log probability calculations in TanhMultivariateNormalDiag for clarity and efficiency.
- Increased evaluation frequency in YAML configuration to 50000 for more efficient training cycles.

These changes aim to enhance the robustness and performance of the SAC implementation during training and inference.
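For illustration, a minimal sketch of the sampling scheme described in this commit: a softplus-parameterized standard deviation clamped to a min/max range, a reparameterized sample, a tanh squash, and the corresponding log-probability correction. The bounds and function names are assumptions.

```python
# Hypothetical sketch: softplus-parameterized std (clamped to [std_min, std_max]),
# reparameterized sampling, tanh squashing, and the log-prob correction for the
# change of variables. Bounds and names are illustrative assumptions.
import math

import torch
import torch.nn.functional as F


def sample_action(mean: torch.Tensor, std_param: torch.Tensor,
                  std_min: float = 1e-3, std_max: float = 5.0):
    std = F.softplus(std_param).clamp(std_min, std_max)
    dist = torch.distributions.Normal(mean, std)

    pre_tanh = dist.rsample()      # reparameterized sample: gradients flow through
    action = torch.tanh(pre_tanh)  # squash into (-1, 1)

    # log pi(a) = log N(u) - log(1 - tanh(u)^2), using the numerically stable form
    # log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    correction = 2.0 * (math.log(2.0) - pre_tanh - F.softplus(-2.0 * pre_tanh))
    log_prob = (dist.log_prob(pre_tanh) - correction).sum(dim=-1)
    return action, log_prob
```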
…d stability

- Updated SACConfig to replace standard deviation parameterization with log_std_min and log_std_max for better control over action distributions.
- Modified SACPolicy to streamline action selection and log probability calculations, enhancing stochastic behavior.
- Removed deprecated TanhMultivariateNormalDiag class to simplify the codebase and improve maintainability.

These changes aim to enhance the robustness and performance of the SAC implementation during training and inference.
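For contrast with the softplus scheme above, a minimal sketch of the clamped log_std parameterization this commit switches to; the bounds are illustrative assumptions.

```python
# Hypothetical sketch: parameterize the policy std via a clamped log_std head
# (log_std_min / log_std_max) instead of a softplus std.
import torch


def build_policy_distribution(mean: torch.Tensor, log_std: torch.Tensor,
                              log_std_min: float = -5.0, log_std_max: float = 2.0) -> torch.distributions.Normal:
    log_std = torch.clamp(log_std, log_std_min, log_std_max)
    return torch.distributions.Normal(mean, log_std.exp())
```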
@michel-aractingi force-pushed the user/michel-aractingi/2024-11-27-port-hil-serl branch from bd8d252 to 35de91e on December 30, 2024 13:47