
[HIL-SERL] Migrate threading to multiprocessing #759

Open · wants to merge 16 commits into base: user/adil-zouitine/2025-1-7-port-hil-serl-new from user/helper2424/add_mp

Conversation

helper2424 (Contributor)

What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title | Label
Fixes #[issue] | 🐛 Bug
Adds new dataset | 🗃️ Dataset
Optimizes something | ⚡️ Performance

How it was tested

Explain/show how you tested your changes.

Examples:

  • Added test_something in tests/test_stuff.py.
  • Added new_feature and checked that training converges with policy X on dataset/environment Y.
  • Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

pytest -sx tests/test_stuff.py::test_something
python lerobot/scripts/train.py --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

step = 0

    logging.debug(
        f"{log_prefix} Queue updated, {queue.qsize()} items in the queue"
    )
@ChorntonYoel (Feb 24, 2025):
qsize() breaks on macOS unless the queues come from a manager, so I think either this part should change, or we should just create the queues like this in the actor_server:

    parameters_queue = manager.Queue(maxsize=1)
    transitions_queue = manager.Queue(maxsize=5000)
    interactions_queue = manager.Queue(maxsize=5000)
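
For context, a minimal sketch of the manager-based setup suggested here (queue sizes taken from the snippet above; the variable names are illustrative):

    from multiprocessing import Manager

    # A Manager-backed Queue proxies qsize() through the manager process,
    # so it also works on macOS, where a plain multiprocessing.Queue.qsize()
    # raises NotImplementedError.
    manager = Manager()
    parameters_queue = manager.Queue(maxsize=1)
    transitions_queue = manager.Queue(maxsize=5000)
    interactions_queue = manager.Queue(maxsize=5000)

    print(transitions_queue.qsize())  # safe on macOS with manager queues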

helper2424 (Contributor, Author):

Got it, fixed

helper2424 force-pushed the user/helper2424/add_mp branch from dd4b3e0 to 264a48e on March 1, 2025 10:15
helper2424 force-pushed the user/helper2424/add_mp branch from c3acad7 to 31af309 on March 1, 2025 14:30
Comment on lines +165 to +168
"/".join(
[".."] * (len(path2.parts) - len(common_parts))
+ list(path1.parts[len(common_parts) :])
)
Member:

What is the purpose of this function? Why do we need it in init_hydra_config 😄

helper2424 (Contributor, Author):

It's a good question. All these fixes came from the linter, not from me.

I have added the ability to use files in init_logging only.
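
(For reference, a hedged sketch of what the quoted expression computes, using made-up example paths: the relative path from path2 up to the common prefix and down to path1. os.path.relpath gives the same result.)

    import os
    from pathlib import Path

    # Illustrative paths only, not taken from the PR.
    path1 = Path("/repo/configs/policy/sac.yaml")
    path2 = Path("/repo/outputs/run/checkpoint")
    common_parts = ("/", "repo")

    rel = "/".join(
        [".."] * (len(path2.parts) - len(common_parts))
        + list(path1.parts[len(common_parts):])
    )
    print(rel)                                  # ../../../configs/policy/sac.yaml
    print(os.path.relpath(path1, start=path2))  # same result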

@@ -110,4 +108,4 @@ policy:
actor_learner_config:
learner_host: "127.0.0.1"
learner_port: 50051
policy_parameters_push_frequency: 15
policy_parameters_push_frequency: 1
@AdilZouitine (Member, Mar 3, 2025):

Isn't this too frequent? Can the network handle it? It also explains why you converge in 20k steps, since your updates are more frequent.

Member:

learner_port=8083 - why this change? The default port for gRPC is 50051 by convention.

helper2424 (Contributor, Author):

I have returned the port to 50051. I used another one because vast.ai uses port forwarding, so the internal port would be 8088 but the external one for the machine would be 50051. I made it for testing, but it has already been removed.

helper2424 (Contributor, Author):

Regarding the frequency - yeah, I have played with it. The situation looks like the following:

  • we have a learner, which updates its weights hundreds of times per second
  • we have an actor that collects data on its side

They both depend on each other: as soon as we have better weights we should deliver them to the actor, it collects better data, and we deliver that back to the learner so it can train on the new data and produce better weights.

Delivering weights every 15 seconds is very slow for such an architecture. The learner has weight updates every second, so we can deliver them to the actor at that rate. If the learning process were slower - e.g. with bigger NNs - then we couldn't deliver weights so often, but convergence would be slower too.
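
To make that concrete, a minimal sketch of a learner-side push loop driven by policy_parameters_push_frequency (the function and variable names here are illustrative, not the PR's actual helpers):

    import time

    def push_parameters_loop(policy, parameters_queue, push_frequency_s, shutdown_event):
        # Periodically snapshot the learner's weights and hand them to the actor.
        # parameters_queue is assumed to be a multiprocessing (or manager) queue.
        while not shutdown_event.is_set():
            # CPU copies can be pickled and sent across process boundaries.
            state_dict = {k: v.detach().cpu() for k, v in policy.state_dict().items()}
            if parameters_queue.full():
                # Drop the stale snapshot so the actor always sees the freshest weights.
                try:
                    parameters_queue.get_nowait()
                except Exception:
                    pass
            parameters_queue.put(state_dict)
            time.sleep(push_frequency_s)  # e.g. 1 s instead of 15 s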

@@ -24,6 +24,7 @@

Member:

We have to rebase to integrate the latest replay buffer modification; it helps a lot with performance.

helper2424 (Contributor, Author):

Yeah, it should already be rebased on the last commit from your branch.

transition (Optional): Transition data to be sent to the learner.
interaction_message (Optional): Interaction message providing additional statistics for logging.
"""
from lerobot.scripts.server.utils import get_last_item_from_queue
@AdilZouitine (Member, Mar 3, 2025):

Need to remove this file (debug.py) before the merge.

helper2424 (Contributor, Author):

Done



def transitions_stream(shutdown_event: Event, message_queue: queue.Queue):
def transitions_stream(
shutdown_event: Event, transitions_queue: Queue
Member:

I got this info from Pylance for Event: "Variable not allowed in type expression". Maybe we should mute Pylance for this line.
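
A small sketch of two ways to handle it: multiprocessing.Event is a factory function rather than a class (the concrete class lives in multiprocessing.synchronize), which is why Pylance complains. The function names below are illustrative:

    from multiprocessing import Event                     # a factory function, not a class
    from multiprocessing.synchronize import Event as EventClass

    # Option 1: keep the annotation and mute the checker on this line.
    def wait_for_shutdown(shutdown_event: Event):  # type: ignore
        shutdown_event.wait()

    # Option 2: annotate with the concrete class, which Pylance accepts.
    def wait_for_shutdown_typed(shutdown_event: EventClass):
        shutdown_event.wait()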


yield response
def interactions_stream(
shutdown_event: Event, interactions_queue: Queue
Member:

ditto, pylance

shutdown_event: Event,
message_queue: queue.Queue,
):
) -> hilserl_pb2.Empty:
"""
Streams data from the actor to the learner.

Member:

The docstring is not right; this function sends transitions only.

cfg: DictConfig,
robot: Robot,
reward_classifier: nn.Module,
shutdown_event: Event,
Member:

ditto, Pylance

@@ -458,39 +477,70 @@ def log_policy_frequency_issue(
)


def establish_learner_connection(stub, shutdown_event: Event, attempts=30):
Member:

ditto Pylance


logging.info("[ACTOR] Connection with Learner established")

parameters_queue = Queue()
Member:

Why not set the size of the Queue to one?

Member:

Some issue with macOS, like what @ChorntonYoel pointed out?

helper2424 (Contributor, Author):

I have explained it here - #759 (comment).

By default, if the queue has a maximum size, then when you try to put anything into a full queue the thread (or process) will block until somebody extracts an item on the other side with .get. So a bounded queue would throttle whichever side is putting data into it.

It's better here to allow putting data without any locks and just drain the queue on the other side until the last element is extracted.
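
A hedged sketch of that drain pattern (the PR's actual helper is get_last_item_from_queue; its exact signature may differ):

    import queue

    def drain_to_last_item(q):
        # Block until at least one item arrives, then keep swapping it for
        # anything fresher until the queue is empty again.
        item = q.get()
        try:
            while True:
                item = q.get_nowait()
        except queue.Empty:
            return item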


transition_queue = queue.Queue()
interaction_message_queue = queue.Queue()
import torch.multiprocessing as mp
Member:

I can feel a lot of work behind this line hahah
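
For context, a minimal before/after sketch of the switch this diff makes (the surrounding gRPC plumbing is omitted, and the variable names follow the diff above):

    # Before: standard-library queues, shareable only between threads.
    # import queue
    # transition_queue = queue.Queue()
    # interaction_message_queue = queue.Queue()

    # After: torch.multiprocessing queues, which cross process boundaries
    # and move tensors through shared memory instead of copying them.
    import torch.multiprocessing as mp

    transition_queue = mp.Queue()
    interaction_message_queue = mp.Queue()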

@@ -345,33 +434,39 @@ def add_actor_information_and_train(
interaction_step_shift = (
resume_interaction_step if resume_interaction_step is not None else 0
)
saved_data = False
Member:

Nice catch, this variable was useless.

@AdilZouitine (Member):

Overall, it looks good to me. I just have a minor remark or question.

I will run it to benchmark both now and after the rebase; the goal is to see an improvement in optimization speed. 😄

Thank you, Eugene! Great work!

helper2424 force-pushed the user/helper2424/add_mp branch from abf3769 to 9f0c1b5 on March 3, 2025 14:11