Add config sharing from Lighthouse with UI support (#130) #202

WarrenZhu050413 (Contributor)
Implemented GetConfig RPC for the Lighthouse

Summary

Implemented and tested the GetConfig RPC functionality for the Lighthouse service, enabling configuration broadcasting from the Lighthouse to client applications. Along with this, UI support was added.

Tests

  • Added the GetConfig RPC to the Lighthouse
  • Added Python bindings for the RPC
  • Added test_get_config_rpc() in torchft/lighthouse_test.py to verify that the GetConfig RPC endpoint returns configuration data

Current Implementation Details

The GetConfig RPC currently:

  • Returns configuration data as a HashMap<String, String> (Python dict[str, str])
  • Supports a configurable timeout (default: 5 seconds)
  • Loads configuration from a JSON file specified via the --lighthouse-config parameter
  • Returns an empty configuration when no config file is provided
  • Handles invalid JSON by returning an empty configuration

Discussion: Incorporating config sharing into the entire torchFT workflow

I am personally really excited about incorporating config sharing into torchFT, as I believe it enables a lot of interesting possibilities.

With config sharing, the lighthouse could reconfigure the training process at runtime based on user input or information it already has, acting as an orchestrator for the training process.

Some use cases that I can think of include:
a) Dynamic Training Algorithm Configuration

  • Adjust DiLoCo synchronization frequency based on network conditions/loss curve convergence (this could be in the examples to showcase the config reconfiguration feature)
  • Change the optimizer that the training process uses at runtime
  • Modify learning rates based on training progress (rather than relying on an LR scheduler)
  • Enable interactive batch size adjustments
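To make the dynamic-configuration idea concrete, here is a minimal sketch of how a worker might apply a config snapshot each iteration. The keys (`sync_every`, `lr`, `batch_size`) and the function itself are hypothetical, not part of torchft.

```python
def apply_training_config(config: dict[str, str], state: dict) -> dict:
    """Hypothetical: apply a Lighthouse config snapshot to training state.

    All values arrive as strings (dict[str, str]), so each known key is
    parsed into its native type; unknown keys are ignored.
    """
    state = dict(state)  # leave the caller's state untouched
    if "sync_every" in config:   # DiLoCo synchronization frequency
        state["sync_every"] = int(config["sync_every"])
    if "lr" in config:           # runtime learning-rate override
        state["lr"] = float(config["lr"])
    if "batch_size" in config:   # interactive batch size adjustment
        state["batch_size"] = int(config["batch_size"])
    return state
```

Calling this at the top of each training step would let the lighthouse steer the hyperparameters listed above without restarting the job.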

b) Advanced Fault Tolerance

  • Coordinate parallelism structure changes upon worker failures
  • Reroute messages in MoE models when nodes fail
  • Rerouting to healthy workers in pipeline parallelism upon a worker failure
  • Have a time series model that predicts worker failures and preemptively reroute work to a healthy worker when a failure is predicted

With these in mind, I want feedback on two questions:

  1. Whether the use cases that I envision above belong in torchFT, or in some other newer, experimental framework
  2. Taking into account these factors, what is the best way for the Manager to use the config sharing functionality?

Approach 1: Request-Response via Manager

The most torchFT-idiomatic way to use the config sharing functionality is a two-stage RPC: Worker -> Manager -> Lighthouse -> Manager -> Worker. At the beginning of each training iteration, the ManagerClient can issue a GetConfig RPC to the Manager, which forwards it to the Lighthouse and then returns the config to the worker.
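The two-stage flow above can be sketched with plain Python objects standing in for the three services. The class and method names are hypothetical; in the real system these hops would be gRPC calls.

```python
class Lighthouse:
    """Stands in for the Lighthouse service holding the shared config."""

    def __init__(self, config: dict[str, str]):
        self._config = config

    def get_config(self) -> dict[str, str]:
        return dict(self._config)


class Manager:
    """Proxies GetConfig so workers never talk to the Lighthouse directly."""

    def __init__(self, lighthouse: Lighthouse):
        self._lighthouse = lighthouse

    def get_config(self) -> dict[str, str]:
        # Stage 2: Manager -> Lighthouse -> Manager.
        return self._lighthouse.get_config()


class Worker:
    def __init__(self, manager: Manager):
        self._manager = manager

    def step(self) -> dict[str, str]:
        # Stage 1: at the start of each iteration, Worker -> Manager.
        return self._manager.get_config()
```

Every worker under the same Manager sees the same snapshot, which matches the observation below that all clients under a ManagerServer naturally receive the same configuration.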

However, this approach has two limitations:

  1. Because all communication between the worker and the Lighthouse goes through a ManagerServer, the Lighthouse does not have direct access to the Worker. In the configuration context, it is natural for every client under a ManagerServer to receive the same configuration.
  2. Relying on request-response also decreases the speed at which configs are broadcast to the workers, since each worker has to request and wait for a response from the Lighthouse (this could be done through a prior asynchronous call to avoid blocking on the config, but the received config may then be stale by the time it is applied).
    1. A streaming approach could be adopted here. However, because the streaming code is itself relatively complex, managing a two-stage streaming handoff adds significant complexity and makes the code hard to maintain, as much of the logic has to be written twice.

Approach 2: Direct Streaming from the Lighthouse to the workers

Because of the above considerations, I am considering extending the experimental FailureStream implemented in PR 196 into a generic LighthouseStream that can handle multiple message types.

This addresses both limitations of the request-response approach.

However, it has its own problems:

a) It breaks the Worker -> Manager -> Lighthouse structure that torchFT has maintained so far
b) It potentially decreases the scalability of torchFT, since the Lighthouse must maintain connections to all the workers
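The "multiple message types over one stream" idea can be illustrated with a small dispatcher. Everything here is hypothetical: messages are modeled as `(kind, payload)` tuples on a queue standing in for the gRPC stream, and the message kinds (`"failure"`, `"config"`) are illustrative.

```python
import queue


def dispatch_lighthouse_stream(q: "queue.Queue", handlers: dict, sentinel=None) -> None:
    """Hypothetical generic LighthouseStream consumer.

    Pulls (kind, payload) messages from the stream and dispatches each
    to the handler registered for its kind; unknown kinds are ignored.
    Stops when the sentinel value is received.
    """
    while True:
        msg = q.get()
        if msg is sentinel:
            break
        kind, payload = msg
        handler = handlers.get(kind)
        if handler is not None:
            handler(payload)
```

A single stream like this would let failure notifications and config updates share one connection per worker, which is exactly what makes the Lighthouse's per-worker connection count the scaling concern noted in (b).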

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 24, 2025
WarrenZhu050413 (Contributor, Author)

config UI [screenshot]
