Add config sharing from Lighthouse with UI support (#130) #202

WarrenZhu050413 (Contributor)
Implemented GetConfig RPC for the Lighthouse

Summary

Implemented and tested the GetConfig RPC functionality for the Lighthouse service, enabling configuration broadcasting from the Lighthouse to client applications. Along with this, UI support was added.

Tests

  • Added the GetConfig RPC to the Lighthouse
  • Added Python bindings for the RPC
  • Added test_get_config_rpc() in torchft/lighthouse_test.py to verify that the GetConfig RPC endpoint returns configuration data

Current Implementation Details

The GetConfig RPC currently:

  • Returns configuration data as a HashMap<String, String> (Python dict[str, str])
  • Supports a configurable timeout (default: 5 seconds)
  • Loads configuration from a JSON file specified via the --lighthouse-config parameter
  • Returns an empty configuration when no config file is provided
  • Handles invalid JSON by returning an empty configuration

Discussion: Incorporating config sharing into the entire torchFT workflow

I am personally really excited about incorporating config sharing into torchFT, as I believe it enables a lot of interesting possibilities.

With config sharing, the lighthouse could reconfigure the training process at runtime based on user input or information it already has, acting as an orchestrator for the training process.

Some use cases that I can think of include:
a) Dynamic Training Algorithm Configuration

  • Adjust DiLoCo synchronization frequency based on network conditions/loss curve convergence (this could be in the examples to showcase the config reconfiguration feature)
  • Change the optimizer that the training process uses at runtime
  • Modify learning rates based on training progress (rather than relying on an LR scheduler)
  • Enable interactive batch size adjustments
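To make the dynamic-configuration idea concrete, here is a minimal sketch of how a worker might apply a config snapshot each iteration. The keys (`sync_every`, `lr`, `batch_size`) and the function itself are hypothetical, not part of torchft.

```python
def apply_training_config(config: dict[str, str], state: dict) -> dict:
    """Hypothetical: apply a Lighthouse config snapshot to training state.

    All values arrive as strings (dict[str, str]), so each known key is
    parsed into its native type; unknown keys are ignored.
    """
    state = dict(state)  # leave the caller's state untouched
    if "sync_every" in config:   # DiLoCo synchronization frequency
        state["sync_every"] = int(config["sync_every"])
    if "lr" in config:           # runtime learning-rate override
        state["lr"] = float(config["lr"])
    if "batch_size" in config:   # interactive batch size adjustment
        state["batch_size"] = int(config["batch_size"])
    return state
```

Calling this at the top of each training step would let the lighthouse steer the hyperparameters listed above without restarting the job.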

b) Advanced Fault Tolerance

  • Coordinate parallelism structure changes upon worker failures
  • Reroute messages in MoE models when nodes fail
  • Rerouting to healthy workers in pipeline parallelism upon a worker failure
  • Have a time series model that predicts worker failures and preemptively reroute work to a healthy worker when a failure is predicted

With these in mind, I want feedback on two questions:

  1. Whether the use cases that I envision above belong in torchFT, or in some other newer, experimental framework
  2. Taking into account these factors, what is the best way for the Manager to use the config sharing functionality?

Approach 1: Request-Response via Manager

The most torchFT-idiomatic way to use the config sharing functionality is a two-stage RPC: Worker -> Manager -> Lighthouse -> Manager -> Worker. At the beginning of each training iteration, the ManagerClient can issue a GetConfig RPC to the Manager, which forwards it to the Lighthouse and then returns the config to the worker.
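The two-stage flow above can be sketched with plain Python objects standing in for the three services. The class and method names are hypothetical; in the real system these hops would be gRPC calls.

```python
class Lighthouse:
    """Stands in for the Lighthouse service holding the shared config."""

    def __init__(self, config: dict[str, str]):
        self._config = config

    def get_config(self) -> dict[str, str]:
        return dict(self._config)


class Manager:
    """Proxies GetConfig so workers never talk to the Lighthouse directly."""

    def __init__(self, lighthouse: Lighthouse):
        self._lighthouse = lighthouse

    def get_config(self) -> dict[str, str]:
        # Stage 2: Manager -> Lighthouse -> Manager.
        return self._lighthouse.get_config()


class Worker:
    def __init__(self, manager: Manager):
        self._manager = manager

    def step(self) -> dict[str, str]:
        # Stage 1: at the start of each iteration, Worker -> Manager.
        return self._manager.get_config()
```

Every worker under the same Manager sees the same snapshot, which matches the observation below that all clients under a ManagerServer naturally receive the same configuration.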

However, this approach has two limitations:

  1. Because all communication between the worker and the Lighthouse goes through a ManagerServer, the Lighthouse does not have direct access to the Worker. In the configuration context, it is natural for every client under a ManagerServer to receive the same configuration.
  2. Relying on request-response also decreases the speed at which configs are broadcast to the workers, since each worker has to request and wait for a response from the Lighthouse (this could be done through a prior asynchronous call to avoid blocking on the config, but the received config may then be stale by the time it is applied).
    1. A streaming approach could be adopted here. However, because the streaming code is itself relatively complex, managing a two-stage streaming handoff adds significant complexity and makes the code hard to maintain, as much of the logic has to be written twice.

Approach 2: Direct Streaming from the Lighthouse to the workers

Because of the above considerations, I am considering extending the experimental FailureStream implemented in PR 196 into a generic LighthouseStream that can handle multiple message types.

This addresses both limitations of the request-response approach.

However, it has its own problems:

a) It breaks the Worker -> Manager -> Lighthouse structure that torchFT has maintained so far
b) It potentially decreases the scalability of torchFT, since the Lighthouse must maintain connections to all the workers
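The "multiple message types over one stream" idea can be illustrated with a small dispatcher. Everything here is hypothetical: messages are modeled as `(kind, payload)` tuples on a queue standing in for the gRPC stream, and the message kinds (`"failure"`, `"config"`) are illustrative.

```python
import queue


def dispatch_lighthouse_stream(q: "queue.Queue", handlers: dict, sentinel=None) -> None:
    """Hypothetical generic LighthouseStream consumer.

    Pulls (kind, payload) messages from the stream and dispatches each
    to the handler registered for its kind; unknown kinds are ignored.
    Stops when the sentinel value is received.
    """
    while True:
        msg = q.get()
        if msg is sentinel:
            break
        kind, payload = msg
        handler = handlers.get(kind)
        if handler is not None:
            handler(payload)
```

A single stream like this would let failure notifications and config updates share one connection per worker, which is exactly what makes the Lighthouse's per-worker connection count the scaling concern noted in (b).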

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 24, 2025
WarrenZhu050413 (Contributor, Author)

config UI [screenshot]
