Add config sharing from Lighthouse with UI support (#130) #202
Implemented GetConfig RPC for the Lighthouse
Summary
Implemented and tested the GetConfig RPC functionality for the Lighthouse service, enabling configuration broadcasting from the lighthouse to client applications. Along with this, added UI support.
Tests
Added test_get_config_rpc() in torchft/lighthouse_test.py to test that the GetConfig RPC endpoint returns configuration data.

Current Implementation Details
The GetConfig RPC currently:
- Returns a HashMap<String, String> (Python dict[str, str])
- Is populated via the --lighthouse-config parameter
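For context, here is a hedged sketch of what exercising the new RPC from Python might look like. The LighthouseClient/get_config() names and the LighthouseServer constructor arguments are assumptions for illustration only and may not match what this PR actually exposes.

```python
# Hedged sketch of a GetConfig test. LighthouseClient and get_config() are
# assumed names for illustration and may differ from what this PR exposes;
# the LighthouseServer constructor arguments are likewise assumed.
from torchft._torchft import LighthouseServer, LighthouseClient


def test_get_config_rpc() -> None:
    server = LighthouseServer(bind="[::]:0", min_replicas=1)
    try:
        client = LighthouseClient(server.address())
        config = client.get_config()  # GetConfig RPC -> dict[str, str]
        assert isinstance(config, dict)
        assert all(
            isinstance(k, str) and isinstance(v, str) for k, v in config.items()
        )
    finally:
        server.shutdown()
```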

Discussion: Incorporating config sharing into the entire torchFT workflow
I am personally really excited about incorporating config sharing into torchFT, as I believe it enables a lot of interesting possibilities.
With config sharing, the lighthouse could reconfigure the training process at runtime based on user input or information it has, acting as an orchestrator for the training process.
Some use cases that I can think of include (illustrated with a sample payload after this list):
a) Dynamic Training Algorithm Configuration
b) Advanced Fault Tolerance
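To make these use cases more concrete, a broadcast config might look something like the following. The keys are purely illustrative and not part of this PR.

```python
# Purely illustrative key/value payload a lighthouse could broadcast.
example_config: dict[str, str] = {
    # a) Dynamic training algorithm configuration
    "sync_every": "500",         # e.g. adjust a sync interval at runtime
    "learning_rate": "3e-4",
    # b) Advanced fault tolerance
    "quorum_timeout_sec": "30",  # e.g. tune timeouts based on observed failures
    "checkpoint_interval": "100",
}
```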
With these in mind, I want feedback on two questions:
1. What is the best way for the Manager to use the config sharing functionality?

Approach 1: Request-Response via Manager
The most torchFT-idiomatic way to use the config sharing functionality is a two-stage RPC: Worker -> Manager -> Lighthouse -> Manager -> Worker. At the beginning of each training iteration, the ManagerClient can call a GetConfig RPC on the manager, which then sends the config back to the worker (a minimal sketch of this flow is shown below).
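Here is a minimal sketch of that flow. None of these class or method names are taken from this PR or the existing torchft API; they are stand-ins used only to illustrate the round trip.

```python
# Hypothetical sketch of the Worker -> Manager -> Lighthouse round trip.
# All names here are illustrative stand-ins, not the real torchft API.

class Manager:  # stand-in for the real torchft Manager
    def __init__(self, lighthouse_client):
        self._lighthouse = lighthouse_client

    def get_config(self) -> dict[str, str]:
        # The manager forwards the request to the lighthouse and relays the answer.
        return self._lighthouse.get_config()


class ManagerClient:  # stand-in for the real torchft ManagerClient
    def __init__(self, manager: Manager):
        self._manager = manager

    def get_config(self) -> dict[str, str]:
        return self._manager.get_config()


def train_step(client: ManagerClient, step: int) -> None:
    # Worker side, once per training iteration:
    config = client.get_config()  # two-stage RPC: worker -> manager -> lighthouse
    sync_every = int(config.get("sync_every", "500"))  # illustrative use of a key
    # ... run the actual training step using the refreshed config ...
```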
However, this approach has two limitations:
Approach 2: Direct Streaming from the Lighthouse to the workers
Because of the above considerations, I am considering extending the experimental FailureStream implemented in PR 196 into a generic LighthouseStream that can handle multiple message types (a rough sketch is given below). This addresses the two limitations of the Request-Response RPC approach.
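As a rough sketch of the idea, a generic stream could multiplex several message kinds and let each worker react to them as they arrive. The names and message shapes here are hypothetical and may look quite different from the actual FailureStream in PR 196.

```python
# Hypothetical sketch of a generic lighthouse -> worker stream that multiplexes
# several message kinds. Names and message shapes are illustrative only.
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Union


@dataclass
class FailureNotification:
    replica_id: str


@dataclass
class ConfigUpdate:
    config: dict[str, str]


LighthouseMessage = Union[FailureNotification, ConfigUpdate]


async def lighthouse_stream() -> AsyncIterator[LighthouseMessage]:
    """Stand-in for a server-streaming RPC that each worker subscribes to."""
    yield ConfigUpdate(config={"sync_every": "500"})
    await asyncio.sleep(0)
    yield FailureNotification(replica_id="replica_3")


async def worker_loop() -> None:
    async for msg in lighthouse_stream():
        if isinstance(msg, ConfigUpdate):
            # Apply the new config at the next safe point in training.
            print("new config:", msg.config)
        elif isinstance(msg, FailureNotification):
            print("failure reported for:", msg.replica_id)


asyncio.run(worker_loop())
```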
However, it has its own problems:
a) It breaks the Worker -> Manager -> Lighthouse structure that was previously maintained
b) It potentially decreases the scalability of torchFT, since the Lighthouse needs to maintain connections to all the workers