-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve log readability #122
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,7 +29,6 @@ | |
import logging | ||
import os | ||
import socket | ||
import uuid | ||
from concurrent.futures import ThreadPoolExecutor | ||
from datetime import timedelta | ||
from enum import Enum | ||
|
@@ -191,7 +190,7 @@ def __init__( | |
|
||
if replica_id is None: | ||
replica_id = "" | ||
replica_id = replica_id + str(uuid.uuid4()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wasn't sure why uuid was added to the replica_id, is this needed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ye, I was wondering the same thing. TorchTitan has to keep the original There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do this as a simplification for the quorum algorithm -- if a worker restarts quickly with the same name we may not detect that the quorum has changed and thus no-reconfiguration of the PGs will occur. Adding a random UUID to the worker means that we will always detect processes restart and thus correctly trigger a reconfiguration @H-Huang can you add this back and if it's convenient also add a comment explaining why this is the case? Also it would be nice to add a |
||
replica_id = replica_id | ||
self._manager = ManagerServer( | ||
replica_id=replica_id, | ||
lighthouse_addr=lighthouse_addr, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this macro access
self.replica_id
instead of taking it as a parameter?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would have to move
compute_quorum_results()
(torchft/src/manager.rs
Line 357 in 082753c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh this is fine then -- was just curious if we could clean it up a bit but this is fine