-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve log readability #122
Conversation
@@ -191,7 +190,7 @@ def __init__( | |||
|
|||
if replica_id is None: | |||
replica_id = "" | |||
replica_id = replica_id + str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't sure why uuid was added to the replica_id, is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ye, I was wondering the same thing. TorchTitan has to keep the original replicate_id
because this change every time due to uuid4
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do this as a simplification for the quorum algorithm -- if a worker restarts quickly with the same name we may not detect that the quorum has changed and thus no-reconfiguration of the PGs will occur. Adding a random UUID to the worker means that we will always detect processes restart and thus correctly trigger a reconfiguration
@H-Huang can you add this back and if it's convenient also add a comment explaining why this is the case? Also it would be nice to add a :
to it so replica_01234-1234
shows up as replica_0:1234-1234
. I've been meaning to fix this but haven't had a chance
|
@@ -191,7 +190,7 @@ def __init__( | |||
|
|||
if replica_id is None: | |||
replica_id = "" | |||
replica_id = replica_id + str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do this as a simplification for the quorum algorithm -- if a worker restarts quickly with the same name we may not detect that the quorum has changed and thus no-reconfiguration of the PGs will occur. Adding a random UUID to the worker means that we will always detect processes restart and thus correctly trigger a reconfiguration
@H-Huang can you add this back and if it's convenient also add a comment explaining why this is the case? Also it would be nice to add a :
to it so replica_01234-1234
shows up as replica_0:1234-1234
. I've been meaning to fix this but haven't had a chance
src/manager.rs
Outdated
($replica_id:expr, $($arg:tt)*) => { | ||
info!( | ||
"[Replica {}] {}", | ||
$replica_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this macro access self.replica_id
instead of taking it as a parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would have to move compute_quorum_results()
(
Line 357 in 082753c
fn compute_quorum_results( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh this is fine then -- was just curious if we could clean it up a bit but this is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Couple of changes to logging:
RUST_LOG
envlog_if_changed
to lighthouse.rs so it will only log if there is a message change. The quorum tick is pretty frequent by defaultinfo_with_replica
Open to suggestion / changes.
Previous log example:

full paste: P1747150901
New log example:

full paste: P1747151799