Pull requests: pytorch/torchft
Pull requests list
option 2 - call work.wait inside wrapped work
CLA Signed
This label is managed by the Meta Open Source bot.
#248 opened Jul 26, 2025 by tushar00jain
return work from manager allreduce
CLA Signed
#247 opened Jul 26, 2025 by tushar00jain
fix stream dependencies in callbacks
CLA Signed
#246 opened Jul 26, 2025 by tushar00jain
deep copy state dict for checkpoint
CLA Signed
#245 opened Jul 26, 2025 by tushar00jain
use http transport
CLA Signed
#244 opened Jul 26, 2025 by tushar00jain
option 1 - use block_current to overlap compute/communication
CLA Signed
#243 opened Jul 26, 2025 by tushar00jain
ProcessGroupNCCL: always eager init to avoid duplicate communicators for p2p ops
CLA Signed
#242 opened Jul 25, 2025 by d4l3k
fix compute/communication overlap for gloo
CLA Signed
#240 opened Jul 22, 2025 by tushar00jain
Fixing the issue with indentation on the landing page
CLA Signed
#227 opened Jul 9, 2025 by svekars
Add config sharing from Lighthouse with UI support (#130)
CLA Signed
#202 opened May 24, 2025 by WarrenZhu050413 • Draft
Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198)
CLA Signed
#200 opened May 22, 2025 by WarrenZhu050413
ParallelProcessGroup: 200gbps with Gloo -- what if we just run like 20 of them in parallel???
CLA Signed
#199 opened May 21, 2025 by d4l3k
Added proactive heartbeat timeout failure propagation (#164) (#188)
CLA Signed
#196 opened May 20, 2025 by WarrenZhu050413
Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment
CLA Signed
#189 opened May 5, 2025 by MattKotzbauer
Disable async quorum for the first quorum sync
CLA Signed
Test manager join
CLA Signed
#62 opened Jan 8, 2025 by Jackmin801 • Draft