
add debugability for baby pg #213

Merged
merged 2 commits on Jun 12, 2025

Conversation

tushar00jain
Contributor

@tushar00jain commented Jun 12, 2025

Summary:

  • running multiple processes has a few limitations
    • we can't get GPU profiles from subprocesses
    • results can differ because each subprocess uses a separate CUDA context and those contexts can't run concurrently, which makes it hard to tell whether something is wrong with the code or it's an artifact of the CUDA context
  • use multiprocessing.dummy to run the workers as threads instead of processes (see the sketch below)
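
A minimal sketch of the thread swap described above, with hypothetical names (`worker`, `USE_THREADS_FOR_DEBUGGING`), not the torchft code: `multiprocessing.dummy` exposes the `multiprocessing` API backed by threads, so the worker shares the parent's CUDA context and shows up in the parent's GPU profile.

```python
# Sketch only (hypothetical names, not the torchft code): run the worker in a
# thread via multiprocessing.dummy when debugging, or in a real subprocess.
import multiprocessing
import multiprocessing.dummy

USE_THREADS_FOR_DEBUGGING = True  # hypothetical debug switch


def worker(rank: int) -> None:
    # placeholder for the per-rank work (e.g. a baby NCCL allreduce)
    print(f"rank {rank} running")


if __name__ == "__main__":
    # multiprocessing.dummy mirrors the multiprocessing API but uses threads,
    # so the worker shares the parent's CUDA context and GPU profiler.
    ctx = multiprocessing.dummy if USE_THREADS_FOR_DEBUGGING else multiprocessing.get_context("spawn")
    p = ctx.Process(target=worker, args=(0,))
    p.start()
    p.join()
```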

Test Plan:
Using the patch with baby NCCL, we can get overlapping communication and computation:

image

We cannot get the overlap when using multiple processes, which indicates it has something to do with the CUDA context:

image
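
For reference, a hedged sketch of how a trace like the ones above can be captured with `torch.profiler` (the `train_step` function is a placeholder, not the code under test); overlap between NCCL kernels and compute kernels can then be inspected in the exported timeline.

```python
# Sketch: capture a CUDA timeline to check whether communication and
# computation kernels overlap. `train_step` is a placeholder.
from torch.profiler import ProfilerActivity, profile


def train_step() -> None:
    ...  # forward/backward plus the baby NCCL allreduce under test


with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()

prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto
```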

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 12, 2025
@tushar00jain force-pushed the pr213 branch 2 times, most recently from 0924fe1 to a362dda on June 12, 2025 17:29
@tushar00jain marked this pull request as ready for review on June 12, 2025 17:34
Member

@d4l3k left a comment


LGTM on changes in multiprocessing_dummy_context -- I'm assuming the first commit is the same as #211 and didn't review the rest

Summary:
- in future continuations, set the stream to the one used for the work so that unrelated streams don't pick up a dependency on the pg stream (otherwise they can end up depending on the allreduce stream)
- wait on the work sent to pg's immediately on the fragment streams (the ones used for allreduce) so that they depend on the pg stream but not on any future work submitted to those streams
- copy the grads before the allreduce so that the inner optimization can use them without creating a dependency between the default stream and the pg stream (a sketch of this stream discipline follows after this list)
- add back support for quantized allreduce in the manager
- change return types to be consistent with pg allreduce
- the future returned by the quantization collectives hangs (likely because set_result is never called?), so return the future directly from the pg instead
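
A minimal sketch of the second and third bullets, assuming an already initialized NCCL process group and hypothetical names (`fragment_stream`, `async_allreduce_grad`); this is illustrative, not the PR's implementation. With the NCCL backend, `work.wait()` is a stream-level wait, so calling it while the fragment stream is current makes that stream depend on the pg stream without blocking the CPU.

```python
# Sketch (assumed names): copy grads before the allreduce and wait on the
# work immediately on the fragment stream used for allreduce.
import torch
import torch.distributed as dist

fragment_stream = torch.cuda.Stream()  # hypothetical stream used for allreduce


def async_allreduce_grad(grad: torch.Tensor) -> torch.Tensor:
    # Copy on the current (default) stream so the inner optimizer can keep
    # using `grad` without tying the default stream to the pg stream.
    grad_copy = grad.clone()
    # Make sure the fragment stream sees the finished copy.
    fragment_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(fragment_stream):
        work = dist.all_reduce(grad_copy, async_op=True)
        # Stream-level wait: the fragment stream now depends on the pg
        # stream, but work submitted to the fragment stream later does not
        # become a dependency of this allreduce.
        work.wait()
    return grad_copy
```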

Test Plan:
- tested the changes with the nccl pg
- synchronizing on the recovery stream sometimes makes the CPU block on the collective (probably because some callback gets scheduled on the recovery stream?); we should stop synchronizing on the recovery stream when there is no need to
- calling `work.wait` on the work returned by the baby nccl pg makes the CPU block on the collective (because the two contexts can't overlap?)
- pg gloo needs us to call `future.wait` in the sync phase instead of the prepare phase, so we probably need a different wrapper
- same for the baby gloo pg

> Without Quantization

<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8f8dd694-a972-4bc6-96a0-8a79627a4d5d" />

> With Quantization

<img width="1123" alt="image" src="https://github.com/user-attachments/assets/b54288a3-9727-4956-89e7-c8b8775a98aa" />
Summary:
- running multiple processes has a few limitations
  - we can't get GPU profiles from subprocesses
  - results can differ because each subprocess uses a separate CUDA context and those contexts can't run concurrently, which makes it hard to tell whether something is wrong with the code or it's an artifact of the CUDA context
- use multiprocessing.dummy to run the workers as threads instead of processes

Test Plan:
Using the patch with baby NCCL, we can get overlapping communication and computation:

<img width="1539" alt="image" src="https://github.com/user-attachments/assets/39152858-1373-4318-8646-398141db3072" />

We cannot get the overlap when using multiple processes, which indicates it has something to do with the CUDA context:

<img width="1537" alt="image" src="https://github.com/user-attachments/assets/6b823d8e-a152-4678-a7e4-b6b8d6b6bb54" />
@tushar00jain merged commit 7898bfd into pytorch:main on Jun 12, 2025
13 of 18 checks passed
@tushar00jain deleted the pr213 branch on June 12, 2025 21:17