
add debugability for baby pg #213

Merged
merged 2 commits on Jun 12, 2025

Conversation

tushar00jain
Contributor

@tushar00jain commented Jun 12, 2025

Summary:

  • running multiple processes has a few limitations
    • we can't get GPU profiles from subprocesses
    • results can differ because each subprocess uses a separate CUDA context and those contexts can't run concurrently, which makes it hard to tell whether something is wrong with the code or it's an artifact of the CUDA context
  • use multiprocessing.dummy to run the workers as threads instead of processes (see the sketch below)
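
A minimal sketch of the thread swap described above, with hypothetical names (`worker`, `USE_THREADS_FOR_DEBUGGING`), not the torchft code: `multiprocessing.dummy` exposes the `multiprocessing` API backed by threads, so the worker shares the parent's CUDA context and shows up in the parent's GPU profile.

```python
# Sketch only (hypothetical names, not the torchft code): run the worker in a
# thread via multiprocessing.dummy when debugging, or in a real subprocess.
import multiprocessing
import multiprocessing.dummy

USE_THREADS_FOR_DEBUGGING = True  # hypothetical debug switch


def worker(rank: int) -> None:
    # placeholder for the per-rank work (e.g. a baby NCCL allreduce)
    print(f"rank {rank} running")


if __name__ == "__main__":
    # multiprocessing.dummy mirrors the multiprocessing API but uses threads,
    # so the worker shares the parent's CUDA context and GPU profiler.
    ctx = multiprocessing.dummy if USE_THREADS_FOR_DEBUGGING else multiprocessing.get_context("spawn")
    p = ctx.Process(target=worker, args=(0,))
    p.start()
    p.join()
```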

Test Plan:
Using the patch with baby NCCL, we can get overlapping communication and computation:

image

We cannot get the overlap when using multiple processes, which indicates it has something to do with the CUDA context:

image
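
For reference, a hedged sketch of how a trace like the ones above can be captured with `torch.profiler` (the `train_step` function is a placeholder, not the code under test); overlap between NCCL kernels and compute kernels can then be inspected in the exported timeline.

```python
# Sketch: capture a CUDA timeline to check whether communication and
# computation kernels overlap. `train_step` is a placeholder.
from torch.profiler import ProfilerActivity, profile


def train_step() -> None:
    ...  # forward/backward plus the baby NCCL allreduce under test


with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()

prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto
```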

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 12, 2025
@tushar00jain force-pushed the pr213 branch 2 times, most recently from 0924fe1 to a362dda on June 12, 2025 17:29
@tushar00jain marked this pull request as ready for review on June 12, 2025 17:34
Member

@d4l3k left a comment


LGTM on changes in multiprocessing_dummy_context -- I'm assuming the first commit is the same as #211 and didn't review the rest

Summary:
- in future continuations, set the stream to the one used for the work so that unrelated streams don't pick up a dependency on the pg stream (otherwise they can end up depending on the allreduce stream)
- wait on the work sent to pg's immediately on the fragment streams (the ones used for allreduce) so that they depend on the pg stream but not on any future work submitted to those streams
- copy the grads before the allreduce so that the inner optimization can use them without creating a dependency between the default stream and the pg stream (a sketch of this stream discipline follows after this list)
- add back support for quantized allreduce in the manager
- change return types to be consistent with pg allreduce
- the future returned by the quantization collectives hangs (likely because set_result is never called?), so return the future directly from the pg instead
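
A minimal sketch of the second and third bullets, assuming an already initialized NCCL process group and hypothetical names (`fragment_stream`, `async_allreduce_grad`); this is illustrative, not the PR's implementation. With the NCCL backend, `work.wait()` is a stream-level wait, so calling it while the fragment stream is current makes that stream depend on the pg stream without blocking the CPU.

```python
# Sketch (assumed names): copy grads before the allreduce and wait on the
# work immediately on the fragment stream used for allreduce.
import torch
import torch.distributed as dist

fragment_stream = torch.cuda.Stream()  # hypothetical stream used for allreduce


def async_allreduce_grad(grad: torch.Tensor) -> torch.Tensor:
    # Copy on the current (default) stream so the inner optimizer can keep
    # using `grad` without tying the default stream to the pg stream.
    grad_copy = grad.clone()
    # Make sure the fragment stream sees the finished copy.
    fragment_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(fragment_stream):
        work = dist.all_reduce(grad_copy, async_op=True)
        # Stream-level wait: the fragment stream now depends on the pg
        # stream, but work submitted to the fragment stream later does not
        # become a dependency of this allreduce.
        work.wait()
    return grad_copy
```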

Test Plan:
- tested the changes with the nccl pg
- synchronizing on the recovery stream sometimes makes the CPU block on the collective (probably because some callback gets scheduled on the recovery stream?); we should stop synchronizing on the recovery stream when there is no need to
- calling `work.wait` on the work returned by the baby nccl pg makes the CPU block on the collective (because the two contexts can't overlap?)
- pg gloo needs us to call `future.wait` in the sync phase instead of the prepare phase, so we probably need a different wrapper
- same for the baby gloo pg

> Without Quantization

<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8f8dd694-a972-4bc6-96a0-8a79627a4d5d" />

> With Quantization

<img width="1123" alt="image" src="https://github.com/user-attachments/assets/b54288a3-9727-4956-89e7-c8b8775a98aa" />
Summary:
- running multiple processes has a few limitations
  - we can't get GPU profiles from subprocesses
  - results can differ because each subprocess uses a separate CUDA context and those contexts can't run concurrently, which makes it hard to tell whether something is wrong with the code or it's an artifact of the CUDA context
- use multiprocessing.dummy to run the workers as threads instead of processes

Test Plan:
Using the patch with baby NCCL, we can get overlapping communication and computation:

<img width="1539" alt="image" src="https://github.com/user-attachments/assets/39152858-1373-4318-8646-398141db3072" />

We cannot get the overlap when using multiple processes, which indicates it has something to do with the CUDA context:

<img width="1537" alt="image" src="https://github.com/user-attachments/assets/6b823d8e-a152-4678-a7e4-b6b8d6b6bb54" />
@tushar00jain merged commit 7898bfd into pytorch:main on Jun 12, 2025
13 of 18 checks passed
@tushar00jain deleted the pr213 branch on June 12, 2025 21:17