Enable Domain Parallelism with ShardTensor #784
Merged
Conversation
…simple DDP sharding
…ieces are WIP but this has basic functionality supported for creation and forward usage.
…t of the ops have been validated, all that remains is to wrap the na2d function call to ensure it will dispatch properly.
…ng in unbind op rules.
….ops.aten.convolution.default.
…s also a minor bug in the backward pass that got more pronounced with smaller data: grad inputs were failing to properly collect haloed gradients and add them on the edges. Now fixed.
…gnificant overhead. I'm implementing here an option to switch to peer to peer message passing, since it might benefit from stream utilization in layers like natten.na2d. It's a developer choice currently, not a user choice.
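For illustration only, a peer-to-peer halo exchange along one dimension could be sketched with torch.distributed point-to-point primitives as below. This is not the PR's implementation; the function and argument names are assumptions.

```python
# Illustrative sketch of a peer-to-peer halo exchange along dim 0 using
# torch.distributed primitives. NOT the PR's implementation; `halo`, `left`,
# and `right` are assumed names for illustration.
from typing import Optional

import torch
import torch.distributed as dist


def p2p_halo_exchange(
    chunk: torch.Tensor, halo: int, left: Optional[int], right: Optional[int]
) -> torch.Tensor:
    """Exchange halo slabs with neighboring ranks and return the padded chunk."""
    reqs, recv_left, recv_right = [], None, None
    if left is not None:
        recv_left = torch.empty_like(chunk[:halo])
        reqs.append(dist.isend(chunk[:halo].contiguous(), dst=left))
        reqs.append(dist.irecv(recv_left, src=left))
    if right is not None:
        recv_right = torch.empty_like(chunk[-halo:])
        reqs.append(dist.isend(chunk[-halo:].contiguous(), dst=right))
        reqs.append(dist.irecv(recv_right, src=right))
    # Wait for all point-to-point transfers before assembling the padded chunk.
    for req in reqs:
        req.wait()
    pieces = [t for t in (recv_left, chunk, recv_right) if t is not None]
    return torch.cat(pieces, dim=0)
```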
…gnificant functionality changes in this commit.
Add `scatter_tensor` function to enable an easier transition to shard tensor. This function allows users to maintain data pipelines (on one rank) and easily scatter that data to a domain mesh.
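A hypothetical usage sketch of that workflow follows; the argument names of `scatter_tensor` are assumptions, and the real signature in modulus.distributed may differ.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

from modulus.distributed import scatter_tensor  # function added in this PR

# Launch with torchrun; one "spatial" mesh axis spanning all ranks.
world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,), mesh_dim_names=("spatial",))

# The data pipeline runs only on rank 0; other ranks receive their shard.
data = torch.randn(1, 3, 4096, 4096) if dist.get_rank() == 0 else None

# Hypothetical call: scatter `data` from global rank 0 across the spatial axis.
# The argument names and order here are assumptions about the signature.
sharded = scatter_tensor(data, global_src=0, mesh=mesh)
```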
But also, this adjusts the shard tensor mechanism for tracking shard info to use a dict instead of a list of tuples.
No real code changes applied here.
/blossom-ci
/blossom-ci
Reminder to change the target to
- First, the ability to transpose the sharding dimensions is supported. For square submeshes (2x2, for example), the output sharding will match the input sharding if it is uneven. This can only be supported if the number of devices in the output mesh dimension equals the number in the input dimension, hence the restriction to square submeshes. Other scenarios will apply DTensor-like chunk syntax, but return a ShardTensor tracking that split. Comprehensive tests on 1D and 2D meshes are included here. No testing is done at this time on 3D sharding / meshes.
- Second, the issues with torch.mean are intercepted and fixed. This uses a new dispatch intercept (below) that applies a weight to the mean and converts the Partial placement to a Partial(sum) with the weight applied. This has a bug that appears to be present in DTensor too: reductions over non-sharded dimensions appear to falter. To be fixed in a future release.
- Third, ShardTensor has a new class attribute to accommodate operator interceptions. The only functions intercepted at this time are variants of aten.mean; however, the expectation is to convert all monkey patching to this syntax.
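A minimal sketch of what such a class-level intercept registry could look like is shown below; the class and method names are illustrative assumptions, not the actual ShardTensor API.

```python
# Illustrative sketch (not the actual ShardTensor API) of a class-level
# registry that maps intercepted ATen ops to custom handlers.
import torch


class _InterceptRegistry:
    # Class attribute: op -> handler, consulted before normal dispatch.
    _intercepts: dict = {}

    @classmethod
    def register(cls, op):
        def decorator(handler):
            cls._intercepts[op] = handler
            return handler
        return decorator

    @classmethod
    def maybe_intercept(cls, op, *args, **kwargs):
        handler = cls._intercepts.get(op)
        if handler is not None:
            return handler(*args, **kwargs)
        return NotImplemented  # fall through to the default dispatch path


@_InterceptRegistry.register(torch.ops.aten.mean.default)
def _weighted_mean(x, *args, **kwargs):
    # A real handler would reweight per-shard means so that the Partial
    # placement reduces to the correct global mean over uneven shards.
    raise NotImplementedError("illustrative placeholder only")
```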
…don't require them to trigger elsewhere. If ShardTensor is used, the patches get applied. Also, minor updates to docs.
Target updated to 0.1.0.0-rc. All smaller comments from @akshaysubr resolved. Most residual bugs fixed. This version of the release supports ShardTensor for select models to enable domain parallelism, not yet as a general tool.
/multi-gpu-ci
/multi-gpu-ci
…Tests are coming in the next commit immediately after.
…sharding. Further, fixed an annoying bug in other distributed tests where OS environs weren't cleared after testing, and some tests would fail but only if others ran first. Now, all distributed tests use a context manager to change OS environment variables locally only.
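The context-manager pattern described here is standard Python; a minimal sketch follows (the helper name is an assumption, and the actual test utility may differ).

```python
import os
from contextlib import contextmanager


@contextmanager
def modified_environment(**updates):
    """Temporarily set OS environment variables, restoring the originals on exit."""
    originals = {key: os.environ.get(key) for key in updates}
    os.environ.update({k: str(v) for k, v in updates.items()})
    try:
        yield
    finally:
        for key, value in originals.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value


# Usage in a distributed test:
# with modified_environment(MASTER_ADDR="localhost", MASTER_PORT="29500",
#                           RANK="0", WORLD_SIZE="1"):
#     run_test()
```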
/multi-gpu-ci
- Enable dynamic (off by default) wrapping of layers by shard tensor. They are turned on automatically when a shard tensor is created.
- Rename the utils to manage env variables. Tests are failing with unusual CPU errors on ORD. Moving to GitHub runners ...
/multi-gpu-ci
/blossom-ci |
/blossom-ci |
akshaysubr approved these changes on Feb 20, 2025
pzharrington approved these changes on Feb 20, 2025
ktangsali pushed a commit that referenced this pull request on Mar 18, 2025
* Enable mesh-based parallelism as the configuration backend, even for simple DDP sharding
* Fix small typo in docstring
* Remove unnecessary functions with new interface
* Adding first implementation of ShardTensor prototype. Still several pieces are WIP but this has basic functionality supported for creation and forward usage.
* Working implementation of ShardTensor, though still somewhat incomplete.
* Adding work-in-progress examples. Be careful of sharp edges!
* A few more example pieces before natten will work out of the box. Most of the ops have been validated, all that remains is to wrap the na2d function call to ensure it will dispatch properly.
* Fix naming scheme
* Minor name change
* Add monkey patching for na2d operation with shard tensors
* Fix bug in shard tensor inference of global size. Check against sharding in unbind op rules.
* Enable backwards gradients for halo sharding and natten patch
* Convolution 2d backwards works, though would be better to catch torch.ops.aten.convolution.default.
* Fix missing import and ensure tensors are contiguous before allgather_v
* Clean up and remove unnecessary noise and printouts for debugging
* Unify (and correct!) the sharded convolution implementation. There was also a minor bug in the backward pass that got more pronounced with smaller data: grad inputs were failing to properly collect haloed gradients and add them on the edges. Now fixed.
* Remove noise from sharding utils.
* For smaller tensors, the alltoall step of halo reductions might be significant overhead. I'm implementing here an option to switch to peer-to-peer message passing, since it might benefit from stream utilization in layers like natten.na2d. It's a developer choice currently, not a user choice.
* Remove shard_utils file, it is a subfolder.
* Add modulus ShardTensor API documentation
* Clean up doc strings, type annotations and mesh implementation. No significant functionality changes in this commit.
* Add significant docstring / type annotation cleanup to ShardTensor. Add `scatter_tensor` function to enable an easier transition to shard tensor. This function allows users to maintain data pipelines (on one rank) and easily scatter that data to a domain mesh.
* Remove neighborhood attention prototypes
* Remove the rest of these examples since they are outdated and unnecessary
* Mostly, this commit is adding type annotations and doc strings. But also, this adjusts the shard tensor mechanism for tracking shard info to use a dict instead of a list of tuples.
* Clean up and document conv patches. No real code changes applied here.
* Clean up and improve documentation and type hints for shard utils worker functions
* Adding basic tests for shard tensor initialization and redistribution. There appears to be one corner case in redistribute to fix. TBD. Tests for grad propagation are coming.
* Add full working example of multilevel parallelism with pytorch FSDP and modulus ShardTensor
* Add missing type annotations
* Ensure scatter_tensor is available to import from modulus.distributed
* Update changelog and ensure wrapt is an optional dependency
* Update fsdp_and_shard_tensor.rst: update tutorial based on feedback from @pzharrington
* Update __init__.py: remove wildcard import.
* Update shard_tensor.py: fix spacing
* This is an essential bug fix for a missing import
* Update branch to pass CI tests.
* This commit provides several pieces:
  - First, the ability to transpose the sharding dimensions is supported. For square submeshes (2x2, for example), the output sharding will match the input sharding if it is uneven. This can only be supported if the number of devices in the output mesh dimension equals the number in the input dimension, hence the restriction to square submeshes. Other scenarios will apply DTensor-like chunk syntax, but return a ShardTensor tracking that split. Comprehensive tests on 1D and 2D meshes are included here. No testing is done at this time on 3D sharding / meshes.
  - Second, the issues with torch.mean are intercepted and fixed. This uses a new dispatch intercept (below) that applies a weight to the mean and converts the Partial placement to a Partial(sum) with the weight applied. This has a bug that appears to be present in DTensor too: reductions over non-sharded dimensions appear to falter. To be fixed in a future release.
  - Third, ShardTensor has a new class attribute to accommodate operator interceptions. The only functions intercepted at this time are variants of aten.mean; however, the expectation is to convert all monkey patching to this syntax.
* Update monkey patching to ensure patches get applied by modulus, and don't require them to trigger elsewhere. If ShardTensor is used, the patches get applied. Also, minor updates to docs.
* Codify ShardTensor and FSDP in tutorials.
* Apparently, codifying in rst requires double ticks.
* This commit fixes gradient propagation for unevenly sharded tensors. Tests are coming in the next commit immediately after.
* Add tests for shard tensor: initialization, resharding, and gradient sharding. Further, fixed an annoying bug in other distributed tests where OS environs weren't cleared after testing, and some tests would fail but only if others ran first. Now, all distributed tests use a context manager to change OS environment variables locally only.
* Two things done here:
  - Enable dynamic (off by default) wrapping of layers by shard tensor. They are turned on automatically when a shard tensor is created.
  - Rename the utils to manage env variables. Tests are failing with unusual CPU errors on ORD. Moving to GitHub runners ...
* Disable patched operations by default.
Modulus Pull Request
Description
This PR adds new capabilities to Modulus:
- ShardTensor is an extension to PyTorch's DTensor that enables uneven sharding of tensors across DeviceMesh objects. While some logical sharding constraints remain, this allows more dynamic and flexible operation on distributed input data, especially in cases where the input data shape and output data shape differ.
- ShardTensor also enables an ecosystem of operation extensions. Two major ones are included in this PR: convolutions (1D/2D/3D) and neighborhood attention. When the right components of modulus are imported, these operations (when performed on sharded tensors) will automatically compute halo regions and perform data transfers to produce results consistent with single-device outputs.
- Also included is documentation for ShardTensor, as well as an example of integrating multiple levels of parallelism by combining ShardTensor and PyTorch FSDP.
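As a rough sketch of the intended usage pattern (the mesh layout and integration points are assumptions rather than the exact API introduced here), combining FSDP parameter sharding on one mesh axis with domain-sharded inputs on another might look like the following, launched under torchrun with four GPUs and a recent PyTorch.

```python
# Hedged sketch: 2D device mesh with FSDP over one axis and spatial (domain)
# sharding over the other. Not the exact API added in this PR.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# 2D mesh: data-parallel groups x spatial (domain) groups.
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("ddp", "spatial"))

model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()

# FSDP shards parameters over the "ddp" axis; inputs sharded over the
# "spatial" axis would be ShardTensors, whose convolution / neighborhood
# attention patches handle halo exchanges automatically.
model = FSDP(model, device_mesh=mesh["ddp"])
```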
Checklist
Dependencies
Adds a dependency on wrapt for monkey-patching operations on sharded inputs.
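For reference, a hedged sketch of the wrapt-based patching pattern this dependency supports, assuming natten is installed; the module path and the sharded fallback are assumptions, not the exact patch in this PR.

```python
# Hedged sketch of wrapt-based monkey patching: route calls to a sharded
# implementation when the inputs are ShardTensors, otherwise call the
# original function untouched. Module path ("natten") is an assumption.
import wrapt


def _sharded_na2d(*args, **kwargs):
    raise NotImplementedError("placeholder for a halo-aware sharded implementation")


def _na2d_wrapper(wrapped, instance, args, kwargs):
    # Dispatch to the sharded variant only when a ShardTensor is present.
    if any(type(a).__name__ == "ShardTensor" for a in args):
        return _sharded_na2d(*args, **kwargs)
    return wrapped(*args, **kwargs)


# Patch in place; wrapt preserves the wrapped function's signature and metadata.
wrapt.wrap_function_wrapper("natten", "na2d", _na2d_wrapper)
```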