[mxfp8 moe training] use dim1 cast cuda kernel in bwd #2897
Conversation
As of commit e64d4b5 with merge base 83a20c7.
stack-info: PR: #2897, branch: danielvegamyhre/stack/64
    MXGemmKernelChoice,
    ScaleCalculationMode,
)
from torchao.prototype.mx_formats.mx_linear import _to_mxfp8_dim1_kernel_wrapper
nit: move to utils file if you want to reuse it
done
Stacked PRs:
[mxfp8 moe training] use dim1 cast cuda kernel in bwd
This way we can eliminate the .contiguous() call and use the dim1 cast CUDA kernel, which is the fastest option (vs. inductor codegen or Triton).