Conversation

danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Aug 27, 2025

Stacked PRs:


[mxfp8 moe training] add per group blocked scale kernels for 2d input activations

Summary

  • We currently use a PyTorch loop-based implementation for converting mxfp8 scales to block swizzled format on a per-group basis.
  • This is suboptimal for performance and incurs a device-to-host (d2h) sync.
  • This PR implements a Triton kernel that performs the conversion without a d2h sync or host-side looping.
  • Note: to simplify the kernel, we pre-compute the start row of each group in the block-padded output scales tensor (see compute_per_group_blocked_scale_offsets). This takes just a couple of standard torch ops and shouldn't cause a d2h sync. There is probably still room for optimization by doing this in the kernel somehow, but we'll take things one step at a time.
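For illustration, the offset pre-computation described above can be sketched in pure Python (the real compute_per_group_blocked_scale_offsets uses torch ops; the 128-row block size here is an assumption based on the usual blocked-swizzle scale layout):

```python
def blocked_scale_group_start_rows(rows_per_group, block_rows=128):
    """Start row of each group in the block-row-padded output scales tensor.

    Each group's row count is rounded up to a multiple of block_rows, and
    the running sum gives the start row of the next group.
    """
    starts, acc = [], 0
    for rows in rows_per_group:
        starts.append(acc)
        # ceil-divide, then pad this group's rows up to a multiple of block_rows
        acc += -(-rows // block_rows) * block_rows
    return starts

print(blocked_scale_group_start_rows([100, 300, 50]))  # → [0, 128, 512]
```

In the real kernel these offsets are computed once on device (a cumsum over padded group sizes), so no d2h transfer is needed to launch the per-group work.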

Test plan

  • pytest test/prototype/moe_training/test_kernels.py -k blocked

Performance

  • Low memory bandwidth utilization, but 14x faster than the existing torch implementation:
input_shape      torch_time_us    triton_time_us    torch_mem_bw_gbps    triton_mem_bw_gbps  triton_speedup
-------------  ---------------  ----------------  -------------------  --------------------  ----------------
(16640, 160)           866.848            60.416                6.261                89.831  14.35x
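For reference, the bandwidth columns are presumably total bytes moved divided by kernel time; a rough sketch (the exact byte accounting, e.g. padding in the blocked output, is an assumption, which is why the figure below only approximates the table's 89.8 GB/s):

```python
def mem_bw_gbps(total_bytes: int, time_us: float) -> float:
    # bytes / (time_us * 1e-6 s) / 1e9 bytes-per-GB == bytes / (time_us * 1e3)
    return total_bytes / (time_us * 1e3)

# A (16640, 160) e8m0 scale tensor (1 byte/elem) read once and written once,
# ignoring output padding:
total = 2 * 16640 * 160
print(round(mem_bw_gbps(total, 60.416), 1))  # → 88.1
```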


pytorch-bot bot commented Aug 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2886

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

# We track how many row blocks we have iterated through.
block_row_id = 0
current_start_row = input_group_start_row
while current_start_row < input_group_end_row:
Contributor Author

@danielvegamyhre commented Aug 27, 2025
Note for reviewer: I think we can probably do this without a loop, and just parallelize across row blocks as well (like in the original impl for dense models). Need to think about it some more.
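In Python terms, the loop in question walks fixed-size row blocks within a single group's row range (a simplified sketch; names are illustrative and the 128-row block size is an assumption):

```python
def group_row_blocks(group_start_row, group_end_row, block_rows=128):
    """Yield (block_row_id, start_row, end_row) for each row block of one group."""
    block_row_id = 0
    current_start_row = group_start_row
    while current_start_row < group_end_row:
        yield (block_row_id,
               current_start_row,
               min(current_start_row + block_rows, group_end_row))  # last block may be partial
        block_row_id += 1
        current_start_row += block_rows

print(list(group_row_blocks(0, 300)))  # → [(0, 0, 128), (1, 128, 256), (2, 256, 300)]
```

The parallelization idea in the comment would replace this serial walk with one program instance per row block, as in the dense-model kernel.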

Contributor
Let's add as a follow-up / TODO.

@danielvegamyhre danielvegamyhre merged commit 4ecc89e into main Aug 28, 2025
13 of 17 checks passed
Labels: CLA Signed, mx, topic: not user facing