
Conversation

danielvegamyhre (Contributor) commented Aug 23, 2025

Stacked PRs:

  • [moe fp8 training] use transpose method when quantizing to avoid uncoalesced gmem accesses

Summary

  • Integrate the new per-group rowwise scaling method into MoE training and update benchmarks (a rough sketch of the transpose-based quantization idea is below).
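
A rough PyTorch-level sketch of the idea named in the title, not torchao's actual kernel: when the FP8 scales are needed along the dimension that is not contiguous in memory, quantize the transposed copy instead, so the absmax reduction and the cast walk contiguous memory (coalesced global-memory accesses). The per-group layout, `group_offsets`, and the function name are assumptions for illustration.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def per_group_colwise_fp8(x: torch.Tensor, group_offsets: torch.Tensor):
    # x: [total_tokens, dim] row-major; group_offsets: prefix sums of per-expert
    # token counts. Returns the FP8 data laid out as [dim, total_tokens] plus
    # one scale per (column, group).
    xt = x.transpose(0, 1).contiguous()  # reductions below now read contiguous memory
    out = torch.empty_like(xt, dtype=FP8_DTYPE)
    scales = []
    start = 0
    for end in group_offsets.tolist():
        g = xt[:, start:end]                                    # [dim, group_len]
        amax = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        s = FP8_MAX / amax.float()
        out[:, start:end] = (g * s).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
        scales.append(1.0 / s)
        start = end
    return out, torch.cat(scales, dim=1)                        # [dim, num_groups]
```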

Benchmarks

  • We now see a ~10% TPS increase over bf16 at 2 experts per device, which declines gradually to a ~1% speedup at 16 experts per device (EP=1).

Benchmarks below use a 2-layer Llama4 debug model with dim=5120 (full size) and torch.compile.

| Experts per device | FSDP degree | Dtype | Median Tokens/Second | Max Memory Usage (GiB) | Speedup vs. BF16 |
|---|---|---|---|---|---|
| 2  | 2 | BF16 | 39003.0 | 45.12 | -     |
| 2  | 2 | FP8  | 43062.0 | 45.03 | 10.4% |
| 4  | 2 | BF16 | 37238.0 | 50.04 | -     |
| 4  | 2 | FP8  | 40027.5 | 49.83 | 7.5%  |
| 8  | 2 | BF16 | 34851.5 | 59.87 | -     |
| 8  | 2 | FP8  | 36867.0 | 60.98 | 5.7%  |
| 16 | 4 | BF16 | 32673.0 | 63.48 | -     |
| 16 | 4 | FP8  | 33282.5 | 59.43 | 1.08% |
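
For reference, a minimal sketch of how a median tokens/second figure like the ones above can be collected for a compiled training step; `train_step`, the batch, and the iteration counts are placeholders, not the actual benchmark script used here.

```python
import time
import torch

def median_tokens_per_second(train_step, batch, tokens_per_batch: int,
                             warmup: int = 10, iters: int = 50) -> float:
    """Time a compiled training step and report median tokens/second."""
    step = torch.compile(train_step)
    for _ in range(warmup):           # let compilation and autotuning finish
        step(batch)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step(batch)
        torch.cuda.synchronize()      # include all queued GPU work in the timing
        times.append(time.perf_counter() - t0)
    return tokens_per_batch / sorted(times)[len(times) // 2]
```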

FP8 dense-only versus dense + MoE

  • 2 experts per device, 2 devices: https://www.internalfb.com/phabricator/paste/view/P1919493833
    • Dense: 2.7% speedup
    • Dense + MoE: 7.2% speedup
    • Not sure why we only got 7.2% instead of 10.4% like in yesterday's runs, but this still shows that MoE accounts for the majority of the speedup. I have noticed I get slightly different numbers depending on which CUDA_VISIBLE_DEVICES I set. A rough sketch of how the two configurations can be selected is below.
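
A minimal sketch of how the dense-only vs. dense + MoE configurations could be selected, assuming torchao's `convert_to_float8_training` entry point with a `module_filter_fn(module, fqn)` callback; the FQN pattern and the MoE conversion path are assumptions, not the exact setup used for these runs.

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumed entry point

def dense_only_filter(module: nn.Module, fqn: str) -> bool:
    # Swap only dense linears to FP8; skip anything under the expert modules.
    # The "experts" substring is a hypothetical naming convention.
    return isinstance(module, nn.Linear) and "experts" not in fqn

# Dense-only run (the 2.7% configuration above):
#   convert_to_float8_training(model, module_filter_fn=dense_only_filter)
#
# The dense + MoE run (7.2%) additionally converts the grouped expert GEMMs
# via the MoE training path integrated in this PR; that call is not shown here.
```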

pytorch-bot bot commented Aug 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2864

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit d3830dc with merge base 253d65a:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@danielvegamyhre force-pushed the danielvegamyhre/stack/57 branch from afd9cb6 to 7af9f68 on August 23, 2025 23:42
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from 3848c56 to cf93326 on August 23, 2025 23:42
@meta-cla bot added the CLA Signed label on Aug 23, 2025
@danielvegamyhre changed the base branch from danielvegamyhre/stack/57 to main on August 24, 2025 00:13
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from cf93326 to ee9f562 on August 24, 2025 00:13
@danielvegamyhre changed the base branch from main to danielvegamyhre/stack/57 on August 24, 2025 00:14
@danielvegamyhre added the topic: not user facing label on Aug 24, 2025
[moe fp8 training] use transpose method when quantizing to avoid uncoalesced gmem accesses

stack-info: PR: #2864, branch: danielvegamyhre/stack/58
@danielvegamyhre changed the base branch from danielvegamyhre/stack/57 to main on August 24, 2025 01:08
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from ee9f562 to d3830dc on August 24, 2025 01:08
@danielvegamyhre changed the base branch from main to danielvegamyhre/stack/57 on August 24, 2025 01:08