
Conversation

danielvegamyhre (Contributor) commented Aug 23, 2025

Stacked PRs:

  • [moe fp8 training] use transpose method when quantizing to avoid uncoalesced gmem accesses

Summary

  • Integrate the new per-group rowwise scaling method into MoE training and update benchmarks (a rough sketch of the transpose-based quantization idea is below).
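
A rough PyTorch-level sketch of the idea named in the title, not torchao's actual kernel: when the FP8 scales are needed along the dimension that is not contiguous in memory, quantize the transposed copy instead, so the absmax reduction and the cast walk contiguous memory (coalesced global-memory accesses). The per-group layout, `group_offsets`, and the function name are assumptions for illustration.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def per_group_colwise_fp8(x: torch.Tensor, group_offsets: torch.Tensor):
    # x: [total_tokens, dim] row-major; group_offsets: prefix sums of per-expert
    # token counts. Returns the FP8 data laid out as [dim, total_tokens] plus
    # one scale per (column, group).
    xt = x.transpose(0, 1).contiguous()  # reductions below now read contiguous memory
    out = torch.empty_like(xt, dtype=FP8_DTYPE)
    scales = []
    start = 0
    for end in group_offsets.tolist():
        g = xt[:, start:end]                                    # [dim, group_len]
        amax = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        s = FP8_MAX / amax.float()
        out[:, start:end] = (g * s).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
        scales.append(1.0 / s)
        start = end
    return out, torch.cat(scales, dim=1)                        # [dim, num_groups]
```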

Benchmarks

  • We now see a ~10% TPS increase over bf16 at 2 experts per device, which declines gradually to a ~1% speedup at 16 experts per device (EP=1).

Benchmarks below use a 2-layer Llama4 debug model with dim=5120 (full size) and torch.compile.

| Experts per device | FSDP degree | Dtype | Median Tokens/Second | Max Memory Usage (GiB) | Speedup vs. BF16 |
|---|---|---|---|---|---|
| 2  | 2 | BF16 | 39003.0 | 45.12 | -     |
| 2  | 2 | FP8  | 43062.0 | 45.03 | 10.4% |
| 4  | 2 | BF16 | 37238.0 | 50.04 | -     |
| 4  | 2 | FP8  | 40027.5 | 49.83 | 7.5%  |
| 8  | 2 | BF16 | 34851.5 | 59.87 | -     |
| 8  | 2 | FP8  | 36867.0 | 60.98 | 5.7%  |
| 16 | 4 | BF16 | 32673.0 | 63.48 | -     |
| 16 | 4 | FP8  | 33282.5 | 59.43 | 1.08% |
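
For reference, a minimal sketch of how a median tokens/second figure like the ones above can be collected for a compiled training step; `train_step`, the batch, and the iteration counts are placeholders, not the actual benchmark script used here.

```python
import time
import torch

def median_tokens_per_second(train_step, batch, tokens_per_batch: int,
                             warmup: int = 10, iters: int = 50) -> float:
    """Time a compiled training step and report median tokens/second."""
    step = torch.compile(train_step)
    for _ in range(warmup):           # let compilation and autotuning finish
        step(batch)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step(batch)
        torch.cuda.synchronize()      # include all queued GPU work in the timing
        times.append(time.perf_counter() - t0)
    return tokens_per_batch / sorted(times)[len(times) // 2]
```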

FP8 dense-only versus dense + MoE

  • 2 experts per device, 2 devices: https://www.internalfb.com/phabricator/paste/view/P1919493833
    • Dense: 2.7% speedup
    • Dense + MoE: 7.2% speedup
    • Not sure why we only got 7.2% instead of 10.4% like in yesterday's runs, but this still shows that MoE accounts for the majority of the speedup. I have noticed I get slightly different numbers depending on which CUDA_VISIBLE_DEVICES I set. A rough sketch of how the two configurations can be selected is below.
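
A minimal sketch of how the dense-only vs. dense + MoE configurations could be selected, assuming torchao's `convert_to_float8_training` entry point with a `module_filter_fn(module, fqn)` callback; the FQN pattern and the MoE conversion path are assumptions, not the exact setup used for these runs.

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumed entry point

def dense_only_filter(module: nn.Module, fqn: str) -> bool:
    # Swap only dense linears to FP8; skip anything under the expert modules.
    # The "experts" substring is a hypothetical naming convention.
    return isinstance(module, nn.Linear) and "experts" not in fqn

# Dense-only run (the 2.7% configuration above):
#   convert_to_float8_training(model, module_filter_fn=dense_only_filter)
#
# The dense + MoE run (7.2%) additionally converts the grouped expert GEMMs
# via the MoE training path integrated in this PR; that call is not shown here.
```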

pytorch-bot bot commented Aug 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2864

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit d3830dc with merge base 253d65a:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@danielvegamyhre force-pushed the danielvegamyhre/stack/57 branch from afd9cb6 to 7af9f68 on August 23, 2025 23:42
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from 3848c56 to cf93326 on August 23, 2025 23:42
@meta-cla bot added the CLA Signed label on Aug 23, 2025
@danielvegamyhre changed the base branch from danielvegamyhre/stack/57 to main on August 24, 2025 00:13
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from cf93326 to ee9f562 on August 24, 2025 00:13
@danielvegamyhre changed the base branch from main to danielvegamyhre/stack/57 on August 24, 2025 00:14
@danielvegamyhre added the topic: not user facing label on Aug 24, 2025
[moe fp8 training] use transpose method when quantizing to avoid uncoalesced gmem accesses

stack-info: PR: #2864, branch: danielvegamyhre/stack/58
@danielvegamyhre changed the base branch from danielvegamyhre/stack/57 to main on August 24, 2025 01:08
@danielvegamyhre force-pushed the danielvegamyhre/stack/58 branch from ee9f562 to d3830dc on August 24, 2025 01:08
@danielvegamyhre changed the base branch from main to danielvegamyhre/stack/57 on August 24, 2025 01:08