Description
Creating this issue as a roadmap/tracker for enabling float8 training for MoEs with token-choice routing. Both core requirements as well as ideas for additional performance optimizations are included.
UPDATE 07/22/2025: revised priorities to reflect shifting focus from fp8 rowwise => fp8 blockwise and mxfp8
This is not an exhaustive list, but it highlights the primary milestones and requirements.
Compute
- fp8 rowwise
- Add torch._scaled_grouped_mm kernel in core
- Add differentiable scaled grouped mm with dynamic float8 rowwise quant in torchao
- Add custom kernels in torchao for performing per-group scaling on device, to avoid host-device sync
- Faster inductor codegen kernels for dynamic quant of 3d tensors along dim1: Inductor codegen for float8 dynamic quantization ops for scaled_grouped_mm backward pass is slow pytorch#159769
- alternatively, a handwritten Triton kernel that is faster than torch.compile for this ([moe training] add fp8 rowwise kernels for expert weights #2696)
- quantizing the 3d expert weights also needs to be faster: [MoE fp8 rowwise training] Runtime of quantizing 3d expert weights scales worse than linearly #2880
- fp8 blockwise
- quant primitives
- DeepGEMM integration for fp8 blockwise grouped GEMM
- triton kernels to do scaling per group without d2h sync
- mxfp8
- mxfp8 scaled grouped gemm Add MXFP8 Support to scaled_grouped_gemm pytorch#153502
- 2d-3d gemm for output and dX ([mxfp8 moe] add support for fbgemm 2d-3d mx8mx8bf16 grouped gemm #2848)
- 2d-2d gemm for dW
- torchao differentiable _scaled_grouped_mm support for mxfp8 recipe for dynamic quant before grouped GEMMs
- triton kernels for per token group scale conversion to blocked swizzled format
- for 2d inputs ([mxfp8 moe training] add per group blocked scale kernels #2886)
- for 3d expert weights
Communication
I looked at traces and validated that the "all-to-all dispatch -> grouped GEMM -> all-to-all combine" stages are sequentially dependent, so in theory faster/lower-precision comms should improve performance. There is some overlap with the shared expert computation, but it is not 100%, so there is room for optimization. This will be especially important if/when the all-to-all spans multiple nodes, where inter-node network bandwidth is lower than intra-node NVLink bandwidth.
This is also inspired by the DeepSeekV3 paper where, if I understand correctly, they do the a2a dispatch in fp8 but keep the a2a combine in bf16, as they found the combine more sensitive to low precision during training.
- Add on device all_to_all_v kernels compatible with:
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1)
- token permutation kernel supports low-precision dtypes by permuting the scales into the proper order for the permuted tokens (link)
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1)
Torchao UX
- Add a tensor subclass (ScaledGroupedMMTensor) with an op override for torch.ops.aten._grouped_mm that runs the differentiable scaled grouped mm
- Add a one-line model conversion API that recursively swaps the nn.Parameter data tensors of the expert weights with ScaledGroupedMMTensor
- support configurable recipe (fp8 blockwise/rowwise, mxfp8)
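A rough sketch of what this UX could look like. Both ScaledGroupedMMTensor's internals and the conversion entry point are simplified stand-ins, not the actual torchao implementation:

```python
import torch
import torch.nn as nn

class ScaledGroupedMMTensor(torch.Tensor):
    """Minimal stand-in for the proposed subclass. The real one would
    override torch.ops.aten._grouped_mm (via __torch_function__ /
    __torch_dispatch__) to run the differentiable scaled grouped mm."""
    @staticmethod
    def __new__(cls, data: torch.Tensor):
        return torch.Tensor._make_subclass(cls, data, data.requires_grad)

def convert_moe_training(model: nn.Module, target_fqns=("experts",)) -> nn.Module:
    """Hypothetical one-line conversion API: recursively swap the expert
    weights' nn.Parameter data with ScaledGroupedMMTensor. The actual
    torchao entry point and its recipe config may differ."""
    for fqn, module in model.named_modules():
        if any(t in fqn for t in target_fqns):
            for name, param in list(module.named_parameters(recurse=False)):
                new_param = nn.Parameter(
                    ScaledGroupedMMTensor(param.data),
                    requires_grad=param.requires_grad,
                )
                setattr(module, name, new_param)
    return model
```

Matching on fully-qualified names keeps the router and shared experts in bf16 while only the routed expert weights get the subclass treatment.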
Compile support
- Compile support for torch._grouped_mm
- Differentiable _scaled_grouped_mm can compile with fullgraph=True
- E2E compilation of each TransformerBlock in torchtitan after MoE conversion via the tensor subclass approach (fullgraph=False)
- E2E compilation of each TransformerBlock in torchtitan after MoE conversion via the tensor subclass approach (fullgraph=True)
Distributed support
- Composability with FSDP2 (will likely need something like this for the new tensor subclass)
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1) [float8 moe training] FSDP support #2413
- Composability with TP
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1) [moe training] Add TP support for routed experts #2473
- Composability with FSDP + TP
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1) [moe training] Add 2D parallel (FSDP2 + TP) tests for routed experts #2475
- Composability with dp2ep as implemented here: dp2ep Expert Parallel torchtitan#1324
- mxfp8 (P0)
- float8 blockwise (P0)
- float8 rowwise (P1) [WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts #2481