Description
- Kernel name: `_triton_fp8_rowwise_3d_transpose_cast_rhs_kernel`
- Inputs: column-major input tensor of shape (E, K, N) and scales tensor of shape (E, K)
- Outputs: column-major output tensor, transposed and cast to fp8 rowwise, of shape (E, N, K)
- Repro:
- Checkout this PR: [moe fp8 training] use transpose method when quantizing to avoid uncoalesced gmem accesses #2864
- Option 1: Run test for this kernel:
pytest test/prototype/moe_training/test_kernels.py -k test_fp8_rowwise_3d_transpose_rhs_atomic
- Option 2: Run bench comparing this kernel to other implementations:
python benchmarks/prototype/moe_training/benchmark_rowwise_3d_quant_kernels.py
- This will run 3 implementations to compare perf, but I am only concerned about the `_triton_fp8_rowwise_3d_transpose_scales_rhs_kernel`
- TTIR: https://www.internalfb.com/phabricator/paste/view/P1918713582
- TTGIR: https://www.internalfb.com/phabricator/paste/view/P1918714134
- Warnings/remarks: https://www.internalfb.com/phabricator/paste/view/P1918025647
(Screenshot omitted: source lines flagged with bank conflicts.)

Additional context
With #2864 we have modest speedups for MoE fp8 rowwise training when the number of experts per device is <= 16.
I did some perf analysis to determine why perf regressed as the number of experts grew, and found that 2 specific kernels are the culprits. Both quantize the 3d expert weights tensor: the 1st time in the forward pass for `out = input @ weight.t()`, and the 2nd time on the non-transposed tensor for `grad_input = grad_output_t @ weight`.
When scaling up from 4 experts to 16 experts, the kernels quantizing the input activations have the same runtime, as expected, since the inputs are the same size. However, the 2 weight-quantization kernels described above take 6x as long (for weights that are only 4x as big).
- The kernel used in the forward pass to quantize `weight^T` is code-generated by Inductor.
- The kernel used in the backward pass to quantize `weight` (non-transposed) is handwritten (this one), since Inductor was too slow. This kernel is ~20% faster than the Inductor version but still scales poorly, as described above.
NCU shows 71% memory bandwidth utilization, but flags (1) scoreboard stalls / shared-memory bank conflicts and (2) low occupancy due to register pressure:
