Documenting issues with mxfp8 grouped gemm and repro commands:
Prerequisite: install the torch and fbgemm-gpu-genai nightlies (CUDA 12.8 on B200): `pip3 install --pre torch fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu128`
Using uniform group sizes, test cases pass for M=2048, but other values such as M=1024 or M=16640 hit a CUDA illegal memory access.
- Repro:
  - Check out torchao PR: [mxfp8 moe] add support for fbgemm 2d-3d mx8mx8bf16 grouped gemm #2848
  - Verify the current unit tests pass, enabling logging so we can log input shapes/strides:
    `pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k test_mxfp8_grouped_gemm_with_dq_fwd_bwd -s --log-cli-level=INFO`
  - In the test `test_mxfp8_grouped_gemm_with_dq_fwd_bwd`, change `M` to 1024. Rerun the pytest command above and observe the CUDA errors.
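For context, uniform group offsets are just equal steps up to `M` (the test builds them with `torch.arange`). A minimal sketch in plain Python, with illustrative names rather than the actual test code:

```python
def uniform_offsets(M: int, n_groups: int) -> list[int]:
    """End offsets of n_groups equal-sized token groups covering M rows.
    Illustrative sketch; the real test constructs these with torch.arange."""
    assert M % n_groups == 0, "uniform groups require n_groups to divide M"
    group_size = M // n_groups
    return [group_size * (i + 1) for i in range(n_groups)]
```

So for M=1024 and 4 groups, the offsets are `[256, 512, 768, 1024]` — only the total row count changes between the passing and failing cases, not the offset structure.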
Using non-uniform group sizes results in CUDA illegal memory access errors.
- Repro:
  - Check out torchao PR: [mxfp8 moe] add support for fbgemm 2d-3d mx8mx8bf16 grouped gemm #2848
  - Verify the current unit tests pass (same as above), enabling logging so we can log input shapes/strides:
    `pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k test_mxfp8_grouped_gemm_with_dq_fwd_bwd -s --log-cli-level=INFO`
  - In the same test, `test_mxfp8_grouped_gemm_with_dq_fwd_bwd`, change the group offsets to randomly generated ones (using multiples of 32) by commenting out the line `offs = torch.arange(...)` and un-commenting the line `offs = generate_jagged_offs(...)`.
  - Rerun the unit test and observe the CUDA illegal memory access errors.
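A generator like `generate_jagged_offs` can be sketched as follows (hypothetical implementation in plain Python; the real helper lives in the torchao test utilities and may differ in details):

```python
import random

def jagged_offsets(M: int, n_groups: int, multiple: int = 32,
                   seed: int = 0) -> list[int]:
    """Random non-uniform end offsets over M rows, where every group
    size is a positive multiple of `multiple`. Hypothetical sketch of
    what generate_jagged_offs produces; not the actual torchao helper."""
    assert M % multiple == 0 and M // multiple >= n_groups
    slots = M // multiple  # number of `multiple`-sized chunks to distribute
    rng = random.Random(seed)
    # pick distinct interior cut points so every group is non-empty
    cuts = sorted(rng.sample(range(1, slots), n_groups - 1))
    return [c * multiple for c in cuts] + [M]
```

The key property is that group sizes stay multiples of 32 (the MX block size) but are otherwise arbitrary, so they are generally not multiples of 128.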
- For this one, I suspect the issue may be in my `to_blocked_per_group_2d` or `to_blocked_per_group_3d` functions, which convert MXFP8 e8m0 scales to a blocked format on a per-token-group basis. I implemented these functions using the FBGEMM unit test as a reference, but that test only exercises uniform group sizes, so there could be a gap.
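To make the suspected gap concrete, here is a sketch (in numpy, with illustrative names; a best-effort rendering of the 128x4-tile blocked scale layout, not the exact torchao/FBGEMM code) of this kind of per-group rearrangement. Each group's scale rows are padded to a 128-row boundary independently before swizzling, so group sizes that are not multiples of 128 change the padded footprint relative to a single whole-tensor conversion:

```python
import numpy as np

def to_blocked(scales: np.ndarray) -> np.ndarray:
    """Rearrange a (rows, cols) e8m0 scale matrix into a 128x4-tiled
    blocked layout, zero-padding rows to 128 and cols to 4.
    Sketch of the commonly used block-scale swizzle; details may
    differ from the actual kernel's expected layout."""
    rows, cols = scales.shape
    padded = np.pad(scales, ((0, -rows % 128), (0, -cols % 4)))
    R, C = padded.shape
    # split into 128x4 tiles
    blocks = padded.reshape(R // 128, 128, C // 4, 4).transpose(0, 2, 1, 3)
    # within each tile, interleave rows into a 32x16 pattern
    rearranged = (blocks.reshape(-1, 4, 32, 4)
                        .transpose(0, 2, 1, 3)
                        .reshape(-1, 32, 16))
    return rearranged.reshape(-1)

def to_blocked_per_group_2d(scales: np.ndarray,
                            offsets: list[int]) -> np.ndarray:
    """Apply the blocked rearrangement to each token group independently.
    Each group is padded to 128 rows on its own -- the step where a
    non-multiple-of-128 group size could introduce a layout mismatch."""
    start, out = 0, []
    for end in offsets:
        out.append(to_blocked(scales[start:end]))
        start = end
    return np.concatenate(out)
```

For example, a (128, 4) scale tensor split into two 64-row groups produces 1024 blocked elements (each group pads to a full 128x4 tile), versus 512 for the same tensor converted as one group — exactly the kind of size/offset disagreement that could produce an illegal memory access if the kernel assumes a different convention.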