
[FA] Unifying TMA Kernels with Warp Specialization Flag #232


Open · wants to merge 2 commits into `main`

Conversation

@codingwithsurya commented May 27, 2025

Summary:
This PR consolidates the redundant TMA attention kernels into a unified implementation. Previously, `_attn_fwd_tma` and `_attn_fwd_tma_ws` contained duplicate code (mainly the TMA descriptor setup) and didn't leverage the existing `ENABLE_WS` flag.

I've merged the redundant kernels into a single `_attn_fwd_tma_unified` kernel. We now use the `ENABLE_WS` flag to toggle between regular and warp-specialized execution; a minimal sketch of this pattern follows the Changes list below.

Changes:

  • Merged both kernels into a single `_attn_fwd_tma_unified` kernel that handles both the regular and warp-specialized paths
  • Used the existing `ENABLE_WS` parameter to control warp specialization
  • Unified the TMA descriptor creation logic
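
A minimal sketch of this flag-based consolidation, for illustration only: it is not the actual tritonbench kernel. The body is placeholder arithmetic rather than flash attention, and the names `_attn_fwd_tma_unified_sketch` and `launch` are hypothetical; the point is just that a single `tl.constexpr` flag such as `ENABLE_WS` lets one jitted kernel compile into both the regular and warp-specialized variants.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _attn_fwd_tma_unified_sketch(X, Out, N,
                                 BLOCK: tl.constexpr,
                                 ENABLE_WS: tl.constexpr):
    # Placeholder body: the real kernel computes flash attention with TMA
    # loads; this only demonstrates the compile-time branch on ENABLE_WS.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(X + offs, mask=mask)
    if ENABLE_WS:
        y = x * 2.0  # stands in for the warp-specialized schedule
    else:
        y = x + 1.0  # stands in for the regular schedule
    tl.store(Out + offs, y, mask=mask)


def launch(x: torch.Tensor, variant: str) -> torch.Tensor:
    # One entry point: the variant name selects the constexpr value, so the
    # compiler still specializes each path separately.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 128),)
    _attn_fwd_tma_unified_sketch[grid](x, out, n, BLOCK=128,
                                       ENABLE_WS=(variant == "tma_ws"))
    return out
```

For example, `launch(torch.randn(4096, device="cuda"), "tma")` and `launch(torch.randn(4096, device="cuda"), "tma_ws")` exercise the two specializations of the same source.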

Test Plan:
Unit Tests and Benchmarking

WITH_TMA=1  python run.py --op flash_attention --only triton_tutorial_flash_v2_tma_ws --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128
WITH_TMA=1  python run.py --op flash_attention --only triton_tutorial_flash_v2_tma --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128
python -m unittest test/test_gpu/main.py -k test_gpu_tritonbench_flash_attention

The performance metrics before and after the code change are identical (4.39805e+12 FLOPS).

Follow-up PR for Base + Opt Kernels: #233

Differential Revision: D75307125 and D75308966 (follow-up diff)

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D75308966

@codingwithsurya changed the title from "Refactor TMA kernel variant handling for improved readability (#231)" to "[FA] Unifying TMA Kernels with Warp Specialization Flag" on May 27, 2025
facebook-github-bot pushed a commit that referenced this pull request May 27, 2025
Summary:


Separated the TMA kernel variant handling into distinct code paths rather than using a conditional parameter.

Changed from a unified approach with a dynamic `is_warp_specialized` flag to explicit, separate conditions for the `tma` and `tma_ws` variants. This improves code clarity by making the execution path more explicit and makes it easier for the compiler to optimize.

Differential Revision: D75308966
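
For readers skimming the thread, here is a hypothetical, purely illustrative sketch of the two dispatch styles the commit message above contrasts. The helper names (`attn_forward_unified`, `attn_forward_explicit`, `_run_tma`, `_run_tma_ws`) are made up and the bodies are stubs; the point is only why explicit per-variant branches can be clearer than threading a dynamic `is_warp_specialized` flag through a single call site.

```python
def _run_tma(q, k, v):
    # Stand-in for launching the regular TMA kernel path.
    return "tma"


def _run_tma_ws(q, k, v):
    # Stand-in for launching the warp-specialized TMA kernel path.
    return "tma_ws"


# Before: one call site decides the path through a dynamic flag.
def attn_forward_unified(variant, q, k, v):
    is_warp_specialized = variant == "tma_ws"
    return _run_tma_ws(q, k, v) if is_warp_specialized else _run_tma(q, k, v)


# After: each variant gets its own explicit, statically obvious branch.
def attn_forward_explicit(variant, q, k, v):
    if variant == "tma":
        return _run_tma(q, k, v)
    if variant == "tma_ws":
        return _run_tma_ws(q, k, v)
    raise ValueError(f"unknown TMA variant: {variant!r}")
```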

@manman-ren (Contributor)

Thanks for working on this! The patch looks good. We have been focusing on the non-causal case, so maybe test with:

WITH_TMA=1 CUDA_VISIBLE_DEVICES=5 python run.py --op flash_attention --only triton_tutorial_flash_v2_tma_ws --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128
WITH_TMA=1 CUDA_VISIBLE_DEVICES=5 python run.py --op flash_attention --only triton_tutorial_flash_v2_tma --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128

We need WITH_TMA=1 to actually enable TMA.
If you haven't imported the diff to fbsource yet, please do so; it will trigger tests with our internal Triton.

@codingwithsurya (Author) commented May 28, 2025

> Thanks for working on this! The patch looks good. We have been focusing on the non-causal case, so maybe test with:
>
> WITH_TMA=1 CUDA_VISIBLE_DEVICES=5 python run.py --op flash_attention --only triton_tutorial_flash_v2_tma_ws --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128
> WITH_TMA=1 CUDA_VISIBLE_DEVICES=5 python run.py --op flash_attention --only triton_tutorial_flash_v2_tma --num-inputs 1 --seq-len 8192 --metrics tflops --batch 8 --n-heads 16 --d-head 128
>
> We need WITH_TMA=1 to actually enable TMA. If you haven't imported the diff to fbsource yet, please do so; it will trigger tests with our internal Triton.

Thanks for letting me know! I have tested it with WITH_TMA=1 and it works. I have updated the test plan.

I exported it from fbsource. For reference, the diff in fbsource is here (this specific PR is the first two diffs in the stack).

@codingwithsurya codingwithsurya self-assigned this May 28, 2025
@mandroid6 (Contributor) left a comment

LGTM!
