Thanks for your work! I noticed that in the Sparse implementation, TMA is not used for KV loading. Has there been any performance comparison between this approach and the TMA-based loading in FA3?
You can change the codebase to use Ampere-style LDGSTS (cp.async) for a contiguous KV cache.
As far as I remember, with CUDA 12.4, LDGSTS gets around 570+ TFLOPs/s for long-context causal attention, while TMA gets you ~600 TFLOPs/s. CUDA 12.3 might give higher throughput, but I haven't tried it myself.