Thanks for your work! I noticed that in the Sparse implementation, TMA is not used for KV loading. Has there been any performance comparison between this approach and the TMA-based loading in FA3?
You can change the codebase to use Ampere-style LDGSTS (cp.async) for a contiguous KV cache.
As far as I remember, with CUDA 12.4, LDGSTS gets around 570+ TFLOPs/s for long-context causal attention, while TMA gets you ~600 TFLOPs/s. CUDA 12.3 might give higher throughput, but I haven't tried it myself.