Optimize triton flashMLA for Iluvatar GPU #1188
Open
Optimizations have been implemented on the Iluvatar GPU for scenarios that meet two conditions:
The implementation ensures that each kv_cache load stays within a single page, so the loaded data is contiguous in memory and the QKV loads are faster. The tile size has also been adjusted to the hardware characteristics, which gives more reasonable resource utilization and better overall performance. Compared with the original kernel implementation, which does not meet the above conditions (only num_stages is changed to 1; num_warps stays at 4), the benchmark with an input of 6K and an output of 2K runs approximately 11 times faster.
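To illustrate why keeping each load inside one page matters, here is a minimal sketch in plain Python. It is not the kernel from this PR: `tile_slots`, `is_contiguous`, the block table contents, the page size of 64, and the tile length of 32 are all hypothetical values chosen for the example. It only shows that a KV tile aligned to a page boundary maps to consecutive physical cache slots, while a tile that straddles two pages does not.

```python
def tile_slots(block_table, page_size, tile_start, tile_len):
    """Return the physical cache slots touched by one KV tile.

    block_table maps a logical page index to a physical page index
    (hypothetical layout, assumed for illustration only).
    """
    slots = []
    for tok in range(tile_start, tile_start + tile_len):
        physical_page = block_table[tok // page_size]
        slots.append(physical_page * page_size + tok % page_size)
    return slots


def is_contiguous(slots):
    """True if every slot directly follows the previous one."""
    return all(b - a == 1 for a, b in zip(slots, slots[1:]))


# Hypothetical paged cache: logical pages 0, 1, 2 live in physical pages 7, 2, 5.
block_table = [7, 2, 5]
page_size = 64

# Tile starts on a page boundary and its length divides the page size.
aligned = tile_slots(block_table, page_size, tile_start=64, tile_len=32)

# Tile starts mid-page and crosses into the next logical page.
straddling = tile_slots(block_table, page_size, tile_start=48, tile_len=32)

print(is_contiguous(aligned))     # aligned tile: one contiguous range
print(is_contiguous(straddling))  # straddling tile: two separate ranges
```

When the tile length divides the page size and tiles start on multiples of the tile length, no tile can cross a page boundary, so each load can be issued as one contiguous read instead of a per-token gather.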