[RFC] Improve coalescing/layout conversion logic #2007
davidberard98
started this conversation in
Ideas
Replies: 1 comment
Hello, I recently ran into a similar problem. A layout conversion is inserted at tt.store, which sometimes causes bank conflicts and hurts performance. How is this RFC going?
TL;DR: We have a use case for a pointwise operation where layout conversions appear to hurt performance. This is a proposal to change the coalescing pass logic to reduce the number of layout conversions. Before we look further into this idea, we’d appreciate feedback on whether this idea sounds good, whether a PR to fix this would be acceptable (e.g. I’ve seen some comments that this code path may be modified significantly as part of Hopper work), and whether there might be other side effects from this change that we didn’t foresee.
The proposal is to provide more flexibility when deciding layouts. Currently, most ops are assigned a default layout while certain other ops (e.g. loads) are assigned special layouts, which means that a layout conversion is required wherever a value passes between one of these specially laid-out ops and an op in the default layout. Instead, we suggest choosing layouts for unspecified ops in a way that reduces the number of layout conversions. More details are shown below.
Motivation - demonstration of excess layout conversions
TL;DR: In the example kernel, the majority of the instructions are run in “blocked1” layout; but some of the loads are converted to “blocked” layout. The layout conversion between “blocked” and “blocked1” appears to increase the latency for this kernel. Initially on A100, latency is 92us; after removing the layout conversion (with a hacky patch), latency is 67us.
The example kernel I’m testing with is linked here: https://gist.github.com/davidberard98/c0cc39f3a2324936abbfe5d8c98eba48 - the triton kernel section is shown inline below:
Logically, this kernel does the following:
The corresponding TTGIR shows some layout conversions:
It contains 4 layout conversions:
Note that the “blocked” layout is needed for the loads and stores in order to enable vectorization.
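To make the vectorization point concrete, here is a toy model of how a 1-D blocked layout distributes elements across threads. It is only a sketch: the function names, the thread count, and the `size_per_thread` values are illustrative assumptions and are not taken from this kernel's TTGIR. The idea is that a load can only be vectorized across elements that a single thread holds contiguously.

```python
def owner_thread(index, size_per_thread, num_threads):
    """Thread that owns element `index` in a toy 1-D blocked layout."""
    return (index // size_per_thread) % num_threads


def elements_of_thread(thread, size_per_thread, num_threads, num_elements):
    """All element indices held by `thread`, in increasing order."""
    return [
        i
        for i in range(num_elements)
        if owner_thread(i, size_per_thread, num_threads) == thread
    ]


def longest_contiguous_run(indices):
    """Length of the longest run of consecutive indices (a rough upper bound on vector width)."""
    best = run = 1 if indices else 0
    for prev, cur in zip(indices, indices[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best


# Hypothetical configuration: 128 threads, 1024 elements.
wide = elements_of_thread(0, size_per_thread=4, num_threads=128, num_elements=1024)
narrow = elements_of_thread(0, size_per_thread=1, num_threads=128, num_elements=1024)

print(wide[:8], longest_contiguous_run(wide))
print(narrow[:8], longest_contiguous_run(narrow))
```

In this model, the layout with four elements per thread gives each thread runs of four adjacent elements, which is what a vectorized access needs, while the layout with one element per thread gives it only strided, isolated elements.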
In a patch (only applicable for this specific kernel), we tried converting the layout to “blocked” everywhere, which eliminates the need for layout conversions. This patch shows a speedup from 92us to 67us.
Proposed changes to coalescing & layout conversions
My understanding of coalescing and layout conversions:
Proposal: instead of assigning the default blocked layout to all other tensor values, we can first assign layouts for ops that require specific layouts (like loads and stores), and then choose the layouts for other ops in order to reduce/minimize the number of layout conversions. I haven't looked closely enough to have specific details, but one example of how to do this is described below:
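As a rough illustration of the general idea (not the actual Triton pass, and not necessarily the approach the RFC has in mind), the toy model below compares the two strategies on a small dataflow graph. Ops with a required layout are treated as fixed "anchors"; every other op either gets a default layout or takes the layout most common among its already-assigned neighbors. The graph, the op names, and the greedy voting heuristic are all assumptions made for this sketch, and the heuristic does not guarantee a minimal number of conversions.

```python
from collections import Counter


def count_conversions(edges, layout):
    """A conversion is needed on every edge whose endpoints disagree on layout."""
    return sum(1 for src, dst in edges if layout[src] != layout[dst])


def assign_default(nodes, anchors, default):
    """Baseline: anchors keep their layout, everything else gets the default."""
    return {n: anchors.get(n, default) for n in nodes}


def assign_greedy(nodes, edges, anchors, default):
    """Heuristic: unassigned ops adopt the most common layout among assigned neighbors."""
    layout = dict(anchors)
    neighbors = {n: [] for n in nodes}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n in layout:
                continue
            votes = Counter(layout[m] for m in neighbors[n] if m in layout)
            if votes:
                layout[n] = votes.most_common(1)[0][0]
                changed = True
    # Ops not connected to any anchor fall back to the default layout.
    for n in nodes:
        layout.setdefault(n, default)
    return layout


# Hypothetical pointwise kernel: two loads, two arithmetic ops, one store.
nodes = ["load_a", "load_b", "add", "mul", "store"]
edges = [("load_a", "add"), ("load_b", "add"), ("add", "mul"), ("mul", "store")]
anchors = {"load_a": "blocked", "load_b": "blocked", "store": "blocked"}

baseline = assign_default(nodes, anchors, default="blocked1")
greedy = assign_greedy(nodes, edges, anchors, default="blocked1")

print("baseline conversions:", count_conversions(edges, baseline))
print("greedy conversions:", count_conversions(edges, greedy))
```

On this toy graph the baseline needs a conversion on every edge that touches a load or store, while the greedy assignment places all ops in the anchors' layout and needs none. A real pass would also have to weigh the cost of running non-memory ops in a layout chosen for memory access.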