Replies: 1 comment
-
It's just tiling. For example, if the original matmul has shape M=512, N=512, K=512, you can split it across 4 blocks, each working on a shape of M=256, K=512, N=256. If the device has enough resources, there is no need to split further. But when the shape is large, the memory can't hold the MK or KN data, so data has to be exchanged between registers and DDR. Splitting along M or N solves this problem.
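To make the tiling idea concrete, here is a minimal plain-Python sketch. It is only an illustration: the function name `tiled_matmul` and its parameters are hypothetical, and the example uses tiny matrices rather than 512x512 so that it runs quickly. Each (M, N) tile is computed independently over the full K dimension, which is what lets separate blocks work on separate tiles.

```python
def tiled_matmul(a, b, block_m, block_n):
    """Multiply a (M x K) by b (K x N), one output tile at a time.

    Hypothetical helper for illustration only. Returns the product and
    the number of output tiles that were visited.
    """
    M, K, N = len(a), len(b), len(b[0])
    c = [[0] * N for _ in range(M)]
    tiles = 0
    # Each (m0, n0) pair identifies one output tile; on a GPU each tile
    # would be handled by its own block.
    for m0 in range(0, M, block_m):
        for n0 in range(0, N, block_n):
            tiles += 1
            for i in range(m0, min(m0 + block_m, M)):
                for j in range(n0, min(n0 + block_n, N)):
                    # The tile still reduces over the full K dimension.
                    c[i][j] = sum(a[i][k] * b[k][j] for k in range(K))
    return c, tiles


# Scaled-down stand-in for the 512 / 256 example: 8x8 matrices, 4x4 tiles.
a = [[i + k for k in range(8)] for i in range(8)]
b = [[k * j + 1 for j in range(8)] for k in range(8)]
c, tiles = tiled_matmul(a, b, block_m=4, block_n=4)
print(tiles)
```

With M=N=512 and 256x256 tiles, the same two outer loops would visit 2 x 2 = 4 tiles, matching the 4 blocks in the example above.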
-
While studying the Triton Tutorial, I found the "Super-Grouping" code in the Matrix Multiplication section confusing.
Below is the code related to grouping:
I'm wondering about the second-to-last line:
pid_m = first_pid_m + (pid % group_size_m)
Intuitively, the code should be the following, correct?
pid_m = first_pid_m + (pid % num_pid_in_group % group_size_m)
How can we drop the
num_pid_in_group
step, i.e. what guarantees that pid % group_size_m is equal to pid % num_pid_in_group % group_size_m? I have found a combination that makes this equality fail:
However, it breaks the requirement that BLOCK_SIZE_M, BLOCK_SIZE_N, and GROUP_SIZE_M be powers of 2.
Can anyone help prove the equality when all the prerequisites are fulfilled?
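One way to explore this empirically is to simulate the index arithmetic in plain Python, outside of Triton. The sketch below is not the tutorial's code: it assumes the grouping logic has the usual structure, with `num_pid_in_group = GROUP_SIZE_M * num_pid_n`, `group_id = pid // num_pid_in_group`, `first_pid_m = group_id * GROUP_SIZE_M`, and `group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)`. The names `num_pid_m`, `num_pid_n`, and `group_id`, and the helpers `tile_ids` and `compare`, are assumptions made for illustration. It works directly with tile counts, so BLOCK_SIZE_M and BLOCK_SIZE_N do not appear.

```python
def tile_ids(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, reduce_first):
    """Map a linear program id to (pid_m, pid_n) under the assumed grouping.

    reduce_first=False uses  pid % group_size_m
    reduce_first=True  uses  pid % num_pid_in_group % group_size_m
    """
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    # The last group may be smaller when num_pid_m is not a multiple
    # of GROUP_SIZE_M.
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    local = pid % num_pid_in_group if reduce_first else pid
    pid_m = first_pid_m + (local % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


def compare(num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Check whether both formulas agree and whether each visits every tile."""
    total = num_pid_m * num_pid_n
    short = [tile_ids(p, num_pid_m, num_pid_n, GROUP_SIZE_M, False) for p in range(total)]
    long_ = [tile_ids(p, num_pid_m, num_pid_n, GROUP_SIZE_M, True) for p in range(total)]
    all_tiles = {(m, n) for m in range(num_pid_m) for n in range(num_pid_n)}
    return {
        "formulas_agree": short == long_,
        "short_covers_all": set(short) == all_tiles,
        "long_covers_all": set(long_) == all_tiles,
    }


print(compare(num_pid_m=4, num_pid_n=4, GROUP_SIZE_M=2))  # only full groups
print(compare(num_pid_m=7, num_pid_n=2, GROUP_SIZE_M=4))  # partial last group
```

Under these assumptions, the two formulas agree whenever every group is full, because `group_size_m` is then `GROUP_SIZE_M`, which divides `num_pid_in_group`. In the second call the last group has only 3 rows and the two formulas assign different `pid_m` values there, even though GROUP_SIZE_M is a power of 2; both versions still visit every tile exactly once in that case, only in a different order.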