Super-Grouping code Question #3016

zh-plus · 2024-01-25T09:48:01Z

While studying the Triton Tutorial, I found the "Super-Grouping" code in the Matrix Multiplication section confusing.

Below is the code related to grouping:

# Program ID
pid = tl.program_id(axis=0)
# Number of program ids along the M axis
num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
# Number of program ids along the N axis
num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
# Number of programs in group
num_pid_in_group = GROUP_SIZE_M * num_pid_n
# Id of the group this program is in
group_id = pid // num_pid_in_group
# Row-id of the first program in the group
first_pid_m = group_id * GROUP_SIZE_M
# If `num_pid_m` isn't divisible by `GROUP_SIZE_M`, the last group is smaller
group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
# *Within groups*, programs are ordered in a column-major order
# Row-id of the program in the *launch grid*
pid_m = first_pid_m + (pid % group_size_m)
# Col-id of the program in the *launch grid*
pid_n = (pid % num_pid_in_group) // group_size_m
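To see the ordering these lines produce, the same arithmetic can be run in plain Python for a small grid. This is a sketch, not Triton code: `grouped_tile` and `launch_order` are hypothetical helper names, and the 4x3 grid with GROUP_SIZE_M = 2 is an illustrative choice in which every group is full.

```python
def grouped_tile(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Map a linear program id to (pid_m, pid_n) with the quoted arithmetic."""
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


def launch_order(num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Return a num_pid_m x num_pid_n table holding the pid that owns each tile."""
    order = [[None] * num_pid_n for _ in range(num_pid_m)]
    for pid in range(num_pid_m * num_pid_n):
        pid_m, pid_n = grouped_tile(pid, num_pid_m, num_pid_n, GROUP_SIZE_M)
        order[pid_m][pid_n] = pid
    return order


# Assumed example: 4 row-blocks, 3 column-blocks, groups of 2 rows.
for row in launch_order(4, 3, 2):
    print(row)
# [0, 2, 4]
# [1, 3, 5]
# [6, 8, 10]
# [7, 9, 11]
```

Within each group of two rows the pids run down a column first and then move to the next column, which is the column-major order the comments describe.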

I'm wondering about the second-to-last line of code:
pid_m = first_pid_m + (pid % group_size_m)

Intuitively, the code should be pid_m = first_pid_m + (pid % num_pid_in_group % group_size_m), correct?
Why is it valid to drop num_pid_in_group, i.e. how do we know that pid % group_size_m is equal to pid % num_pid_in_group % group_size_m?
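One way to see when the two expressions agree is to split pid into its group offset and its position inside the group. The shorthand is not from the tutorial and is introduced only for this sketch: P = num_pid_in_group, G = GROUP_SIZE_M, n = num_pid_n, g = group_id, s = group_size_m, and r = pid % num_pid_in_group.

```latex
\begin{aligned}
\mathrm{pid} &= gP + r, \qquad P = G\,n, \qquad 0 \le r < P, \\
\mathrm{pid} \bmod s &= (gP + r) \bmod s, \\
\mathrm{pid} \bmod s = r \bmod s &\iff s \mid gP .
\end{aligned}
```

In a full group s = G, and G divides P = G n, so the condition always holds. In a partial last group s = num_pid_m - gG is smaller than G and need not divide gP, so that is the only place where the two expressions can differ.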

I have found a combination of arguments for which this equality fails:

def cdiv(a, b):
    return (a + b - 1) // b


def get_pidmn(M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M, pid):
    num_pid_m = cdiv(M, BLOCK_SIZE_M)
    num_pid_n = cdiv(N, BLOCK_SIZE_N)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)

    test_pid_m = first_pid_m + (pid % num_pid_in_group % group_size_m)
    pid_m = first_pid_m + (pid % group_size_m)
    assert test_pid_m == pid_m
    
    pid_n = (pid % num_pid_in_group) // group_size_m

    print(pid_m, pid_n, sep=', ')


if __name__ == '__main__':
    get_pidmn(14, 18, 3, 2, 3, 30)

However, this combination breaks the requirement that BLOCK_SIZE_M, BLOCK_SIZE_N, and GROUP_SIZE_M should be powers of 2.

Can anyone help prove the equality when all the prerequisites are fulfilled?
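One way to probe this is a brute-force check over every pid in the launch grid. The sketch below is plain Python, not Triton; `tile_ids`, `first_mismatch`, and `covers_grid_once` are hypothetical helper names, and the sizes in the usage lines are illustrative. It suggests that the power-of-2 requirement alone may not be enough: M itself is unconstrained, so num_pid_m need not be a multiple of GROUP_SIZE_M, and the two formulas can disagree inside a partial last group.

```python
def cdiv(a, b):
    return (a + b - 1) // b


def tile_ids(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, reduce_first):
    """Map a linear pid to (pid_m, pid_n).

    reduce_first=False uses the quoted `pid % group_size_m`;
    reduce_first=True uses `pid % num_pid_in_group % group_size_m`.
    """
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    local_pid = pid % num_pid_in_group
    row_source = local_pid if reduce_first else pid
    pid_m = first_pid_m + (row_source % group_size_m)
    pid_n = local_pid // group_size_m
    return pid_m, pid_n


def first_mismatch(M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M):
    """Return (pid, quoted, reduced) for the first pid where the formulas differ, else None."""
    num_pid_m = cdiv(M, BLOCK_SIZE_M)
    num_pid_n = cdiv(N, BLOCK_SIZE_N)
    for pid in range(num_pid_m * num_pid_n):
        quoted = tile_ids(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, False)
        reduced = tile_ids(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, True)
        if quoted != reduced:
            return pid, quoted, reduced
    return None


def covers_grid_once(M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M):
    """True if the quoted formula assigns every tile to exactly one pid."""
    num_pid_m = cdiv(M, BLOCK_SIZE_M)
    num_pid_n = cdiv(N, BLOCK_SIZE_N)
    tiles = [tile_ids(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, False)
             for pid in range(num_pid_m * num_pid_n)]
    expected = [(m, n) for m in range(num_pid_m) for n in range(num_pid_n)]
    return sorted(tiles) == expected


# Assumed sizes: all block/group sizes are powers of 2, but M = 112 gives
# 7 row-blocks, so the last group has 3 rows instead of 4.
print(first_mismatch(112, 16, 16, 16, 4))    # (4, (5, 0), (4, 0))
print(covers_grid_once(112, 16, 16, 16, 4))  # True
print(first_mismatch(128, 16, 16, 16, 4))    # None
```

In the case tried here the quoted formula still assigns each tile to exactly one program, so the difference is only in the order of rows within the last group, not in which tiles get computed.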

gavin838 · 2024-09-09T02:20:09Z

It's just tiling. For example, if the original shape of the matmul is M=512, N=512, K=512, you can split it into 4 blocks. In each block the shape is M=256, K=512, N=256, and if the device has enough resources, it doesn't need to be split again. But when the shape is big, the memory can't hold the MK or KN data, so data has to be exchanged between registers and DDR. Splitting M or N solves these problems.
origin_shape:512x512x512(MKN)
first_tiling(Block):256x512x512
second_tiling(sub_Block):128x512x512
third_tiling:128x128x128(Block_sizeM/K/N)
