[RFC] Improve coalescing/layout conversion logic #2007

davidberard98 · 2023-07-28T22:07:23Z


TL;DR: We have a use case for a pointwise operation where layout conversions appear to hurt performance. This is a proposal to change the coalescing pass logic to reduce the number of layout conversions. Before we look further into it, we’d appreciate feedback on whether the idea sounds good, whether a PR implementing it would be acceptable (e.g. I’ve seen some comments that this code path may be modified significantly as part of the Hopper work), and whether the change might have side effects that we didn’t foresee.

The proposal is to provide more flexibility when deciding layouts. Currently, most ops are assigned a default layout while certain other ops (e.g. loads) are assigned special layouts, which means that a layout conversion is required wherever a value crosses between the default layout and a special layout. Instead, we suggest choosing layouts for the unspecified ops in a way that reduces the number of layout conversions. More details are shown below.

Motivation - demonstration of excess layout conversions

TL;DR: In the example kernel, the majority of the instructions run in the “blocked1” layout, but the final load and the store use the “blocked” layout. The layout conversions between “blocked” and “blocked1” appear to increase the latency of this kernel. Initially on A100, latency is 92us; after removing the layout conversions (with a hacky patch), latency is 67us.

The example kernel I’m testing with is linked here: https://gist.github.com/davidberard98/c0cc39f3a2324936abbfe5d8c98eba48 - the Triton kernel is shown inline below:

import triton
import triton.language as tl

# Launch: BLOCK_SIZE = 1024, num_warps = 4
# grid: 1D grid of triton.cdiv(approx. 1024*130*256, 1024)
@triton.jit
def dense_to_jagged_triton(
    in_ptr,
    offsets_ptr,
    inverse_offsets_ptr,
    out_ptr,
    JAGGED_TOTAL_LEN,
    MAX_SEQ_LEN,
    BLOCK_SIZE : tl.constexpr,
):
    idx = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = idx % 256
    y = idx // 256

    mask = y < JAGGED_TOTAL_LEN

    batch_load_ptrs = inverse_offsets_ptr + y
    batch_idx = tl.load(batch_load_ptrs, mask)

    seq_load_ptrs = offsets_ptr + batch_idx
    seq_idx = y - tl.load(seq_load_ptrs, mask)

    dense_mask = seq_idx < MAX_SEQ_LEN
    values = tl.load(in_ptr + x + seq_idx * 256 + batch_idx * 256 * MAX_SEQ_LEN, mask & dense_mask)
    masked_values = tl.where(dense_mask, values, 0.0)
    tl.store(out_ptr + x + y * 256, masked_values, mask)

Logically, this kernel does the following:

# R = JAGGED_TOTAL_LEN (output rows), C = 256 (columns per row)
for r in range(R):
    batch_idx = inverse_offsets[r]
    seq_idx = r - offsets[batch_idx]
    for c in range(C):
        out[r][c] = inp[batch_idx][seq_idx][c] if seq_idx < MAX_SEQ_LEN else 0
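
For checking the kernel on small inputs, the loop above can be turned into a runnable pure-Python reference. This is only a sketch: the function name `dense_to_jagged_reference` and the nested-list data representation are assumptions made for illustration and do not appear in the gist.

```python
def dense_to_jagged_reference(inp, offsets, inverse_offsets, max_seq_len):
    """Reference for the logical loop above (hypothetical helper).

    inp:             dense input as nested lists, indexed [batch][seq][col]
    offsets:         offsets[b] is the first jagged row belonging to batch b
    inverse_offsets: inverse_offsets[r] is the batch that owns jagged row r
    max_seq_len:     number of sequence positions stored per batch in inp
    """
    num_cols = len(inp[0][0])
    out = []
    for r, batch_idx in enumerate(inverse_offsets):
        seq_idx = r - offsets[batch_idx]
        if seq_idx < max_seq_len:
            # Row exists in the dense input: copy it.
            out.append(list(inp[batch_idx][seq_idx]))
        else:
            # Row was truncated away in the dense input: fill with zeros.
            out.append([0.0] * num_cols)
    return out


# Two batches with jagged lengths 2 and 3, but only 2 positions kept densely.
example_inp = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
example_out = dense_to_jagged_reference(example_inp, [0, 2], [0, 0, 1, 1, 1], 2)
print(example_out)
```

The last jagged row has seq_idx == 2, which is not below max_seq_len, so it is zero-filled; this is the same case that dense_mask handles in the kernel.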

The corresponding TTGIR shows some layout conversions:

#blocked = #triton_gpu.blocked<{sizePerThread = [8], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32, "triton_gpu.threads-per-warp" = 32 : i32} {
  tt.func public @dense_to_jagged_triton_0d1d2d3d4d5(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<i32> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg4: i32 {tt.divisibility = 16 : i32}, %arg5: i32) attributes {noinline = false} {
    %cst = arith.constant dense<256> : tensor<1024xi32, #blocked>
    %cst_0 = arith.constant dense<0.000000e+00> : tensor<1024xf32, #blocked1>
    %cst_1 = arith.constant dense<256> : tensor<1024xi32, #blocked1>
    %c1024_i32 = arith.constant 1024 : i32
    %0 = tt.get_program_id x : i32
    %1 = arith.muli %0, %c1024_i32 : i32
    %2 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked1>
    %3 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %4 = tt.splat %1 : (i32) -> tensor<1024xi32, #blocked1>
    %5 = tt.splat %1 : (i32) -> tensor<1024xi32, #blocked>
    %6 = arith.addi %4, %2 : tensor<1024xi32, #blocked1>
    %7 = arith.addi %5, %3 : tensor<1024xi32, #blocked>
    %8 = arith.remsi %6, %cst_1 : tensor<1024xi32, #blocked1>
    %9 = arith.remsi %7, %cst : tensor<1024xi32, #blocked>
    %10 = arith.divsi %6, %cst_1 : tensor<1024xi32, #blocked1>
    %11 = arith.divsi %7, %cst : tensor<1024xi32, #blocked>
    %12 = tt.splat %arg4 : (i32) -> tensor<1024xi32, #blocked1>
    %13 = tt.splat %arg4 : (i32) -> tensor<1024xi32, #blocked>
    %14 = "triton_gpu.cmpi"(%10, %12) <{predicate = 2 : i64}> : (tensor<1024xi32, #blocked1>, tensor<1024xi32, #blocked1>) -> tensor<1024xi1, #blocked1>
    %15 = "triton_gpu.cmpi"(%11, %13) <{predicate = 2 : i64}> : (tensor<1024xi32, #blocked>, tensor<1024xi32, #blocked>) -> tensor<1024xi1, #blocked>
    %16 = tt.splat %arg2 : (!tt.ptr<i32>) -> tensor<1024x!tt.ptr<i32>, #blocked1>
    %17 = tt.addptr %16, %10 : tensor<1024x!tt.ptr<i32>, #blocked1>, tensor<1024xi32, #blocked1>
    %18 = tt.load %17, %14 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xi32, #blocked1>
    %19 = tt.splat %arg1 : (!tt.ptr<i32>) -> tensor<1024x!tt.ptr<i32>, #blocked1>
    %20 = tt.addptr %19, %18 : tensor<1024x!tt.ptr<i32>, #blocked1>, tensor<1024xi32, #blocked1>
    %21 = tt.load %20, %14 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xi32, #blocked1>
    %22 = arith.subi %10, %21 : tensor<1024xi32, #blocked1>
    %23 = tt.splat %arg5 : (i32) -> tensor<1024xi32, #blocked1>
    %24 = "triton_gpu.cmpi"(%22, %23) <{predicate = 2 : i64}> : (tensor<1024xi32, #blocked1>, tensor<1024xi32, #blocked1>) -> tensor<1024xi1, #blocked1>
    %25 = tt.splat %arg0 : (!tt.ptr<f16>) -> tensor<1024x!tt.ptr<f16>, #blocked1>
    %26 = tt.addptr %25, %8 : tensor<1024x!tt.ptr<f16>, #blocked1>, tensor<1024xi32, #blocked1>
    %27 = arith.muli %22, %cst_1 : tensor<1024xi32, #blocked1>
    %28 = tt.addptr %26, %27 : tensor<1024x!tt.ptr<f16>, #blocked1>, tensor<1024xi32, #blocked1>
    %29 = arith.muli %18, %cst_1 : tensor<1024xi32, #blocked1>
    %30 = arith.muli %29, %23 : tensor<1024xi32, #blocked1>
    %31 = tt.addptr %28, %30 : tensor<1024x!tt.ptr<f16>, #blocked1>, tensor<1024xi32, #blocked1>
    %32 = arith.andi %14, %24 : tensor<1024xi1, #blocked1>
    %33 = triton_gpu.convert_layout %31 : (tensor<1024x!tt.ptr<f16>, #blocked1>) -> tensor<1024x!tt.ptr<f16>, #blocked>
    %34 = triton_gpu.convert_layout %32 : (tensor<1024xi1, #blocked1>) -> tensor<1024xi1, #blocked>
    %35 = tt.load %33, %34 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf16, #blocked>
    %36 = triton_gpu.convert_layout %35 : (tensor<1024xf16, #blocked>) -> tensor<1024xf16, #blocked1>
    %37 = arith.extf %36 : tensor<1024xf16, #blocked1> to tensor<1024xf32, #blocked1>
    %38 = "triton_gpu.select"(%24, %37, %cst_0) : (tensor<1024xi1, #blocked1>, tensor<1024xf32, #blocked1>, tensor<1024xf32, #blocked1>) -> tensor<1024xf32, #blocked1>
    %39 = tt.splat %arg3 : (!tt.ptr<f16>) -> tensor<1024x!tt.ptr<f16>, #blocked>
    %40 = tt.addptr %39, %9 : tensor<1024x!tt.ptr<f16>, #blocked>, tensor<1024xi32, #blocked>
    %41 = arith.muli %11, %cst : tensor<1024xi32, #blocked>
    %42 = tt.addptr %40, %41 : tensor<1024x!tt.ptr<f16>, #blocked>, tensor<1024xi32, #blocked>
    %43 = arith.truncf %38 : tensor<1024xf32, #blocked1> to tensor<1024xf16, #blocked1>
    %44 = triton_gpu.convert_layout %43 : (tensor<1024xf16, #blocked1>) -> tensor<1024xf16, #blocked>
    tt.store %42, %44, %15 {cache = 1 : i32, evict = 1 : i32} : tensor<1024xf16, #blocked>
    tt.return
  }
}

It contains 4 layout conversions:

- 2 to convert the inputs of a load (%33, %34) from “blocked1” to “blocked”
- 1 to convert the output of that load (%36) from “blocked” to “blocked1”
- 1 to convert the value being stored (%44) from “blocked1” to “blocked”
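
One way to make this count reproducible is to scan the TTGIR text for convert_layout ops. The snippet below is a minimal sketch: `count_layout_conversions` is a hypothetical helper, not part of Triton, and it assumes each conversion sits on a single line with the source layout named before the destination layout, as in the dump above.

```python
import re
from collections import Counter

# The four convert_layout lines copied from the TTGIR dump above.
SAMPLE_TTGIR = """
    %33 = triton_gpu.convert_layout %31 : (tensor<1024x!tt.ptr<f16>, #blocked1>) -> tensor<1024x!tt.ptr<f16>, #blocked>
    %34 = triton_gpu.convert_layout %32 : (tensor<1024xi1, #blocked1>) -> tensor<1024xi1, #blocked>
    %36 = triton_gpu.convert_layout %35 : (tensor<1024xf16, #blocked>) -> tensor<1024xf16, #blocked1>
    %44 = triton_gpu.convert_layout %43 : (tensor<1024xf16, #blocked1>) -> tensor<1024xf16, #blocked>
"""


def count_layout_conversions(ttgir_text):
    """Count convert_layout ops, keyed by (source layout, destination layout)."""
    counts = Counter()
    for line in ttgir_text.splitlines():
        if "triton_gpu.convert_layout" not in line:
            continue
        # Layout aliases look like "#blocked" / "#blocked1"; the first one on
        # the line is the source type, the last one is the result type.
        layouts = re.findall(r"#(\w+)", line)
        if len(layouts) >= 2:
            counts[(layouts[0], layouts[-1])] += 1
    return counts


conversions = count_layout_conversions(SAMPLE_TTGIR)
for (src, dst), n in sorted(conversions.items()):
    print(f"{src} -> {dst}: {n}")
print("total:", sum(conversions.values()))
```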

Note that the “blocked” layout is needed for the loads and stores in order to enable vectorization.
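
To see why only “blocked” allows vectorized accesses, it helps to list which elements a single thread owns under each layout. The sketch below uses a simplified model of a 1D blocked layout (an assumption for illustration, not Triton's actual implementation): threads take consecutive chunks of sizePerThread elements, and the pattern repeats once all threads of the CTA have been assigned a chunk. The helper names are hypothetical.

```python
def elements_owned_by_thread(thread_id, num_elements, size_per_thread,
                             threads_per_warp, warps_per_cta):
    """Indices owned by one thread under the simplified 1D blocked model."""
    threads_per_cta = threads_per_warp * warps_per_cta
    return [i for i in range(num_elements)
            if (i // size_per_thread) % threads_per_cta == thread_id]


def longest_contiguous_run(indices):
    """Length of the longest run of consecutive indices in a sorted list."""
    best = run = 0
    prev = None
    for i in indices:
        run = run + 1 if prev is not None and i == prev + 1 else 1
        best = max(best, run)
        prev = i
    return best


# #blocked:  sizePerThread = [8], threadsPerWarp = [32], warpsPerCTA = [4]
blocked = elements_owned_by_thread(0, 1024, 8, 32, 4)
# #blocked1: sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4]
blocked1 = elements_owned_by_thread(0, 1024, 1, 32, 4)

print("blocked :", blocked, "run =", longest_contiguous_run(blocked))
print("blocked1:", blocked1, "run =", longest_contiguous_run(blocked1))
```

In this model each thread holds 8 of the 1024 elements under either layout, but only “blocked” makes them contiguous. For f16, 8 contiguous elements are 16 bytes, so they can in principle be covered by a single 128-bit access, whereas “blocked1” leaves each thread with 8 separate 2-byte accesses.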

In a patch (only applicable for this specific kernel), we tried using the “blocked” layout everywhere, which eliminates the need for layout conversions. This patch reduces latency from 92us to 67us (about 1.37x faster).

Proposed changes to coalescing & layout conversions

My understanding of coalescing and layout conversions:

1. During conversion to TritonGPU, a default blocked layout is assigned to most tensor values.
2. In the coalescing pass, certain ops which benefit from specific layouts (e.g. loads and stores which can be vectorized) are identified and assigned specific layouts. To achieve the desired layout, layout conversion ops are inserted.
3. There are other layout-related passes that occur later, e.g. related to dot products, and passes to reorder/remove unnecessary layout conversions.

Proposal: instead of assigning the default blocked layout to all other tensor values, we can first assign layouts for ops that require specific layouts (like loads and stores), and then choose the layouts for other ops in order to reduce/minimize the number of layout conversions. I haven't looked closely enough to have specific details, but one example for how to do this is described below:

During the coalesce pass:

1. Identify nodes that require certain layout properties; e.g. a load requires sizePerThread[0] >= 4
2. Out of all the requests, identify the most commonly requested layout patterns, possibly merging compatible ones (e.g. if [a] requests sizePerThread[0] >= 4 and [b] requests sizePerThread[0] >= 8, then choose a layout with sizePerThread[0] >= 8, which satisfies both).
3. Set the "default" to the most common layout pattern; assign this layout to all ops, except any identified in step 1 whose requirements it does not satisfy.
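
The three steps can be sketched on a toy dataflow graph loosely modeled on the example kernel. Everything here is hypothetical and heavily simplified: a layout is reduced to a single sizePerThread[0] value, the op names and edges are made up for illustration, and merging just takes the largest request. A real pass would also have to consider order, threadsPerWarp, and requests that conflict.

```python
def merge_requests(requests):
    """Step 2 (simplified): a layout with sizePerThread[0] >= 8 also satisfies
    a request for >= 4, so the largest request satisfies all of them."""
    return max(requests.values())


def assign_layouts(ops, requests, default):
    """Step 3: give every op the default, but never less than its own request."""
    return {op: max(default, requests.get(op, 0)) for op in ops}


def count_conversions(edges, layouts):
    """A producer -> consumer edge needs a conversion when the layouts differ."""
    return sum(1 for src, dst in edges if layouts[src] != layouts[dst])


# Toy graph: op names are illustrative, not taken from the compiler.
ops = ["ptr_math", "mask", "load_values", "select", "store"]
edges = [("ptr_math", "load_values"), ("mask", "load_values"),
         ("load_values", "select"), ("select", "store")]

# Step 1: ops that request specific layout properties (min sizePerThread[0]).
requests = {"load_values": 8, "store": 8}

current = assign_layouts(ops, requests, default=1)
proposed = assign_layouts(ops, requests, default=merge_requests(requests))

print("current scheme :", count_conversions(edges, current), "conversions")
print("proposed scheme:", count_conversions(edges, proposed), "conversions")
```

With a fixed default of 1, every edge into or out of the load and the store crosses a layout boundary. Deriving the default from the requests removes those boundaries in this toy case, which mirrors what the hacky patch does for the real kernel.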

HydraQYH · 2024-01-22T02:09:04Z


Hello, I recently ran into a similar problem. A layout conversion is inserted before tt.store; it sometimes causes bank conflicts, which hurts performance. How is this RFC going?
