[GPU] Generic matmul-like not lowering to MFMA by default #19864

Open
zjgarvey opened this issue Jan 31, 2025 · 1 comment

Comments

@zjgarvey
Contributor

Issue

When compiling some linalg IR containing linalg.generic ops that are essentially matmuls, the default compiler flags do not result in those matmul-like ops getting lowered to mfma instructions for mi300 (even after trying a few reasonable-sounding flags like iree-preprocessing-pad-to-intrinsics). After asking a few people, I was told to try the flag --iree-codegen-llvmgpu-test-tile-and-fuse-matmul, which did generate these instructions.
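For concreteness, the invocation that did produce mfma instructions was (as far as I can tell; I'm assuming the flag is simply appended to the same command used in the "Trying" section below):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-codegen-llvmgpu-test-tile-and-fuse-matmul matmul.mlir -o matmul.vmfb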

The option --debug-only=iree-llvmgpu-kernel-config (also suggested to me) provided some valuable information about what is going on, but this issue came up while working with other teams who are testing IREE from a pip install (so I don't think debug output is available to them).
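For anyone with a source build, I believe the debug output comes from adding that option to the compile invocation (this requires a compiler built with assertions/debug enabled, which is why it doesn't help the pip-install case):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --debug-only=iree-llvmgpu-kernel-config matmul.mlir -o matmul.vmfb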

I suppose the ask here is: can we turn on --iree-codegen-llvmgpu-test-tile-and-fuse-matmul by default?

Another ask: is it reasonable to emit a warning (thinking of non-codegen folks) that says something like "hey, this big matmul-like op isn't going down a high-performance path; try using *** to see what is going on"?

Generic IR

Here is an IR snippet I was trying to benchmark for MI300:

#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @matmul(%arg0: tensor<7x7x512xf32>, %arg1: tensor<512x2048xf32>) -> tensor<7x7x2048xf32> {
    %collapsed = tensor.collapse_shape %arg0 [[0, 1], [2]] : tensor<7x7x512xf32> into tensor<49x512xf32>
    %0 = tensor.empty() : tensor<49x2048xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed, %arg1 : tensor<49x512xf32>, tensor<512x2048xf32>) outs(%0 : tensor<49x2048xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %2 = arith.mulf %in, %in_0 : f32
      %3 = arith.addf %out, %2 : f32
      linalg.yield %3 : f32
    } -> tensor<49x2048xf32>
    %expanded = tensor.expand_shape %1 [[0, 1], [2]] output_shape [7, 7, 2048] : tensor<49x2048xf32> into tensor<7x7x2048xf32>
    return %expanded : tensor<7x7x2048xf32>
  }
}

Trying

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 matmul.mlir -o matmul.vmfb

Then

iree-benchmark-module --module=matmul.vmfb --function=matmul --input=7x7x512xf32 --input=512x2048xf32

This gave poor performance, so (naively) I took a look at some compile dumps with --mlir-print-ir-after-all and noticed that this matmul was getting converted into vector.fma ops instead of amdgpu.mfma ops.
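A quick way to check which lowering was taken (a rough sketch; --mlir-print-ir-after-all writes the dump to stderr) is to capture the dump and grep for the two ops:

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --mlir-print-ir-after-all matmul.mlir -o matmul.vmfb 2> dump.mlir
grep -c amdgpu.mfma dump.mlir
grep -c vector.fma dump.mlir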

When trying

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' matmul.mlir -o matmul.vmfb

it does seem to pad the matmul, but the generated code still uses vector.fma instructions instead of amdgpu.mfma.

IR With Named Op

Here is some equivalent IR to the first:

func.func @matmul(%arg0: tensor<7x7x512xf32>, %arg1: tensor<512x2048xf32>) -> tensor<7x7x2048xf32> {
    %collapsed = tensor.collapse_shape %arg0 [[0,1],[2]] : tensor<7x7x512xf32> into tensor<49x512xf32>
    %empty = tensor.empty() : tensor<49x2048xf32>
    %0 = linalg.matmul ins(%collapsed, %arg1 : tensor<49x512xf32>, tensor<512x2048xf32>) outs(%empty: tensor<49x2048xf32>) -> tensor<49x2048xf32>
    %expand = tensor.expand_shape %0 [[0, 1], [2]] output_shape [7, 7, 2048] : tensor<49x2048xf32> into tensor<7x7x2048xf32>
    return %expand : tensor<7x7x2048xf32>
}

With no flags, this also lowers to vector.fma ops (probably expected if it doesn't match any intrinsics). With the pad-to-intrinsics preprocessing pass, however, it does lower to amdgpu.mfma and shows a significant performance improvement (around 10x faster than the generic version with the same flag). I'm not sure why the preprocessing pass works here and not with the generic op.

@IanWood1
Contributor

One thing to note is that --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' isn't working because the padding it introduces is being undone by iree-dispatch-creation-bubble-up-extract-slices.
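One way to confirm that (a sketch; this assumes the standard MLIR IR-printing flags are available and uses the pass name above) is to print the IR right after that pass and check whether the padding on the matmul operands is still present:

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' --mlir-print-ir-after=iree-dispatch-creation-bubble-up-extract-slices matmul.mlir -o matmul.vmfb 2> after-bubble-up.mlir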
