[GPU] Generic matmul-like not lowering to MFMA by default #19864

Open
zjgarvey opened this issue Jan 31, 2025 · 1 comment

Comments

@zjgarvey
Contributor

Issue

When compiling some linalg IR containing linalg.generic ops that are essentially matmuls, the default compiler flags do not result in those matmul-like ops getting lowered to mfma instructions for mi300 (even after trying a few reasonable-sounding flags like iree-preprocessing-pad-to-intrinsics). After asking a few people, I was told to try the flag --iree-codegen-llvmgpu-test-tile-and-fuse-matmul, which did generate these instructions.
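For concreteness, the invocation that did produce mfma instructions was (as far as I can tell; I'm assuming the flag is simply appended to the same command used in the "Trying" section below):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-codegen-llvmgpu-test-tile-and-fuse-matmul matmul.mlir -o matmul.vmfb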

The option --debug-only=iree-llvmgpu-kernel-config (also suggested to me) provided some valuable information about what is going on, but this issue came up while working with other teams who are testing IREE from a pip install (so I don't think debug output is available to them).
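For anyone with a source build, I believe the debug output comes from adding that option to the compile invocation (this requires a compiler built with assertions/debug enabled, which is why it doesn't help the pip-install case):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --debug-only=iree-llvmgpu-kernel-config matmul.mlir -o matmul.vmfb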

I suppose the ask here is: can we turn on --iree-codegen-llvmgpu-test-tile-and-fuse-matmul by default?

Another ask: is it reasonable to emit a warning (thinking of non-codegen folks) that says something like "hey, this big matmul-like op isn't going down a high-performance path; try using *** to see what is going on"?

Generic IR

Here is an IR snippet I was trying to benchmark for MI300:

#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @matmul(%arg0: tensor<7x7x512xf32>, %arg1: tensor<512x2048xf32>) -> tensor<7x7x2048xf32> {
    %collapsed = tensor.collapse_shape %arg0 [[0, 1], [2]] : tensor<7x7x512xf32> into tensor<49x512xf32>
    %0 = tensor.empty() : tensor<49x2048xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed, %arg1 : tensor<49x512xf32>, tensor<512x2048xf32>) outs(%0 : tensor<49x2048xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %2 = arith.mulf %in, %in_0 : f32
      %3 = arith.addf %out, %2 : f32
      linalg.yield %3 : f32
    } -> tensor<49x2048xf32>
    %expanded = tensor.expand_shape %1 [[0, 1], [2]] output_shape [7, 7, 2048] : tensor<49x2048xf32> into tensor<7x7x2048xf32>
    return %expanded : tensor<7x7x2048xf32>
  }
}

Trying

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 matmul.mlir -o matmul.vmfb

Then

iree-benchmark-module --module=matmul.vmfb --function=matmul --input=7x7x512xf32 --input=512x2048xf32

This gave poor performance, so (naively) I took a look at some compile dumps with --mlir-print-ir-after-all and noticed that this matmul was getting converted into vector.fma ops instead of amdgpu.mfma ops.
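A quick way to check which lowering was taken (a rough sketch; --mlir-print-ir-after-all writes the dump to stderr) is to capture the dump and grep for the two ops:

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --mlir-print-ir-after-all matmul.mlir -o matmul.vmfb 2> dump.mlir
grep -c amdgpu.mfma dump.mlir
grep -c vector.fma dump.mlir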

When trying

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' matmul.mlir -o matmul.vmfb

it does seem to pad the matmul, but the generated code still uses vector.fma instructions instead of amdgpu.mfma.

IR With Named Op

Here is some equivalent IR to the first:

func.func @matmul(%arg0: tensor<7x7x512xf32>, %arg1: tensor<512x2048xf32>) -> tensor<7x7x2048xf32> {
    %collapsed = tensor.collapse_shape %arg0 [[0,1],[2]] : tensor<7x7x512xf32> into tensor<49x512xf32>
    %empty = tensor.empty() : tensor<49x2048xf32>
    %0 = linalg.matmul ins(%collapsed, %arg1 : tensor<49x512xf32>, tensor<512x2048xf32>) outs(%empty: tensor<49x2048xf32>) -> tensor<49x2048xf32>
    %expand = tensor.expand_shape %0 [[0, 1], [2]] output_shape [7, 7, 2048] : tensor<49x2048xf32> into tensor<7x7x2048xf32>
    return %expand : tensor<7x7x2048xf32>
}

With no flags, this also lowers to vector.fma ops (probably expected if it doesn't match any intrinsics). With the pad-to-intrinsics preprocessing pass, however, it does lower to amdgpu.mfma and shows a significant performance improvement (around 10x faster than the generic version with the same flag). I'm not sure why the preprocessing pass works here and not with the generic op.

@IanWood1
Contributor

One thing to note is that --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' isn't working because the padding it introduces is being undone by iree-dispatch-creation-bubble-up-extract-slices.
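One way to confirm that (a sketch; this assumes the standard MLIR IR-printing flags are available and uses the pass name above) is to print the IR right after that pass and check whether the padding on the matmul operands is still present:

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' --mlir-print-ir-after=iree-dispatch-creation-bubble-up-extract-slices matmul.mlir -o matmul.vmfb 2> after-bubble-up.mlir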
