Issue

Trying to compile some linalg IR with linalg.generic ops that are essentially matmuls, the default compiler flags don't result in the matmul-like ops getting lowered to mfma instructions for MI300 (even after trying out a few reasonable-sounding flags like iree-preprocessing-pad-to-intrinsics). After asking a few people, I was told to try the flag --iree-codegen-llvmgpu-test-tile-and-fuse-matmul, which did generate these instructions.
The option --debug-only=iree-llvmgpu-kernel-config (also suggested to me) provided some valuable information about what is going on, but this issue arose while working with other teams who are testing IREE out from a pip install (so I don't think that debug output is available to them).
I suppose the ask here is: can we turn on --iree-codegen-llvmgpu-test-tile-and-fuse-matmul by default?
Another ask: is it reasonable to emit a warning (thinking of non-codegen folks) that says "hey, this big matmul-like op isn't going down a high-performance path, try using *** to see what is going on"?
Generic IR
Here is an IR snippet I was trying to benchmark for MI300:
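The pattern is a matmul-like linalg.generic along these lines; the shapes and element types in this sketch are placeholders, not the exact ones from my benchmark:

```mlir
// Illustrative sketch only: a matmul-like linalg.generic with placeholder
// shapes and f16 inputs accumulating into f32; the op I actually benchmarked
// has the same structure but different sizes/dtypes.
#map_a = affine_map<(m, n, k) -> (m, k)>
#map_b = affine_map<(m, n, k) -> (k, n)>
#map_c = affine_map<(m, n, k) -> (m, n)>
func.func @matmul_like(%a: tensor<4096x4096xf16>, %b: tensor<4096x4096xf16>,
                       %c: tensor<4096x4096xf32>) -> tensor<4096x4096xf32> {
  %0 = linalg.generic {
      indexing_maps = [#map_a, #map_b, #map_c],
      iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%a, %b : tensor<4096x4096xf16>, tensor<4096x4096xf16>)
      outs(%c : tensor<4096x4096xf32>) {
  ^bb0(%in_a: f16, %in_b: f16, %acc: f32):
    // Widen the f16 operands and accumulate in f32, i.e. a plain matmul
    // written as a generic instead of a named op.
    %lhs = arith.extf %in_a : f16 to f32
    %rhs = arith.extf %in_b : f16 to f32
    %mul = arith.mulf %lhs, %rhs : f32
    %add = arith.addf %acc, %mul : f32
    linalg.yield %add : f32
  } -> tensor<4096x4096xf32>
  return %0 : tensor<4096x4096xf32>
}
```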
Benchmarking this gave poor performance, so (naively) I took a look at some compile dumps with --mlir-print-ir-after-all and noticed that this matmul was getting converted into vector.fma ops instead of amdgpu.mfma ops. When compiling with the iree-preprocessing-pad-to-intrinsics preprocessing pass, it does seem to pad the matmul, but I still get vector.fma instructions instead of amdgpu.mfma.

IR With Named Op

Here is some equivalent IR to the first:
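Again with the same placeholder shapes and element types as the sketch above (assumed, not the exact ones I ran), the named-op form is just a plain linalg.matmul:

```mlir
// Illustrative sketch only: named-op equivalent of the generic above,
// with the same placeholder shapes and mixed f16 -> f32 accumulation.
func.func @matmul_named(%a: tensor<4096x4096xf16>, %b: tensor<4096x4096xf16>,
                        %c: tensor<4096x4096xf32>) -> tensor<4096x4096xf32> {
  %0 = linalg.matmul ins(%a, %b : tensor<4096x4096xf16>, tensor<4096x4096xf16>)
                     outs(%c : tensor<4096x4096xf32>) -> tensor<4096x4096xf32>
  return %0 : tensor<4096x4096xf32>
}
```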
With no flags, this also lowers to vector.fma ops (probably expected if it doesn't match any intrinsics). With the pad-to-intrinsics preprocessing pass, however, it does lower to amdgpu.mfma and shows a significant performance improvement (around 10x faster than the generic version with the same flag). I'm not sure why the preprocessing pass works here but not with the generic op.
One thing to note is that --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-pad-to-intrinsics)' isn't working because the padding it introduces is being undone by iree-dispatch-creation-bubble-up-extract-slices.
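For context, the padding that pass introduces looks roughly like the sketch below (hand-written, with placeholder shapes; not actual pass output). The trailing tensor.extract_slice is what a pass that bubbles extract_slice ops up toward producers can pull back across the matmul, which effectively removes the padding again:

```mlir
// Hand-written sketch with placeholder shapes, not actual pass output:
// pad N up to a multiple of the intrinsic size, compute on the padded
// operands, then slice the original shape back out of the result.
func.func @padded_matmul(%a: tensor<4096x4096xf16>,
                         %b: tensor<4096x4090xf16>) -> tensor<4096x4090xf32> {
  %f16_0 = arith.constant 0.0 : f16
  // Pad the N dimension from 4090 to 4096 with zeros.
  %b_pad = tensor.pad %b low[0, 0] high[0, 6] {
  ^bb0(%i: index, %j: index):
    tensor.yield %f16_0 : f16
  } : tensor<4096x4090xf16> to tensor<4096x4096xf16>
  %f32_0 = arith.constant 0.0 : f32
  %init = tensor.empty() : tensor<4096x4096xf32>
  %fill = linalg.fill ins(%f32_0 : f32)
      outs(%init : tensor<4096x4096xf32>) -> tensor<4096x4096xf32>
  %mm = linalg.matmul ins(%a, %b_pad : tensor<4096x4096xf16>, tensor<4096x4096xf16>)
                      outs(%fill : tensor<4096x4096xf32>) -> tensor<4096x4096xf32>
  // Slice the original (unpadded) result shape back out; if this slice is
  // bubbled up across the matmul, the computation is back on unpadded shapes.
  %res = tensor.extract_slice %mm[0, 0] [4096, 4090] [1, 1]
      : tensor<4096x4096xf32> to tensor<4096x4090xf32>
  return %res : tensor<4096x4090xf32>
}
```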