-
Notifications
You must be signed in to change notification settings - Fork 11k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: subgroup size tuning #12087
base: master
Are you sure you want to change the base?
vulkan: subgroup size tuning #12087
Conversation
I haven't forgotten about this, I was just working on the int8 matmul. I'll get back to you here, soon. |
I tried doing device architecture detection in a better way in daniandtheweb#1, let me know what you think. |
c22e0b4
to
7176d04
Compare
I'll give this another try on RDNA2 soon, I expect that you can add RDNA2 to the RDNA1 logic. Once that is done I think we can merge this. Can you resolve the merge conflict? |
Sure, I'll do it later once I'm back home. |
7176d04
to
7695541
Compare
I was testing the change on RDNA3 on stable-diffusion.cpp and it seems like using subgroup size 32 on this architecture completely breaks image generation resulting in black images. Do you have any clue what could be causing this? Text generation works properly using wave 32 and there's even a small performance uplift in prompt processing (1164 master / 1238 subgroup 32). |
I don't know what it is, but to figure out which op is failing I implemented the |
For some reason all mul_mat and mul_mat_id fail on RDNA3 when using a subgroup size of 32 (I was in a rush before so I wasn't able to run the test-backend-ops) so I'll just avoid doing changes on this arch for now until I find a proper tuning for this card. |
That's a coopmat thing, I think. If you ignore coopmat shaders it's probably fine. Edit: Maybe disable setting the required subgroup size if |
Thanks for the advise, I still haven't looked at coopmat shaders so I didn't know how they would interact with these changes. With the new changes the tests pass and performance on RDNA3 is slightly better in prompt processing (1164 master / 1220 pr). Stable diffusion seems to get a minor improvement. For now I think it's safe to keep the same values as the ones on RDNA1 as the card seems to behave similarly when it comes to subgroup size. |
I checked and you can add RDNA2 to the list. It seems to prefer the same shader sizes. Afterwards we can merge this. |
This PR is a continuation of the tests done in #11826 about the subgroup size in vulkan and its effects in performance especially on RDNA cards. For now it includes a specific configuration that improves RDNA1 performance in almost all operations.