Fix half2 with HIP #146
base: master
Conversation
This doesn't work for me. Still spouting gibberish with -fh2.
That sucks. I thought I had figured it out. I'm using a 5700 XT with rocBLAS and PyTorch 2.0.1 recompiled on ROCm 5.4.3, so it's possible my setup avoids some other problem tied to feature differences, like the dot product opcodes that GPU doesn't have. With __compat_h2rcp as it is in ROCm 5.6, I got gibberish from silu_mul_cuda_kernel with half2.
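(For context, a sketch of where h2rcp matters: a fused SiLU-multiply over half2 values needs the reciprocal of 1 + exp(-x), so a broken half2 reciprocal corrupts every activation. This is only an illustration, not the actual exllama kernel.)

// Hypothetical illustration, not the exllama kernel: SiLU(x) * y on packed halves.
__device__ __forceinline__ __half2 silu_mul_half2(__half2 x, __half2 y)
{
    const __half2 one = __float2half2_rn(1.0f);
    __half2 sig = h2rcp(__hadd2(one, h2exp(__hneg2(x))));  // 1 / (1 + exp(-x))
    return __hmul2(__hmul2(x, sig), y);                    // SiLU(x) * y
}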
I spent a bit more time testing your patch. The output seems a bit more coherent, or at least different, than without it. Without patch:
With patch:
As you can see, there are still some half2 problems in the fused MLP.
Do they actually matter, though? Maybe you could benchmark it with and without half2.
@ardfork can you try clearing the contents of the exllama_ext cache? In my case it was in ~/.cache/torch_extensions/py311_cpu/exllama_ext. I first ran exllama without the patch and with the cache deleted, then ran with the patch without deleting the cache and got gibberish. After clearing the cache it worked. With functions used to access the component halves of half2, undefining HIP_NO_HALF_CONVERSIONS doesn't seem to be necessary anymore.
That works now. Looks like the change in hip_compat.cuh wasn't taken into account when rebuilding. I wonder if adding a dummy change in q4_mlp to force it to be rebuilt would be useful (or finding another way to force a rebuild of it), since people are going to hit the same problem I did. As for speed, on my machine it is the same with or without half2, but it's nice to be able to use everything. It might be a good idea to raise an issue or a PR upstream to get that h2rcp fixed, as they already tried to fix it in 5.6. If there is no problem with using half2 and no performance drop on most HIP-supported platforms, I think we should revert #afc8b7cd, and also revert oobabooga/text-generation-webui@3c076c3 once it uses a version of exllama with your patch merged.
Changing it shouldn't be necessary, though. I raised an issue with hipamd for h2rcp(): ROCm/clr#8
Stop defaulting to --no_half2 with hip, and whitespace added to q4_mlp to help make sure it gets recompiled.
I went and added a newline to q4_mlp.cu anyway, in case someone using exllama downstream is using their own code for loading the extension. I removed HIP defaulting to --no_half2. I would assume that performance should be fine for anyone with an RDNA, CDNA, or Vega GPU. Anything from the RX 580 era lacks the packed math instructions, so half2 might be slower there depending on what the compiler does.
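(As an illustration, not code from this PR: the kind of packed fp16 arithmetic half2 enables looks like the snippet below. On RDNA/CDNA/Vega it compiles to packed v_pk_* instructions operating on both lanes at once; on older GCN parts like the RX 580 the compiler has to split it into separate scalar operations.)

// Hypothetical example, not from exllama: one call operates on two fp16 lanes.
__device__ __forceinline__ __half2 fma2(__half2 a, __half2 b, __half2 c)
{
    return __hfma2(a, b, c);  // packed fused multiply-add: a * b + c per lane
}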
They have moved hipamd to https://github.com/ROCm-Developer-Tools/clr. It might be a good idea to also file your issue on that new repo; I fear it might get forgotten on the old one.
@@ -8,8 +8,8 @@ __device__ __forceinline__ __half __compat_hrcp(__half x) {
 }

 __device__ __forceinline__ __half2 __compat_h2rcp(__half2 x) {
-    return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(x.x)),
-                      static_cast<_Float16>(__builtin_amdgcn_rcph(x.y))};
+    return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.x)),
+                      static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.y))};
Do you know where I can read about this __builtin_amdgcn_rcph?
The current version of CUDA lets you access the component halves of a half2 through half2.x and half2.y, but in HIP, x and y are unsigned shorts rather than half or _Float16. This replaces that member access with intrinsic functions.
With the changes I can use --force_half2 and still get sensible text with my AMD GPU.
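(Illustrative sketch, not the exact exllama code: the kind of intrinsic-based access the patch switches to. __low2half, __high2half, and __halves2half2 are part of the fp16 API in both CUDA and HIP, so the same source builds on both without undefining HIP_NO_HALF_CONVERSIONS.)

// Hypothetical helper showing intrinsic access instead of half2.x / half2.y,
// which are raw unsigned shorts under HIP.
__device__ __forceinline__ __half2 swap_halves(__half2 v)
{
    __half lo = __low2half(v);      // low 16-bit lane as a real __half
    __half hi = __high2half(v);     // high 16-bit lane as a real __half
    return __halves2half2(hi, lo);  // repack in swapped order
}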