-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ROCm Port #1087
ROCm Port #1087
Conversation
What does hipBLAS do? |
hipBLAS is just basically a wrapper around rocBLAS or cuBLAS. Well, all of HIP is supposed to be. |
Now HIP Clang is not required, the CMake scripts will configure the needed compiler, which can be system clang++. Also other code can still use GCC, but CMake will force the clang to link.
I have started moving all the cuda specific stuff to |
I'll try to rebase on your code. As for perf, it's about 38 ms for 7B, GPU is Vega64 |
Either the perplexity time per pass or the prompt eval times with a big prompt seems good enough to measure performance, that's what I have been doing anyway. Use |
GPU is used at about 30%, VRAM 2G |
I'm now building it in AMD's official Docker image and it is giving me double the performance... 🤯
|
@slaren can you check in Cuda, currently |
|
Thank you for the great work.
Llama 30B Q4_2 And my pc's running test with rocm suit 5.4.2 is below: Master 50cb666 Hipblas: Meanwhile maybe it's better to mention CXX also need to be changed to hipcc Peak vram usage about 1.4 G, while running perplexity is about 2 G. |
This --memory_f32 is Working with gfx1035 (HSA gfx1030) indeed the vega integrated gpu 680M More detail: I didn't set cxx=clang, but cxx=hipcc. Maybe that's the reason? |
I think the issue with |
Bonus picture, running on a Steam Desk with Steam OS. I have installed containerd so I don't have to install any ROCm stuff. To achieve this, the env var hipBLAS eval (plugged in 🔌): 49 ms per token. |
I was trying to make it work on HIP too (here is my fork https://github.com/DGdev91/llama.cpp) but i wasn't able to make it work, it was stuck after showing the "llama_model_load_internal" rows. Any idea on how can i try to figure out what is going on? |
@DGdev91 that means it is crashing when trying to initialize HIP or hipBLAS. What compiler did you use? The What is the GPU target that you used? Should be The CMake file seems to be just broken. EDIT: I forgot to mention, but when I managed to compile your code, it was running fine on the GPU 😃 |
You are right, but forget my fork, it was just an experiment. i already said i prefer your solution, and i had the same exact issue even there. |
I suspect it has something to do with the GPU architecture that is being built. My Makefile changes will detect the GPU of your system but that may not work if you're overriding it on the command line. On the Steam Deck I had to build it for one specific one (gfx1030) because that's the one rocBLAS supports. This is something that should happen automatically and not be on the user to fix. I need to figure it out. |
I compiled it with make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 and launched export HSA_OVERRIDE_GFX_VERSION=10.3.0 before launching main. There must be something else. |
Perplexity Testing for hipBLAS versionCodeCommit: 3a004b2a0166e412d8d54052c50bfd093611ad95 ModelsI should mention that the Q4_0 models were converted some time ago so I don't know if they are "fresh" with the latest quantization fixes.
HardwareCPU: Intel Core i7 7700K (4c/8t), 4.7 GHz (OC) Arch Linux testing with:OS: Arch Linux 6.2.11-arch1-1 AMD official Docker with this Dockerfile:rocm.DockerfileFROM rocm/dev-ubuntu-22.04
ARG GPU_TARGETS="gfx900"
ARG MAKE_JOBS=4
RUN apt-get update && \
apt-get --no-install-recommends install -y hipblas-dev
WORKDIR /app
COPY . ./
RUN make \
LLAMA_HIPBLAS=1 \
GPU_TARGETS="$GPU_TARGETS" \
-j $MAKE_JOBS \
main perplexity
STOPSIGNAL SIGKILL
ENV PATH="/app:$PATH"
CMD [ "main" ] Compile with: docker build -f ~/Desktop/rocm.Dockerfile . -t llama.cpp:rocm Results7B Q4_0, Arch: [655]6.2818
7B Q4_0 --memory_f32, Arch: [655]6.2838,
7B Q4_0, Docker: [655]6.2819,
7B Q4_0 --memory_f32, Docker: [655]6.2838,
7B F16, Docker: [655]5.9564,
7B Q4_1, Docker: [655]6.1290,
7B Q4_2, Docker: [655]6.2002,
7B Q4_3, Docker: [655]6.0619,
|
No. I don't have 70b q4 ready and there wouldn't be a point anyways since with 16 GB VRAM I would just be benchmarking the speed of the CPU. |
* use hipblas based on cublas * Update Makefile for the Cuda kernels * Expand arch list and make it overrideable * Fix multi GPU on multiple amd architectures with rocblas_initialize() (ggerganov#5) * add hipBLAS to README * new build arg LLAMA_CUDA_MMQ_Y * fix half2 decomposition * Add intrinsics polyfills for AMD * AMD assembly optimized __dp4a * Allow overriding CC_TURING * use "ROCm" instead of "CUDA" * ignore all build dirs * Add Dockerfiles * fix llama-bench * fix -nommq help for non CUDA/HIP --------- Co-authored-by: YellowRoseCx <[email protected]> Co-authored-by: ardfork <[email protected]> Co-authored-by: funnbot <[email protected]> Co-authored-by: Engininja2 <[email protected]> Co-authored-by: Kerfuffle <[email protected]> Co-authored-by: jammm <[email protected]> Co-authored-by: jdecourval <[email protected]>
I implemented mul_mat_q tunings for RDNA 2 (using my RX 6800): #2910 . Please check whether they are better/worse on other AMD GPUs. |
What tuning value would you recommend? Do you want us to check via regular use or by a perplexity test? |
The RDNA 2 tunings are currently being applied to all AMD GPUs. Just checking whether the PR is slower or faster than master is enough. |
While testing #2910, I did some newer benchmark on q4_K_M (on a 6700 XT):
Sadly, I forgot to measure VRAM usage. |
Would something like I'm sorry if this is the wrong fora! |
@ggerganov @SlyEcho I was able to compile the ROCm version successfully on Windows using the HIP SDK. Ran it successfully on 7900XTX. Not sure of the speed though. How do I check that? The command I used |
@SlyEcho will there be hipBLAS builds for Windows uploaded in the packages now? |
Unfortunately, I still can't get it to work on Windows. Compiling is not the problem, it worked. Unfortunately, it cannot be started or crashes after starting the server, as mentioned above. Seems like I must live with that as my 6650 XT has no official support for Windows yet. I just don't understand why it works under Linux with the 1030 overwrite, but not on Windows. |
I would like to test it with my card on Linux. How can i measure it like this? I'm new in this topic. |
Use the |
Updated CI example building llama-cpp-python for both Windows and Linux: The code for building libs should still be relevant if only building llama.cpp. It is curious that the 6650 XT doesn't work on Windows given that GPU is explicitly listed as supported in the runtime: |
The issue is that rocBLAS on Windows comes compiled & with tensile libs for gfx906, gfx1030, gfx1100, gfx1101, and gfx1102. There's no HSA_OVERRIDE_GFX_VERSION because it's running on top of PAL instead of HSA. PAL might have an equivalent, from reading its repo. So there might be a registry setting that could work but it may need recompiling part of the HIP SDK anyways. |
There may come a time when rocBLAS is not needed, then it would work. |
Speaking of which, one of my next goals is to try and quantize the KV cache to q8_1. It will probably take some time but if that is done (and works) you could compile completely without cuBLAS/rocBLAS. |
Is there a timeline for this? I'd like to how how many users here are using navi22 and navi23. If it's worthwhile to push for hipBLAS to support it in its precompiled form until the hipBLAS dependency is removed, I can at least request for it internally, no guarantees though. So navi22 and navi23 users, feel free to use the rocket emoji. Also if you're an APU user using phoenix, use the hooray emoji. |
I can't give a serious ETA because there are too many uncertainties. It will be done when it's done. |
I looked at doing this (for other reasons, like making the prompt caches smaller or reducing VRAM usage) in the past. Seems like it'll require making a number of operations that currently only work on 32bit tensors support quantized ones also. Another nice side benefit may be making it easier to support other models that could benefit from using those ops on quantized tensors. |
The following was run on Windows using the HIP SDK: Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0
|
Is anyone having issues compiling the hipBLAS backend with the cmakelists.txt file on Windows after the ggml-cuda was broken up into different files in its own folder? |
Currently I can say that for regular users the CLBlast version is much easier to run. If you want the most performance, though, HIP is for you.
Remember to tweak the new settings
LLAMA_CUDA_DMMV_X
andLLAMA_CUDA_MMV_Y
,LLAMA_CUDA_KQUANTS_ITER
I get the best result with 128, 8 and 1, for example.
Note for unsupported GPU users:
You need to use an environment variable to force ROCm to run.
You can ckeck this resource: ROCm supported-gpu-list
export HSA_OVERRIDE_GFX_VERSION=10.3.0
This will make it work in the currently running shell, after that ./main and other llama.cpp commands will run.
rocBLAS is only released for a limited number of GPUs: gfx900 gfx906 gfx908 gfx90a gfx1030 (depends on ROCm version, etc).
If you look in /opt/rocm/lib/rocblas/library/ you should see a lot of files, but only for some GPUs, for others you need to find something that is close enough, like gfx1030 instead of gfx1033, and then that becomes
10.3.0
for the environment variable.If you have multiple AMD devices:
If you have a GPU and APU then it may try to use wrong devices. There is an environment variable you can set to control the selected device:
export HIP_VISIBLE_DEVICES=0
ROCm port
I just define all the
cudaXxx
functions tohipXxx
etc. This may seem stupidly simple but it's exactly the same kind of trick AMD uses to make HIP code compile withnvcc
, you can see it in/opt/rocm/include/hip/nvidia_detail/nvidia_hip_runtime_api.h
(for some reason I can't find the source for this anywhere online but it has a free license, so if you want, I can post it).HIP can also compile the Cuda kernel programs without any major modifications, just some header stuff.
Compiling
To this, you need the ROCm developer kit and hipBLAS which may be a separate package.
With CMake I have to invoke:
It is probably unavoidable to use the LLVM Clang compiler You can use the ROCm included one or the system one, but mixing it with GCC objects is just asking for trouble.
Makefile should work, too, pass in
LLAMA_HIPBLAS=1
. You can use the env variableROCM_PATH
if ROCm is not installed at/opt/rocm
:Makefile will override the compilers to ROCm LLVM, so it should be a simple command to compile. But you should be able to override the compilers on the make command line.
Docker
Probably the best option right now is using Docker with AMD's images:
Save it somewhere as
rocm.Dockerfile
then in llama.cpp's source do:docker build -f /path/to/rocm.Dockerfile . -t llama.cpp:rocm
Then run it like this:
docker run --rm -it --init \ --device /dev/dri --device /dev/kfd \ -v/my/models:/models llama.cpp:rocm \ main -m /models/llama-7b-q4_2.bin -p "$(cat prompts/dan.txt)"
You can also add the override like this:
-e HSA_OVERRIDE_GFX_VERSION=10.3.0
and-e HIP_VISIBLE_DEVICES=0
as needed. There may be also some other security flags needed on some distros, and whatever permissions your user needs to have for the devices (usually groupvideo
).Using nerdctl, I had to add the DRI devices separately (
--device /dev/dri/card0 --device /dev/dri/renderD128
rather than the/dev/dri
directory like in Docker), it also works, but beware that on some buildkit setups it will load the whole image via tarballs and since it's several gigabytes it will take some time to build.All the commands are there besides main, you can also run
/bin/bash
for a dev shell, mount the llama.cpp source somewhere and use it for development. It is a bit of a thick image, for end users, maybe too big, I want to trim it down but the AMD stuff is bloated.What's up with the compilers?
Regarding hipcc, it is not really a compiler, I had a lot of problems with it, it couldn't compile and link .cpp and .o files together (like
hipcc main.cpp llama.o ggml.o ...
). If you open it in a text editor you see it's a Perl script and all it does is provide some default flags for the Clang compiler. It might work in CMake, since CMake always compiles to objects first.It shouldn't be a requirement to use AMD's version of Clang, it is possible to use any normal Clang or LLVM (maybe even Zig?) to compile the device code. In the CMake build I added a warning if the compiler is not Clang but it won't stop you from experimenting (well, it will probably fail to compile the .cu file).
If you use VS Code then the C/C++ plugin doesn't support HIP correctly, it sees in
compileCommands.json
(part of CMake's output) that the .cu file is using a language argument-x hip
and it doesn't know what that is, so the whole file is locked to the C language even if it's actually C++ and you'll see some red squiggles. This flag comes from thehip::device
package in CMake.In CMake it is harder to use different compilers in the same project (may need to use a subdirectory) than in Make, so currently the .cu file is handled as a C++ file and compiled with the rest of the C++ files, this is what AMD's vision is with HIP -- they should just be normal C++ files.
I also tried adding another language, HIP
enable_language(HIP)
, to CMake but I had some trouble getting the CMake to configure in all environments consistently, maybe it it needs some package that was missing in the container. In this case, it would work more similar to Cuda: I can define the .cu file's language to be HIP, whatever compiler configured for HIP compiles it and a compiler that can link it correctly will link it to an executable. When it was working on Arch, it configured it automatically like:CMAKE_CXX_COMPILER=/usr/bin/g++
andCMAKE_HIP_COMPILER=/usr/bin/clang++
and it was working correctly, using the HIP compliler to link in the end. This would be the ideal solution, it would give the user the most control over the config -- if I got it to work, that is 😜. If someone more experienced with this knows how to do it, please go ahead.For the Makefile I thought it would be easier to override the compilers, because it is supposed to be more beginner friendly and you can get a result in one command (that is if everything is installed properly). But it has some variables also.