Performance of llama.cpp with Vulkan #10879

netrunnereve · 2024-12-18T03:56:09Z

netrunnereve
Dec 18, 2024
Collaborator

This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend in the past month and I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model, like the other thread, to keep things consistent, and we'll use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. The download URL is in the instructions below.

Instructions

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make
./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100 (add any extra options here)

Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models, compare backends, and so forth, but only valid runs will be placed on the scoreboard.

If multiple entries are posted for the same device, the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, and so forth, even if the chip is the same.
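
The selection rule can be sketched in a few lines of Python. This is only an illustration, not how the scoreboard is actually maintained: the `Entry` type and the `best_per_gpu` helper are hypothetical names, and the sample rows are copied from results posted in this thread (two Radeon Pro VII submissions and one RX 470 run).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    gpu: str      # device name as listed on the scoreboard
    pp512: float  # prompt processing, tokens/s
    tg128: float  # token generation, tokens/s
    commit: str   # llama.cpp build hash


def best_per_gpu(entries):
    """Keep, for each GPU, the submission with the highest tg128."""
    best = {}
    for e in entries:
        if e.gpu not in best or e.tg128 > best[e.gpu].tg128:
            best[e.gpu] = e
    # Sort the survivors by tg128, highest first, like the scoreboard.
    return sorted(best.values(), key=lambda e: e.tg128, reverse=True)


submissions = [
    Entry("AMD Radeon Pro VII", 312.02, 70.17, "0d52a69"),
    Entry("AMD Radeon Pro VII", 329.86, 75.22, "2739a71"),
    Entry("AMD RX 470", 161.47, 33.45, "4da69d1"),
]

for e in best_per_gpu(submissions):
    print(f"{e.gpu}\t{e.pp512}\t{e.tg128}\t{e.commit}")
```

Note that the pp512 value shown for a device is simply the one from the winning tg128 run, so it is not necessarily the best pp512 ever submitted for that device.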

Vulkan Scoreboard for Llama 2 7B, Q4_0 (sorted by tg128)

GPU	pp512 t/s	tg128 t/s	Commit	Comments
AMD Radeon RX 7900 XTX	2062.17 ± 5.36	143.99 ± 0.23	`53ff6b9`	Best single GPU result in multi GPU system
Nvidia RTX 3090	3301.47 ± 33.76	123.72 ± 0.14	`0d52a69`
Nvidia RTX 4070	3970.59 ± 12.83	93.87 ± 0.53	`9a48399`	coopmat2
AMD Radeon RX 6800 XT	863.03 ± 0.70	91.59 ± 0.40	`0d52a69`
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	`504af20`
AMD Radeon Pro VII	329.86 ± 0.80	75.22 ± 0.05	`2739a71`	Best of multiple submissions
AMD Radeon RX 5700 XT	439.42 ± 0.28	70.13 ± 0.05	`c05e8c9`
Nvidia RTX 3080	1706.07 ± 139.33	62.16 ± 1.98	`4da69d1`
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	`2739a71`
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	`091592d`
Intel Arc A770	314.24 ± 1.04	45.22 ± 0.25	`ba8a1f9`	Windows result, Linux is slower in pp512
Intel Arc B580	175.56 ± 2.65	44.12 ± 0.09	`9a48399`
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	`d79d8f3`
AMD RX 470	161.47 ± 0.43	33.45 ± 0.04	`4da69d1`
AMD FirePro W8100	137.10 ± 0.44	28.51 ± 0.12	`4da69d1`
Intel Arc A750	88.86 ± 0.14	27.57 ± 0.03	`8d59d91`
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	`914a82d`	Two GPU chips on one card
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	`53ff6b9`
Apple M2 Macbook Air	38.67 ± 0.03	11.07 ± 0.04	`017cc5f`	Asahi Linux
Intel i7-1185G7	42.02 ± 0.07	7.28 ± 0.24	`ff3fcab`

netrunnereve · 2024-12-18T03:58:41Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD FirePro W8100

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	pp512	137.10 ± 0.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	tg128	28.51 ± 0.12

0 replies

netrunnereve · 2024-12-18T04:00:36Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD RX 470

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	161.47 ± 0.43
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.45 ± 0.04

0 replies

max-krasnyansky · 2024-12-18T05:09:04Z

max-krasnyansky
Dec 18, 2024
Collaborator

Ubuntu 24.04, with Vulkan and CUDA installed from official APT packages.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	1706.07 ± 139.33
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	62.16 ± 1.98

build: 4da69d1 (4351)

vs CUDA on the same build/setup

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	4499.47 ± 60.66
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	131.01 ± 0.43

build: 4da69d1 (4351)

0 replies

hkbu-kennycheng · 2025-01-08T02:57:11Z

hkbu-kennycheng
Jan 8, 2025

Macbook Air M2 on Asahi Linux

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 (G14G B0) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	38.67 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	11.07 ± 0.04

build: 017cc5f

1 reply

ericcurtin Jan 14, 2025
Collaborator

For the record, I think the slowness here is on the Honeykrisp side rather than in llama.cpp.

hkbu-kennycheng · 2025-01-08T03:22:16Z

hkbu-kennycheng
Jan 8, 2025

Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	199.36 ± 7.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	18.77 ± 0.02

build: 53ff6b9

0 replies

hkbu-kennycheng · 2025-01-08T10:35:31Z

hkbu-kennycheng
Jan 8, 2025

ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1545.39 ± 6.58
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	88.12 ± 1.06

build: 53ff6b9

4 replies

0cc4m Jan 8, 2025
Collaborator

Cool setup! Could you also post the result of 1, 2 and 3 7900 XTX GPUs? You can use only the first GPU with export GGML_VK_VISIBLE_DEVICES=0, the first two with export GGML_VK_VISIBLE_DEVICES=0,1 and so on.
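
For anyone scripting this sweep, here is a minimal sketch that builds the `GGML_VK_VISIBLE_DEVICES` values for the first 1, 2 and 3 GPUs and prints the matching commands. The helper names, the `./build/bin/llama-bench` path and the model location are assumptions for illustration; nothing is executed unless you hand the pairs to `subprocess.run` yourself.

```python
import os
import shlex


def device_prefixes(n_devices):
    """Return "0", "0,1", ... covering the first 1..n_devices GPUs."""
    return [",".join(str(i) for i in range(k)) for k in range(1, n_devices + 1)]


def bench_invocations(n_devices, model, bench="./build/bin/llama-bench"):
    """Build (env, argv) pairs for each device subset; nothing is run here."""
    runs = []
    for devices in device_prefixes(n_devices):
        # Copy the current environment and restrict the visible Vulkan devices.
        env = dict(os.environ, GGML_VK_VISIBLE_DEVICES=devices)
        argv = [bench, "-m", model, "-ngl", "100"]
        runs.append((env, argv))
    return runs


for env, argv in bench_invocations(3, "llama-2-7b.Q4_0.gguf"):
    # Print the equivalent shell command for each subset.
    print(f"GGML_VK_VISIBLE_DEVICES={env['GGML_VK_VISIBLE_DEVICES']} {shlex.join(argv)}")
```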

hkbu-kennycheng Jan 8, 2025

env GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2022.59 ± 10.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.24 ± 0.30

env GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2039.24 ± 18.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	140.68 ± 2.09

env GGML_VK_VISIBLE_DEVICES=2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2062.17 ± 5.36
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	143.99 ± 0.23

env GGML_VK_VISIBLE_DEVICES=3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1997.04 ± 5.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.98 ± 1.73

env GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1668.19 ± 12.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	100.62 ± 0.66

env GGML_VK_VISIBLE_DEVICES=0,1,2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1566.38 ± 8.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	97.96 ± 1.13

env GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1484.04 ± 6.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.48 ± 0.63
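
To put the runs above side by side, here is a small sketch that expresses each multi-GPU result as a fraction of the single-GPU run on device 0. The numbers are copied from the tables above; the `relative_to_single` name is made up for illustration, and picking device 0 as the baseline is an arbitrary choice (the four cards differ slightly from each other).

```python
# GPU count -> (pp512 t/s, tg128 t/s), taken from the runs above.
results = {
    1: (2022.59, 136.24),  # GGML_VK_VISIBLE_DEVICES=0
    2: (1668.19, 100.62),  # 0,1
    3: (1566.38, 97.96),   # 0,1,2
    4: (1484.04, 91.48),   # 0,1,2,3
}


def relative_to_single(results):
    """Divide every result by the single-GPU result."""
    base_pp, base_tg = results[1]
    return {n: (pp / base_pp, tg / base_tg) for n, (pp, tg) in results.items()}


for n, (pp_rel, tg_rel) in relative_to_single(results).items():
    print(f"{n} GPU(s): pp512 {pp_rel:.0%}, tg128 {tg_rel:.0%} of the single-GPU run")
```

For a model that already fits on one card, adding GPUs lowers both numbers here: with four cards, tg128 is roughly two thirds of the single-GPU rate.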

netrunnereve Jan 8, 2025
Collaborator Author

For this multi GPU case, getting Vulkan to support pipeline parallelism (#6017) might help improve the prompt processing speed.

hkbu-kennycheng Jan 9, 2025

@netrunnereve I updated the commit id in all my results.

0cc4m · 2025-01-08T11:04:08Z

0cc4m
Jan 8, 2025
Collaborator

build: 0d52a69 (4439)

NVIDIA GeForce RTX 3090 (NVIDIA)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	3301.47 ± 33.76
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	123.72 ± 0.14

AMD Radeon RX 6800 XT (RADV NAVI21) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	863.03 ± 0.70
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.59 ± 0.40

AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	312.02 ± 0.97
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.17 ± 0.25

Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	95.52 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	44.49 ± 0.03

0 replies

0cc4m · 2025-01-08T11:08:46Z

0cc4m
Jan 8, 2025
Collaborator

@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release

2 replies

netrunnereve Jan 8, 2025
Collaborator Author

I've added -DCMAKE_BUILD_TYPE=Release to the post, but honestly I've always built without this flag for both Vulkan and CPU backends and never noticed a difference in performance. Having Release set might strip the debug symbols but it shouldn't affect the compiler optimizations.

My release numbers for the RX 470 are basically identical to the ones I posted earlier without the flag.

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	160.08 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.41 ± 0.15

0cc4m Jan 8, 2025
Collaborator

Maybe not in your case, but some other results are suspiciously low in tg (for example the RTX 3080)

qnixsynapse · 2025-01-09T02:41:52Z

qnixsynapse
Jan 9, 2025

Build: 8d59d91 (4450)
ggml_vulkan: 0 = Intel(R) Arc(tm) A750 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	88.86 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	27.57 ± 0.03

Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
Compared to SYCL:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	pp512	1616.11 ± 5.28
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	tg128	36.64 ± 0.05

edit: retested both with the default batch size.

7 replies

0cc4m Jan 10, 2025
Collaborator

Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

0cc4m Jan 10, 2025
Collaborator

> They do have vtune but it needs a third party kernel module to run which I don't like tbh.
>
> Also, I don't know whether it supports Vulkan apps or not. But it does seem to support opencl.

I put my A770 into a Windows PC and gave Intel GPA and vtune a shot: GPA just crashes most of the time, I couldn't get it to trace anything useful. vtune works, but does not support Vulkan. It just shows some high-level metrics in that case, not really useful sadly.

qnixsynapse Jan 11, 2025

> Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

I did build it with cmake with build type Release.

0cc4m Jan 11, 2025
Collaborator

In that case it's something else, cause it should be performing similarly to my A770. I suspect the mesa version, there was something in newer mesa versions that slowed down tg on Intel.

qnixsynapse Jan 11, 2025

A750 has 448 CUs, A770 has 512 CUs I think. Personally, I am not worried about tg. I am worried about pp here. The gemm batch quickly saturates my GPU.

0cc4m · 2025-01-09T15:32:01Z

0cc4m
Jan 9, 2025
Collaborator

Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.

build: 914a82d (4452)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	pp512	94.78 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	tg128	25.32 ± 0.02

1 reply

netrunnereve Jan 9, 2025
Collaborator Author

Very interesting, and looks like it's pretty close to the W8100 in tg despite being a dual GPU card. Your backend scales pretty well with layer splitting which is why I find it worthwhile to run my RX470 and W8100 together (I end up getting results that are close to the average of both cards).

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	main_gpu	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	pp512	147.84 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	tg128	30.77 ± 0.00
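
One way to see why the combined result lands near the average: with layer splitting, if the layers are divided evenly and the two cards work one after the other on each token, the time per token is the mean of the two per-token times, so the expected rate is the harmonic mean of the two solo rates. The even split and the lack of transfer overhead are assumptions here, not something llama.cpp guarantees, and `layer_split_estimate` is a hypothetical helper name.

```python
def layer_split_estimate(rates, fractions=None):
    """Estimate tokens/s when layers are split across devices that run in sequence.

    rates: tokens/s of each device when it runs the whole model alone.
    fractions: share of the layers on each device (defaults to an even split).
    """
    if fractions is None:
        fractions = [1.0 / len(rates)] * len(rates)
    # Each device spends (its share of the layers) / (its solo rate) seconds per token.
    seconds_per_token = sum(f / r for f, r in zip(fractions, rates))
    return 1.0 / seconds_per_token


w8100_tg, rx470_tg = 28.51, 33.45  # single-GPU tg128 results posted earlier
estimate = layer_split_estimate([w8100_tg, rx470_tg])
arithmetic_mean = (w8100_tg + rx470_tg) / 2

print(f"estimate {estimate:.2f} t/s, arithmetic mean {arithmetic_mean:.2f} t/s, measured 30.77 t/s")
```

Under those assumptions the estimate comes out within a few hundredths of the measured 30.77 t/s, slightly below the plain arithmetic mean.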

vkhodygo · 2025-01-10T12:21:36Z

vkhodygo
Jan 10, 2025

Latest Arch with Vulkan Instance Version: 1.4.303 on an i7-1185G7 laptop. The config is not completely stock: I had to deal with thermals ages ago to boost the performance, so it doesn't throttle.

For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason cmake doesn't want to clean everything):

kill -STOP -1

timeout 240s $COMMAND

kill -CONT -1

Vulkan only:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	42.02 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	7.28 ± 0.24

build: ff3fcab (4459)

Vulkan and OpenBLAS w/ default 4 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	42.05 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	7.35 ± 0.26

This bit seems to underutilise both GPU and CPU in real conditions based on top activities.

Vulkan and OpenBLAS w/ default 8 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	41.89 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	7.22 ± 0.20

2 replies

0cc4m Jan 10, 2025
Collaborator

Unless you reduce the number of GPU layers, threads and openblas/non-openblas is not gonna make any difference. Try it with ngl 0, then only prompt processing is accelerated using Vulkan, the rest runs on CPU. This is often a good setting for integrated GPUs.

vkhodygo Jan 10, 2025

That's something I didn't think about, with -ngl 0 it goes like this:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	30.51 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	9.87 ± 0.05

build: ba8a1f9 (4460)

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	32.11 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	9.49 ± 0.18

0cc4m · 2025-01-10T20:27:15Z

0cc4m
Jan 10, 2025
Collaborator

Intel ARC A770 on Windows:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	314.24 ± 1.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	45.22 ± 0.25

build: ba8a1f9 (4460)

0 replies

8XXD8 · 2025-01-11T12:48:55Z

8XXD8
Jan 11, 2025

Single GPU Vulkan

Radeon Instinct MI25

ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	54.69 ± 0.03

build: 2739a71 (4461)

Radeon PRO VII

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	329.86 ± 0.80
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	75.22 ± 0.05

build: 2739a71 (4461)

Multi GPU Vulkan

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	324.55 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.39 ± 0.09

build: 2739a71 (4461)

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	pp512	32.29 ± 0.04
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	tg128	4.75 ± 0.00

build: 2739a71 (4461)

Single GPU Rocm

Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	409.83 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	63.94 ± 0.06

build: 2739a71 (4461)

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1064.99 ± 1.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	87.45 ± 0.04

build: 2739a71 (4461)

Multi GPU Rocm

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1061.87 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	81.49 ± 0.41

build: 2739a71 (4461)

Layer split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	pp512	16.36 ± 0.02
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	tg128	6.43 ± 0.01

build: 2739a71 (4461)

Row split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	sm	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	pp512	30.86 ± 0.03
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	tg128	12.52 ± 0.21

build: 2739a71 (4461)

Single GPU speed is decent, but multi GPU trails Rocm by a wide margin, especially with large models due to the lack of row split.

1 reply

cb88 Jan 18, 2025

What is the power profile for this MI25? Mine is 110W but its running slower than yours on git from today.

daniandtheweb · 2025-01-12T01:48:51Z

daniandtheweb
Jan 12, 2025

AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
build: c05e8c9 (4462)

Vulkan:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.28
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.13 ± 0.05

HIP:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	354.17 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	67.55 ± 0.04

I also think it could be interesting adding the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).

Vulkan FA:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	214.48 ± 2.31
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	23.21 ± 0.08

HIP FA:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	314.17 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	62.02 ± 0.05

2 replies

0cc4m Jan 12, 2025
Collaborator

There is no Vulkan flash attention support (except with coopmat2 on very new nvidia drivers). What you're measuring here is a CPU fallback.

daniandtheweb Jan 12, 2025

I see, I was sure about the CPU fallback but didn't know there was no flash attention support at all.

FNsi · 2025-01-12T06:17:07Z

FNsi
Jan 12, 2025

I tried, but there was no output after 1 hour (OK, it might have been 40 minutes)...

Anyway, I ran llama-cli for a sample eval...

build: 4419 (46e3556e)

./llama-cli -m ~/storage/llama-2-7b.Q4_0.gguf -p "can u" -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G57 (Mali-G57) | uma: 1 | fp16: 1 | warp size: 16 | matrix cores: none
build: 4419 (46e3556e) with clang version 19.1.6 for aarch64-unknown-linux-android24

llama_perf_sampler_print:    sampling time =       3.31 ms /    24 runs   (    0.14 ms per token,  7242.00 tokens per second)
llama_perf_context_print:        load time =   28544.85 ms
llama_perf_context_print: prompt eval time =    3788.63 ms /     3 tokens ( 1262.88 ms per token,     0.79 tokens per second)
llama_perf_context_print:        eval time =   23248.44 ms /    20 runs   ( 1162.42 ms per token,     0.86 tokens per second)
llama_perf_context_print:       total time =   27591.65 ms /    23 tokens

Meanwhile OpenBLAS

llama_perf_sampler_print:    sampling time =       5.00 ms /    43 runs   (    0.12 ms per token,  8608.61 tokens per second)
llama_perf_context_print:        load time =   10871.74 ms
llama_perf_context_print: prompt eval time =    1228.38 ms /     3 tokens (  409.46 ms per token,     2.44 tokens per second)
llama_perf_context_print:        eval time =   17010.39 ms /    39 runs   (  436.16 ms per token,     2.29 tokens per second)
llama_perf_context_print:       total time =   18639.62 ms /    42 tokens

2 replies

netrunnereve Jan 12, 2025
Collaborator Author

Even at below 1t/s llama-bench shouldn't run for an hour. The support just isn't there atm for Vulkan on Android.

FNsi Jan 13, 2025

Truth is ...

(0.79 tokens per second),

3788.63 ms / 3 tokens

So it's not even... it's just slower...
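
The rates quoted here follow directly from the timing lines: tokens divided by elapsed seconds. A minimal sketch, assuming the `llama_perf_*` line layout shown in the Vulkan log earlier in this comment thread; the regular expression and the `rates_from_perf_log` name are illustrative and not part of llama.cpp.

```python
import re

# Matches e.g. "prompt eval time =    3788.63 ms /     3 tokens".
PERF_LINE = re.compile(r":\s*([a-z ]+?) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)")


def rates_from_perf_log(log):
    """Map each timed phase to tokens (or runs) per second."""
    rates = {}
    for line in log.splitlines():
        m = PERF_LINE.search(line)
        if m:  # lines without a "/ N tokens" part (load time) are skipped
            phase, ms, count = m.group(1), float(m.group(2)), int(m.group(3))
            rates[phase] = count / (ms / 1000.0)
    return rates


vulkan_log = """\
llama_perf_context_print:        load time =   28544.85 ms
llama_perf_context_print: prompt eval time =    3788.63 ms /     3 tokens ( 1262.88 ms per token,     0.79 tokens per second)
llama_perf_context_print:        eval time =   23248.44 ms /    20 runs   ( 1162.42 ms per token,     0.86 tokens per second)
"""

for phase, rate in rates_from_perf_log(vulkan_log).items():
    print(f"{phase}: {rate:.2f} tokens per second")
```

This reproduces the 0.79 and 0.86 tokens per second printed in the log, so the figures are consistent with the raw timings rather than a reporting glitch.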

hypengw · 2025-01-12T15:32:49Z

hypengw
Jan 12, 2025

Intel ARC B580 on Linux:
kernel: 6.12.8
mesa: 24.3.3

ggml_vulkan: 0 = Intel(R) Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	175.56 ± 2.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	44.12 ± 0.09

build: 9a48399 (4465)

0 replies

hkbu-kennycheng · 2025-01-12T17:22:00Z

hkbu-kennycheng
Jan 12, 2025

It's a Nintendo Switch (2017). Since llama-2-7b.Q4_0.gguf ran out of memory (OOM), I tried a smaller model.

./build/bin/llama-bench -m ~/.cache/llama.cpp/bartowski_Dolphin3.0-Llama3.2-3B-GGUF_Dolphin3.0-Llama3.2-3B-Q4_1.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA Tegra X1 (nvgpu) (NVIDIA) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 3B Q4_1	1.94 GiB	3.21 B	Vulkan	100	pp512	40.79 ± 1.77
llama 3B Q4_1	1.94 GiB	3.21 B	Vulkan	100	tg128	4.87 ± 0.00

build: 9a48399 (1)

1 reply

hkbu-kennycheng Jan 14, 2025

Meanwhile with CUDA

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no

model	size	params	backend	ngl	test	t/s
llama 3B Q4_1	1.94 GiB	3.21 B	CUDA	100	pp512	42.72 ± 1.07
llama 3B Q4_1	1.94 GiB	3.21 B	CUDA	100	tg128	5.55 ± 0.01

build: 9a48399 (1)

cb88 · 2025-01-13T21:20:18Z

cb88
Jan 13, 2025

EPYC 7352 (24 cores), 64GB dual channel, 2x MI60, Arch Linux, Mesa 24.3.3 (RADV)

[cb88@M31-AR0 bin]$ ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	365.61 ± 3.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	45.46 ± 0.06

[cb88@M31-AR0 bin]$ export GGML_VK_VISIBLE_DEVICES=0
[cb88@M31-AR0 bin]$ ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	369.26 ± 2.48
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	78.16 ± 1.40

build: 504af20 (4476)

ROCm 6.2.4 only build for reference 2x MI60
[cb88@M31-AR0 bin]$ ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1298.05 ± 0.80
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	74.98 ± 0.09

ROCm 6.2.4 gfx900 2x MI60
[cb88@M31-AR0 bin]$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	495.29 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	76.54 ± 0.43

ROCm 6.2.4 gfx900 force 1 device
[cb88@M31-AR0 bin]$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	501.64 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	88.47 ± 0.06

ROCm 6.2.4 gfx906 force 1 device (same results with -sm none -mg 0)
[cb88@M31-AR0 bin]$ ./llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1301.68 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	90.17 ± 0.04

MI25 110W in same system ROCM 6.2

model	size	params	backend	ngl	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	none	pp512	325.47 ± 0.60
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	none	tg128	52.30 ± 0.18

MI25 110W Vulkan Radv 24.3.3

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	none	pp512	348.77 ± 1.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	none	tg128	47.29 ± 0.21

0 replies

TimCabbage · 2025-01-13T22:29:40Z

TimCabbage
Jan 13, 2025

AMD Ryzen 5 8645HS w/ Radeon 760M Graphics 4.30 GHz
Nvidia 4050 mobile

Vulkan:
$ llama-bench -ngl 100 -m ../models/llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	pp512	1154.28 ± 15.76
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	tg128	41.89 ± 0.10

build: d79d8f3 (4393)

CUDA:
$ llama-bench -ngl 100 -m ../models/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4050 Laptop GPU, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	100	pp512	1725.85 ± 17.85
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	100	tg128	43.72 ± 0.41

build: d79d8f3 (4393)

0 replies

ericcurtin · 2025-01-14T10:23:49Z

ericcurtin
Jan 14, 2025
Collaborator

How does Kompute compare? I know @slp is using the Kompute backend as an alternative for podman machine.

1 reply

0cc4m Jan 14, 2025
Collaborator

I posted a benchmark here: #11217 (comment)

jimkberry · 2025-01-14T21:19:31Z

jimkberry
Jan 14, 2025

Nothing overly exciting, but a GPU that wasn't listed yet. Win11

ggml_vulkan: 0 = AMD Radeon RX 6600 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	pp512	574.65 ± 0.86
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	tg128	53.92 ± 0.11

build: 091592d (4481)

0 replies

jeffbolznv · 2025-01-15T15:27:48Z

jeffbolznv
Jan 15, 2025
Collaborator

RTX 4070 with Vulkan Developer Driver 553.51 (includes NV_coopmat2 support). llama.cpp commit 9a48399.

Default settings:

llama-bench.exe -m c:\models\llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders....................................................................................Done!
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |      3970.59 ± 12.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         93.87 ± 0.53 |

build: 9a483999 (4465)

With flash attention (-fa 1):

llama-bench.exe -m c:\models\llama-2-7b.Q4_0.gguf -ngl 100 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders....................................................................................Done!
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |         pp512 |      4293.57 ± 27.70 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |         tg128 |         91.49 ± 0.89 |

With KHR_coopmat instead (GGML_VK_DISABLE_COOPMAT2=1):

llama-bench.exe -m c:\models\llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders...................................................Done!
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |      3179.37 ± 46.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         92.29 ± 0.28 |
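
The console output above mixes the markdown table with a shader-compilation progress line, so a quick comparison needs a parser that skips non-table lines. The following Python sketch does that and computes the NV_coopmat2 speedup over KHR_coopmat. The `parse_pipe_table` helper and the abbreviated table strings are assumptions made for illustration; the throughput values are the ones posted above.

```python
def parse_pipe_table(console_text):
    """Return {test: mean tokens/s} from a llama-bench markdown table."""
    results = {}
    for line in console_text.splitlines():
        if not line.startswith("|"):
            continue  # e.g. the "ggml_vulkan: Compiling shaders..." line
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        test, tps = cells[-2], cells[-1]
        if "±" not in tps:
            continue  # header and separator rows carry no measurement
        results[test] = float(tps.split("±")[0])
    return results

# Abbreviated copies of the default (NV_coopmat2) and KHR_coopmat runs above.
COOPMAT2 = """
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
ggml_vulkan: Compiling shaders....Done!
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 3970.59 ± 12.83 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 93.87 ± 0.53 |
"""
KHR_COOPMAT = """
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
ggml_vulkan: Compiling shaders....Done!
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 3179.37 ± 46.16 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 92.29 ± 0.28 |
"""

coopmat2 = parse_pipe_table(COOPMAT2)
khr = parse_pipe_table(KHR_COOPMAT)
speedup = {test: coopmat2[test] / khr[test] for test in coopmat2}
for test, factor in speedup.items():
    print(f"{test}: coopmat2 is {factor:.2f}x KHR_coopmat")
```

For this run the gain is concentrated in prompt processing (about 1.25x), while token generation changes by under 2%.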

1 reply

0cc4m Jan 16, 2025
Collaborator

Here's RTX 3090 with Vulkan Developer Driver 550.40.82 and coopmat2 support, llama.cpp commit 4dbc8b9

Default:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	4368.74 ± 48.23
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	118.50 ± 0.09

With flash attention:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	4761.91 ± 5.84
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	107.32 ± 0.34

daniandtheweb · 2025-01-15T19:08:22Z

daniandtheweb
Jan 15, 2025

AMD Radeon RX 5700 XT on Windows 11 (latest AMD drivers).
build: 1d85043 (4488)

Vulkan Windows 11:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	pp512	609.57 ± 0.35
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,RPC	100	tg128	67.53 ± 0.08

Here are the Linux results for the proprietary driver as well.

Vulkan AMDGPU-PRO, Arch Linux:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	605.40 ± 1.16
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	69.14 ± 0.03
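
The opening post says that when a device has several submissions, the one with the highest tg128 score is used. Applied to the two RX 5700 XT runs in this post only, that rule can be sketched in Python as below. The `submissions` list layout and the `best_per_gpu` helper are hypothetical names for illustration, and the sketch does not reflect how the scoreboard is actually maintained.

```python
# The two runs from this post; the field names are illustrative.
submissions = [
    {"gpu": "AMD Radeon RX 5700 XT", "setup": "Windows 11, AMD proprietary driver",
     "pp512": 609.57, "tg128": 67.53},
    {"gpu": "AMD Radeon RX 5700 XT", "setup": "Arch Linux, AMDGPU-PRO",
     "pp512": 605.40, "tg128": 69.14},
]

def best_per_gpu(entries):
    """Keep, for each GPU, the entry with the highest tg128 score."""
    best = {}
    for entry in entries:
        current = best.get(entry["gpu"])
        if current is None or entry["tg128"] > current["tg128"]:
            best[entry["gpu"]] = entry
    return best

best = best_per_gpu(submissions)
for gpu, entry in best.items():
    print(f"{gpu}: tg128 {entry['tg128']} t/s ({entry['setup']})")
```

Between these two runs the Linux result would be kept, even though the Windows run has the slightly higher pp512 score.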

0 replies

vkhodygo · 2025-01-18T14:54:52Z

vkhodygo
Jan 18, 2025

Venerable T480 with i5-8350U on board:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 620 (KBL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	25.28 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	3.23 ± 0.00

build: f26c874 (4505)

Without extra tricks I'm afraid this is the absolute limit.

0 replies

zaps166 · 2025-01-18T15:14:27Z

zaps166
Jan 18, 2025

Linux 6.12.9, Radeon RX 6900 XT, build 44e18ef (4503), pp_power_profile_mode set to VR, 2489 MHz GPU clock

AMDGPU-PRO (6.3.0)

ggml_vulkan: 0 = AMD Radeon RX 6900 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1257.98 ± 1.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	101.42 ± 0.02

Mesa RADV 23.3.3

ggml_vulkan: 0 = AMD Radeon RX 6900 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1018.94 ± 2.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	95.70 ± 0.01

AMD HIP (6.2.4) for comparison

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1963.96 ± 2.14
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	85.48 ± 0.17

0 replies

Replies: 25 comments · 25 replies
