Description
Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 4929 (3d82dbc)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 7 6800H + RTX 3050 Ti Laptop
Models
gemma-2-9b-it Q4_K_S
gemma-3-12b-it Q4_K_S
Self-built GGUFs
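The GGUFs were built locally. In case the quantization step matters, below is a minimal sketch of the usual convert-then-quantize flow; the model path and output filenames are placeholders, not my exact commands:
$ python convert_hf_to_gguf.py /path/to/gemma-2-9b-it --outfile gemma-2-9b-it.f16.gguf
$ build/bin/llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q4_K_S.gguf Q4_K_S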
Problem description & steps to reproduce
I'm seeing very slow prompt processing speeds on Gemma 2 and 3 models at Q4_K_S. This appears to be a regression introduced by a recent commit.
| | Prompt processing | Generation |
|---|---|---|
| Before | 420.10 T/s | 6.44 T/s |
| After | 37.07 T/s | 6.54 T/s |
The same models at IQ4_XS still run at full speed. I have not tried other models yet.
Running llama-server with:
$ build/bin/llama-server -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -c 8192 -ngl 14 -sp
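If it helps with reproducing the numbers, the regression should also show up in llama-bench; this is only a sketch, and the prompt/generation sizes here are arbitrary rather than the ones behind the table above:
$ build/bin/llama-bench -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -ngl 14 -p 512 -n 64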
First Bad Commit
Git bisect: 3d82dbc
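The bisect followed the standard flow, rebuilding and re-checking prompt processing speed at each step; the good commit below is a placeholder for the last build I knew to be fast:
$ git bisect start
$ git bisect bad HEAD
$ git bisect good <last-known-fast-commit>
# rebuild, run llama-server, check prompt processing speed, then:
$ git bisect good   # or: git bisect bad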
Relevant log output
none