Description
Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 4929 (3d82dbc)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 7 6800H + RTX 3050 Ti Laptop
Models
gemma-2-9b-it Q4_K_S
gemma-3-12b-it Q4_K_S
Self-built GGUFs
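The GGUFs were built locally. In case the quantization step matters, below is a minimal sketch of the usual convert-then-quantize flow; the model path and output filenames are placeholders, not my exact commands:
$ python convert_hf_to_gguf.py /path/to/gemma-2-9b-it --outfile gemma-2-9b-it.f16.gguf
$ build/bin/llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q4_K_S.gguf Q4_K_S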
Problem description & steps to reproduce
I'm seeing very slow prompt processing speeds on Gemma 2 and 3 models at Q4_K_S. This appears to be a regression introduced by a recent commit.
| | Prompt processing | Generation |
|---|---|---|
| Before | 420.10 T/s | 6.44 T/s |
| After | 37.07 T/s | 6.54 T/s |
The same models at IQ4_XS still run at full speed. I have not tried other models yet.
Running llama-server with:
$ build/bin/llama-server -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -c 8192 -ngl 14 -sp
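If it helps with reproducing the numbers, the regression should also show up in llama-bench; this is only a sketch, and the prompt/generation sizes here are arbitrary rather than the ones behind the table above:
$ build/bin/llama-bench -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -ngl 14 -p 512 -n 64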
First Bad Commit
Git bisect: 3d82dbc
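The bisect followed the standard flow, rebuilding and re-checking prompt processing speed at each step; the good commit below is a placeholder for the last build I knew to be fast:
$ git bisect start
$ git bisect bad HEAD
$ git bisect good <last-known-fast-commit>
# rebuild, run llama-server, check prompt processing speed, then:
$ git bisect good   # or: git bisect bad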
Relevant log output
none