
Eval bug: Slow prompt processing with Q4_K_S #12481

Closed
@Cynerva

Description

Name and Version

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 4929 (3d82dbc)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 7 6800H + RTX 3050 Ti Laptop

Models

gemma-2-9b-it Q4_K_S
gemma-3-12b-it Q4_K_S

Self-built GGUFs
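
Since the GGUFs were self-built, here is a minimal sketch of the usual llama.cpp quantization workflow (the checkpoint path and filenames are illustrative, not the exact ones used for this report):

$ python convert_hf_to_gguf.py /path/to/gemma-2-9b-it --outfile gemma-2-9b-it.f16.gguf
$ build/bin/llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q4_K_S.gguf Q4_K_S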

Problem description & steps to reproduce

I'm seeing very slow prompt processing speeds on Gemma 2 and Gemma 3 models at Q4_K_S. This appears to be a regression introduced by a recent commit.

         Prompt processing   Generation
Before   420.10 T/s          6.44 T/s
After    37.07 T/s           6.54 T/s

The same models at IQ4_XS still work fine. I have not tried other models yet.

Running llama-server with:

$ build/bin/llama-server -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -c 8192 -ngl 14 -sp
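
The regression can also be measured directly with llama-bench; this is a sketch matching the -ngl 14 offload above (the prompt/generation token counts are illustrative, not the settings the table numbers came from):

$ build/bin/llama-bench -m /path/to/gemma-2-9b-it.Q4_K_S.gguf -ngl 14 -p 512 -n 128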

First Bad Commit

Identified via git bisect: 3d82dbc
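
For reference, a minimal sketch of such a bisect run (the known-good ref is a placeholder for the last commit with fast Q4_K_S prompt processing; rebuild and re-test at each step):

$ git bisect start
$ git bisect bad master
$ git bisect good <known-good-ref>
$ cmake --build build --target llama-server -j
$ git bisect good   # or: git bisect bad, depending on the measured speed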

Relevant log output

none
