Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332

Open · wants to merge 1 commit into master
Conversation

Srihari-mcw
Contributor

  • This PR adds a block interleaving approach for Q4_K quantization on the x64/x86 AVX2 SIMD architecture
  • Good gains were observed in prompt processing with these changes compared to the current default path for Q4_K models (Q4_K_M and Q4_K_S)
  • The GEMM and GEMV functions are implemented for the AVX2 architecture
  • The quantize_q8_K_4x8 function quantizes float values into the block_q8_Kx4 format
  • The repack_q4_K_to_q4_K_8_bl function rearranges weights in Q4_K format into the Q4_Kx8 format (block_q4_Kx8); the base Q4_K/Q8_K block layouts that these formats build on are sketched right after this list
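
For reference, here is a simplified sketch of the existing Q4_K and Q8_K super-blocks (QK_K = 256) that the interleaved formats below build on. The field layout follows the upstream ggml definitions, but the ggml_half typedef is only a stand-in for ggml's half-precision type and the d/dmin union is flattened for clarity:

```c
#include <stdint.h>

#define QK_K         256  // values per super-block
#define K_SCALE_SIZE  12  // bytes of packed 6-bit scales/mins per Q4_K block

typedef uint16_t ggml_half;  // stand-in for ggml's fp16 storage type

typedef struct {
    ggml_half d;                  // super-block scale for quantized scales
    ggml_half dmin;               // super-block scale for quantized mins
    uint8_t scales[K_SCALE_SIZE]; // scales and mins for 8 sub-blocks, 6 bits each
    uint8_t qs[QK_K/2];           // 4-bit quants, two per byte
} block_q4_K;

typedef struct {
    float   d;                    // delta
    int8_t  qs[QK_K];             // quants
    int16_t bsums[QK_K/16];       // sums of quants in groups of 16
} block_q8_K;
```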

Block Interleaving Formats

Block_Q4_Kx8:

  • Holds the data of eight Q4_K blocks in an interleaved fashion
  • uint8 scales[96] - scales and mins are taken from the source Q4_K blocks; each 12-byte group within scales[96] holds the scales and mins of the corresponding sub-block across the source blocks
  • The d and dmin values from the source Q4_K blocks are stored together in an array
  • Quant values from the source Q4_K blocks are sequentially extracted and interleaved into groups of eight bytes (see the struct sketch after this list)
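
Building on the sketch above, a possible C layout for the interleaved weight block, inferred from these bullet points (a sketch only; the exact upstream definition and byte ordering may differ), together with an illustrative interleave of the quant bytes:

```c
#include <string.h>

typedef struct {
    ggml_half d[8];         // d values of the 8 source Q4_K blocks
    ggml_half dmin[8];      // dmin values of the 8 source Q4_K blocks
    uint8_t scales[96];     // 8 x 12 bytes of packed 6-bit scales/mins
    uint8_t qs[QK_K/2 * 8]; // 1024 quant bytes, interleaved in 8-byte groups
} block_q4_Kx8;

// Illustrative interleave: take eight consecutive quant bytes from each of
// the eight source blocks in turn (the actual ordering in the PR may differ).
static void interleave_q4_K_qs(const block_q4_K src[8], block_q4_Kx8 * dst) {
    int out = 0;
    for (int chunk = 0; chunk < QK_K/2; chunk += 8) { // 16 eight-byte chunks per block
        for (int b = 0; b < 8; b++) {                 // 8 source blocks
            memcpy(dst->qs + out, src[b].qs + chunk, 8);
            out += 8;
        }
    }
}
```

The eight-byte granularity presumably lets the AVX2 GEMM/GEMV kernels pull the same chunk position from several weight rows with contiguous 256-bit loads.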

Block_Q8_Kx4:

  • Delta values of the four source Q8_K blocks are stored together
  • Bsums for two consecutive sub-blocks from one source Q8_K block are stored together, followed by the bsums from the next Q8_K block
  • Quant values from the source Q8_K blocks are interleaved into groups of eight bytes (a struct sketch follows this list)
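
Similarly, a possible layout for the 4-way interleaved activation block, again inferred from the description above rather than copied from the upstream code:

```c
typedef struct {
    float   d[4];               // deltas of the 4 source Q8_K blocks, stored together
    int8_t  qs[QK_K * 4];       // quants, interleaved in groups of eight bytes
    int16_t bsums[QK_K/16 * 4]; // bsums, two consecutive sub-block sums per source block at a time
} block_q8_Kx4;
```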

GCC Linux:

Q4_K_M Model:

| model | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 45.80 ± 0.01 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 70.60 ± 0.08 | 54.13% | fae86a56 - Updated Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.91 ± 0.00 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.62 ± 0.00 | -1.97% | fae86a56 - Updated Commit |

Q4_K_S Model:

| model | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 46.60 ± 0.06 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 77.25 ± 0.29 | 65.76% | fae86a56 - Updated Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.09 ± 0.00 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 13.85 ± 0.00 | -1.74% | fae86a56 - Updated Commit |

GCC Version = 12.3

The models were quantized from the meta-llama Llama-2-7b model (https://huggingface.co/meta-llama/Llama-2-7b) and tested.

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Additionally, the PR was also tested for execution with Clang on Linux.

Further, perplexity was measured with the Q4_K_S model and found to be similar between the two commits; the results are tabulated as follows:

| model | perplexity (final estimate PPL) | Commit id |
| --- | --- | --- |
| llama 7B Q4_K_S | 5.8898 ± 0.03282 | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 5.8889 ± 0.03282 | fae86a56 - Updated Commit |

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 11, 2025.