Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332
Block Interleaving Formats
Block_Q4_Kx8 :
Block_Q8_Kx4:
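For readers unfamiliar with the term, the general idea of block interleaving can be sketched as below. This is an illustration only, not the layout used by this PR: the function name `interleave_rows`, the constants `NROWS` and `CHUNK`, and the choice of 8-byte chunks are all assumptions made for the example. It takes the same byte range from eight separate source blocks and stores it as alternating fixed-size chunks, so a SIMD kernel can load data for several rows from one contiguous region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameters for illustration: 8 source rows, 8-byte chunks. */
enum { NROWS = 8, CHUNK = 8 };

/* Interleave `len` bytes from each of NROWS source buffers into `dst`.
 * Output order: chunk 0 of row 0, chunk 0 of row 1, ..., chunk 0 of row 7,
 * then chunk 1 of row 0, and so on. `dst` must hold NROWS * len bytes and
 * `len` must be a multiple of CHUNK. */
static void interleave_rows(uint8_t *dst, const uint8_t *const *src, size_t len) {
    assert(len % CHUNK == 0);
    size_t out = 0;
    for (size_t off = 0; off < len; off += CHUNK) {
        for (int r = 0; r < NROWS; ++r) {
            memcpy(dst + out, src[r] + off, CHUNK);
            out += CHUNK;
        }
    }
}
```

With this layout, one 64-byte contiguous read covers the same 8-byte slice of all eight rows, which is the property an interleaved format is meant to provide.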
GCC Linux :
Q4_K_M Model :
Q4_K_S Model :
GCC Version = 12.3
The models were quantized from the Meta Llama 2 7B model (https://huggingface.co/meta-llama/Llama-2-7b) and then tested.
The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Additionally, the PR was tested for execution with Clang on Linux.
Perplexity was also measured with the Q4_K_S model and found to be similar:
The perplexity results are tabulated as follows: