Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332

Open · wants to merge 1 commit into master
Conversation

Srihari-mcw
Contributor

  • This PR adds a block interleaving approach for Q4_K quantization on the x64/x86 AVX2 SIMD architecture
  • Good gains were observed in prompt processing with these changes compared to the current default path for Q4_K models (Q4_K_M and Q4_K_S)
  • The GEMM and GEMV functions are implemented for the AVX2 architecture
  • The quantize_q8_K_4x8 function quantizes float values into the block_q8_Kx4 format
  • The repack_q4_K_to_q4_K_8_bl function rearranges weights in Q4_K format into the Q4_Kx8 format (block_q4_Kx8); the base Q4_K/Q8_K block layouts that these formats build on are sketched right after this list
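
For reference, here is a simplified sketch of the existing Q4_K and Q8_K super-blocks (QK_K = 256) that the interleaved formats below build on. The field layout follows the upstream ggml definitions, but the ggml_half typedef is only a stand-in for ggml's half-precision type and the d/dmin union is flattened for clarity:

```c
#include <stdint.h>

#define QK_K         256  // values per super-block
#define K_SCALE_SIZE  12  // bytes of packed 6-bit scales/mins per Q4_K block

typedef uint16_t ggml_half;  // stand-in for ggml's fp16 storage type

typedef struct {
    ggml_half d;                  // super-block scale for quantized scales
    ggml_half dmin;               // super-block scale for quantized mins
    uint8_t scales[K_SCALE_SIZE]; // scales and mins for 8 sub-blocks, 6 bits each
    uint8_t qs[QK_K/2];           // 4-bit quants, two per byte
} block_q4_K;

typedef struct {
    float   d;                    // delta
    int8_t  qs[QK_K];             // quants
    int16_t bsums[QK_K/16];       // sums of quants in groups of 16
} block_q8_K;
```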

Block Interleaving Formats

Block_Q4_Kx8:

  • Holds the data of eight Q4_K blocks in an interleaved fashion
  • uint8 scales[96] - scales and mins are taken from the source Q4_K blocks; each 12-byte group within scales[96] holds the scales and mins of the corresponding sub-block across the source blocks
  • The d and dmin values from the source Q4_K blocks are stored together in an array
  • Quant values from the source Q4_K blocks are sequentially extracted and interleaved into groups of eight bytes (see the struct sketch after this list)
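
Building on the sketch above, a possible C layout for the interleaved weight block, inferred from these bullet points (a sketch only; the exact upstream definition and byte ordering may differ), together with an illustrative interleave of the quant bytes:

```c
#include <string.h>

typedef struct {
    ggml_half d[8];         // d values of the 8 source Q4_K blocks
    ggml_half dmin[8];      // dmin values of the 8 source Q4_K blocks
    uint8_t scales[96];     // 8 x 12 bytes of packed 6-bit scales/mins
    uint8_t qs[QK_K/2 * 8]; // 1024 quant bytes, interleaved in 8-byte groups
} block_q4_Kx8;

// Illustrative interleave: take eight consecutive quant bytes from each of
// the eight source blocks in turn (the actual ordering in the PR may differ).
static void interleave_q4_K_qs(const block_q4_K src[8], block_q4_Kx8 * dst) {
    int out = 0;
    for (int chunk = 0; chunk < QK_K/2; chunk += 8) { // 16 eight-byte chunks per block
        for (int b = 0; b < 8; b++) {                 // 8 source blocks
            memcpy(dst->qs + out, src[b].qs + chunk, 8);
            out += 8;
        }
    }
}
```

The eight-byte granularity presumably lets the AVX2 GEMM/GEMV kernels pull the same chunk position from several weight rows with contiguous 256-bit loads.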

Block_Q8_Kx4:

  • Delta values of the four source Q8_K blocks are stored together
  • Bsums for two consecutive sub-blocks from one source Q8_K block are stored together, followed by the bsums from the next Q8_K block
  • Quant values from the source Q8_K blocks are interleaved into groups of eight bytes (a struct sketch follows this list)
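
Similarly, a possible layout for the 4-way interleaved activation block, again inferred from the description above rather than copied from the upstream code:

```c
typedef struct {
    float   d[4];               // deltas of the 4 source Q8_K blocks, stored together
    int8_t  qs[QK_K * 4];       // quants, interleaved in groups of eight bytes
    int16_t bsums[QK_K/16 * 4]; // bsums, two consecutive sub-block sums per source block at a time
} block_q8_Kx4;
```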

GCC Linux:

Q4_K_M Model:

| model | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 45.80 ± 0.01 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 70.60 ± 0.08 | 54.13% | fae86a56 - Updated Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.91 ± 0.00 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.62 ± 0.00 | -1.97% | fae86a56 - Updated Commit |

Q4_K_S Model:

| model | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 46.60 ± 0.06 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 77.25 ± 0.29 | 65.76% | fae86a56 - Updated Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.09 ± 0.00 | | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 13.85 ± 0.00 | -1.74% | fae86a56 - Updated Commit |

GCC Version = 12.3

The models were quantized from the meta-llama Llama-2-7b model (https://huggingface.co/meta-llama/Llama-2-7b) and tested.

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Additionally, the PR was also tested for execution with Clang on Linux.

Further, perplexity was measured with the Q4_K_S model and found to be similar between the two commits; the results are tabulated as follows:

| model | perplexity (final estimate PPL) | Commit id |
| --- | --- | --- |
| llama 7B Q4_K_S | 5.8898 ± 0.03282 | 57b6abf8 - Base Commit |
| llama 7B Q4_K_S | 5.8889 ± 0.03282 | fae86a56 - Updated Commit |

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 11, 2025.