
ggml-cpu: enable IBM NNPA Vector Intrinsics #14317


Merged
merged 92 commits on Jun 25, 2025
Commits
5801806
ggml-cpu: add nnpa compile flag
taronaeo Jun 20, 2025
45a4cf6
ggml-cpu: add fp16->fp32 nnpa first
taronaeo Jun 20, 2025
ebf9f34
ggml-cpu: add fp32->fp16
taronaeo Jun 20, 2025
ffe2964
ggml-cpu: better variable names
taronaeo Jun 20, 2025
0394a00
docs: update s390x docs
taronaeo Jun 20, 2025
48b820d
ggml-cpu: add debugging prints to see if dlf16 is correct
taronaeo Jun 21, 2025
d9cc63a
ggml-cpu: fix print vs printf
taronaeo Jun 21, 2025
94f10ca
ggml-cpu: fix float placeholder
taronaeo Jun 21, 2025
8f3a5af
ggml-cpu: ensure fp16 and fp32 load and stores are called
taronaeo Jun 21, 2025
575ea9f
ggml-cpu: fp16 load ensured to hit
taronaeo Jun 21, 2025
9330454
ggml-cpu: remove sigint from fp16 store
taronaeo Jun 21, 2025
ebc1d19
ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
taronaeo Jun 21, 2025
6a25fd8
ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
taronaeo Jun 21, 2025
f9f6c7e
ggml-cpu: nnpa switch to vec_xst test
taronaeo Jun 21, 2025
6d507bb
ggml-cpu: switch to vec_xst for 4 element loops also
taronaeo Jun 21, 2025
8312adc
ggml-cpu: rework noop
taronaeo Jun 21, 2025
27b4c3f
ggml-cpu: remove noop, general code cleanup
taronaeo Jun 21, 2025
e0f8fb9
ggml-cpu: clarify variable naming
taronaeo Jun 21, 2025
bb9345c
ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
taronaeo Jun 21, 2025
5424d9e
ggml-cpu: add breakpoint for debugging
taronaeo Jun 21, 2025
4f017d7
ggml-cpu: test fix for conversion failure
taronaeo Jun 21, 2025
27131e5
ggml-cpu: disable fp32->fp16 nnpa conversions for now
taronaeo Jun 21, 2025
946c78e
ggml-cpu: switch to elif macro
taronaeo Jun 21, 2025
433d587
ggml-cpu: reattempt fp32->fp16
taronaeo Jun 21, 2025
54811fc
ggml-cpu: fix typo
taronaeo Jun 21, 2025
e12e9fe
ggml-cpu: reattempt fp32->fp16
taronaeo Jun 21, 2025
7413dab
ggml-cpu: fix compiler types
taronaeo Jun 21, 2025
373fa28
ggml-cpu: change to typedef vector types
taronaeo Jun 21, 2025
4621a23
ggml-cpu: add 4 element loops for fp32->fp16
taronaeo Jun 21, 2025
987d169
ggml-cpu: clarified vector naming
taronaeo Jun 21, 2025
8ef51b9
ggml-cpu: bring back fp32->fp16 store nnpa
taronaeo Jun 21, 2025
f1b1d98
ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
taronaeo Jun 21, 2025
1547ea2
ggml-cpu: add nnpa macro check in ggml-impl
taronaeo Jun 21, 2025
0e571dd
ggml-cpu: add missing __func__
taronaeo Jun 21, 2025
4ad6efa
ggml-cpu: diagnose why __NNPA__ macro is not being defined
taronaeo Jun 21, 2025
8129838
ggml-cpu: import vecintrin.h to fix compiler errors
taronaeo Jun 21, 2025
e7910fc
ggml-cpu: update macro tests
taronaeo Jun 21, 2025
157f856
ggml-cpu: move s390x typedef to own header file
taronaeo Jun 21, 2025
48df977
Revert "ggml-cpu: move s390x typedef to own header file"
taronaeo Jun 21, 2025
3004a79
ggml-cpu: switch to importing ggml-cpu-impl instead
taronaeo Jun 21, 2025
1cacdd9
ggml-cpu: fix macro declaration
taronaeo Jun 21, 2025
fadc138
ggml-cpu: test more macros
taronaeo Jun 21, 2025
ed76ff6
ggml-cpu: add debug prints
taronaeo Jun 21, 2025
8459338
ggml-cpu: bruteforce macro definitions
taronaeo Jun 21, 2025
72c9143
ggml-cpu: move macro definitions
taronaeo Jun 21, 2025
a91c3ab
ggml-cpu: add ggml-impl.h to cmakelists
taronaeo Jun 21, 2025
ba3513e
ggml-cpu: switch to private macros
taronaeo Jun 21, 2025
18d79e1
ggml-cpu: move s390x typedef to own header file
taronaeo Jun 21, 2025
781c263
ggml-cpu: move things around
taronaeo Jun 21, 2025
263b820
ggml-cpu: bring back compile macros
taronaeo Jun 21, 2025
04a395e
ggml-cpu: switch to quotes for import
taronaeo Jun 21, 2025
c8b3b89
ggml-cpu: add compiler error macro
taronaeo Jun 21, 2025
ebb8489
ggml-cpu: add s390x detection in ggml-src
taronaeo Jun 21, 2025
3ec0bdc
ggml-cpu: bring back compile definitions
taronaeo Jun 21, 2025
e43dc82
ggml-cpu: undo cmakelists work
taronaeo Jun 21, 2025
5c9b083
Revert "ggml-cpu: move s390x typedef to own header file"
taronaeo Jun 21, 2025
1b4dbf4
ggml-cpu: remove typedefs.h
taronaeo Jun 21, 2025
46227c6
ggml-cpu: remove typedef from cmakelists
taronaeo Jun 21, 2025
72965ea
ggml-cpu: add ggml-impl.h future notes
taronaeo Jun 21, 2025
07de57c
ggml-cpu: add todo comment for future reference
taronaeo Jun 21, 2025
489cdf4
ggml-cpu: clarify naming of dlf16
taronaeo Jun 21, 2025
5004e43
ggml-cpu: remove unnecessary target compile definitions
taronaeo Jun 21, 2025
5834dee
ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
taronaeo Jun 23, 2025
bd288e8
ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
taronaeo Jun 24, 2025
4d136cb
docs: update broken huggingface link for s390x
taronaeo Jun 24, 2025
fbb7334
ggml-cpu: fix duplicate func names during compile
taronaeo Jun 24, 2025
e73413b
Revert "ggml-cpu: fix duplicate func names during compile"
taronaeo Jun 24, 2025
8a5e011
Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
taronaeo Jun 24, 2025
17b032f
ggml: refactor fp16<->fp32 simd to ggml-cpu
taronaeo Jun 24, 2025
0367b80
ggml-cpu: fix missing simd-mappings.h import in quants.c
taronaeo Jun 24, 2025
e615f73
ggml-cpu: fix missing simd-mappings.h within repack
taronaeo Jun 24, 2025
3c055a4
ggml-cpu: fix amx mmq missing simd-mappings.h
taronaeo Jun 24, 2025
e4666f9
ggml-cpu: attempt at fixing loongarch failing build
taronaeo Jun 24, 2025
e4a7f84
ggml-cpu: move nnpa together with other fp16<->fp32 simd
taronaeo Jun 24, 2025
1e6ebb2
ggml-cpu: fix wrong refactor of ggml-base
taronaeo Jun 24, 2025
64568ff
ggml: remove dependency on ggml-cpu from ggml-base
taronaeo Jun 24, 2025
a02b360
ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
taronaeo Jun 24, 2025
1b23fec
ggml-cpu: remove mistaken fallback macro
taronaeo Jun 24, 2025
9e40d98
ggml: move ggml_table_f32_f16 to ggml-cpu
taronaeo Jun 25, 2025
32a3533
ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
taronaeo Jun 25, 2025
827fce9
Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci…
taronaeo Jun 25, 2025
5be39c1
Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
taronaeo Jun 25, 2025
59b48e4
ggml: move ggml_table_f32_f16 to ggml-cpu
taronaeo Jun 25, 2025
6cebee2
ggml: move ggml_table_f32_f16 to ggml-cpu.c
taronaeo Jun 25, 2025
5f2a09a
ggml-cpu: extern c ggml_table_f32_f16 + chore docs
taronaeo Jun 25, 2025
f71b21d
ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
taronaeo Jun 25, 2025
176e1db
Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
taronaeo Jun 25, 2025
2dce119
ggml-cpu: bring back ggml_table_f32_f16
taronaeo Jun 25, 2025
bb35ea6
Revert "ggml-cpu: bring back ggml_table_f32_f16"
taronaeo Jun 25, 2025
8efdc0b
fix ggml time initialization
slaren Jun 25, 2025
4ce16fa
fix f32_f16 table init
slaren Jun 25, 2025
97620ac
remove extra line
slaren Jun 25, 2025
41 changes: 29 additions & 12 deletions docs/build-s390x.md
@@ -28,8 +28,9 @@ cmake --build build --config Release -j $(nproc)
```

**Notes**:

- For faster repeated compilation, install [ccache](https://ccache.dev/)
- By default, VXE/VXE2 is enabled. To disable it (not recommended):

```bash
cmake -S . -B build \
@@ -41,18 +42,29 @@
cmake --build build --config Release -j $(nproc)
```

- By default, NNPA is enabled when available. To disable it (not recommended):

```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NNPA=OFF
cmake --build build --config Release -j $(nproc)
```

- For debug builds:

```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Debug -j $(nproc)
```

- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:

```bash
cmake -S . -B build \
Expand All @@ -70,7 +82,7 @@ All models need to be converted to Big-Endian. You can achieve this in three cas

1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**

You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).

These models and their respective tokenizers are verified to run correctly on IBM Z & LinuxONE.

@@ -101,27 +113,33 @@ All models need to be converted to Big-Endian. You can achieve this in three cases:
```

For example,

```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
```

**Notes:**

- The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2, as sketched below.
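
  A sketch of that manual route (assuming `convert_hf_to_gguf.py` and its `--bigendian` flag; the model path is illustrative):

  ```bash
  # Sketch: convert a safetensors checkpoint straight to Big-Endian GGUF,
  # skipping the separate endian-conversion step.
  python3 convert_hf_to_gguf.py ./granite-3.3-2b-instruct/ \
      --outtype f16 \
      --bigendian \
      --outfile granite-3.3-2b-instruct-be.f16.gguf
  ```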

## IBM Accelerators

### 1. SIMD Acceleration

Only available on IBM z15 or later systems with the `-DGGML_VXE=ON` compile flag (turned on by default). No hardware acceleration is possible with llama.cpp on older systems such as IBM z14/arch12; on such systems, the APIs still run but fall back to a scalar implementation.

### 2. NNPA Vector Intrinsics Acceleration

Only available on IBM z16 or later systems with the `-DGGML_NNPA=ON` compile flag (turned on when available). No hardware acceleration is possible with llama.cpp on older systems such as IBM z15/arch13; on such systems, the APIs still run but fall back to a scalar implementation.
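
For context, the conversion path this PR takes looks roughly like the following (a hedged sketch: `nnpa_fp16_to_fp32`/`nnpa_fp32_to_fp16` are illustrative names, and the intrinsic spellings assume the arch14 `vecintrin.h`; the actual helpers live in `simd-mappings.h`):

```c
// Minimal sketch of the NNPA scalar FP16 <-> FP32 conversion path,
// assuming GCC/Clang on s390x with -march=z16 -mzvector and the
// arch14 vecintrin.h intrinsics named below.
#include <vecintrin.h>
#include <stdint.h>

static inline float nnpa_fp16_to_fp32(uint16_t h) {
    vector unsigned short v_h  = vec_splats(h);
    // VCNF: IEEE binary16 -> NNP dlfloat16
    vector unsigned short v_hd = vec_convert_from_fp16(v_h, 0);
    // VCLFNH: dlfloat16 (high halves) -> IEEE binary32
    return vec_extend_to_fp32_hi(v_hd, 0)[0];
}

static inline uint16_t nnpa_fp32_to_fp16(float f) {
    vector float v_f    = vec_splats(f);
    vector float v_zero = vec_splats(0.0f);
    // VCRNF: two binary32 vectors -> one dlfloat16 vector
    vector unsigned short v_hd = vec_round_from_fp32(v_f, v_zero, 0);
    // VCFN: dlfloat16 -> IEEE binary16
    vector unsigned short v_h  = vec_convert_to_fp16(v_hd, 0);
    return v_h[0];
}
```
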
### 3. zDNN Accelerator

_Only available in IBM z16 or later system. No direction at the moment._

### 4. Spyre Accelerator

_No direction at the moment._

## Performance Tuning

@@ -154,4 +172,3 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly…
2. **Other Questions**
Please reach out directly to [[email protected]](mailto:[email protected]).
4 changes: 4 additions & 0 deletions docs/build.md
@@ -557,6 +557,10 @@ ninja

To read documentation for how to build on Android, [click here](./android.md)

## IBM Z & LinuxONE

To read documentation for how to build on IBM Z & LinuxONE, [click here](./build-s390x.md)

## Notes about GPU-accelerated backends

The GPU may still be used to accelerate some parts of the computation even when using the `-ngl 0` option. You can fully disable GPU acceleration by using `--device none`.
1 change: 1 addition & 0 deletions ggml/CMakeLists.txt
@@ -131,6 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
option(GGML_VXE "ggml: enable vxe" ON)
option(GGML_NNPA "ggml: enable nnpa" ON)

option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
1 change: 1 addition & 0 deletions ggml/include/ggml-cpu.h
@@ -101,6 +101,7 @@ extern "C" {
GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
GGML_BACKEND_API int ggml_cpu_has_vsx (void);
GGML_BACKEND_API int ggml_cpu_has_vxe (void);
GGML_BACKEND_API int ggml_cpu_has_nnpa (void);
GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
GGML_BACKEND_API int ggml_cpu_has_llamafile (void);

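A minimal usage sketch for the new flag (a hypothetical standalone program, not part of this PR; `ggml_cpu_has_vxe` is the pre-existing sibling declared just above):

```c
// Sketch: probe the CPU backend's s390x feature flags at runtime.
#include <stdio.h>
#include "ggml-cpu.h"

int main(void) {
    printf("VXE : %d\n", ggml_cpu_has_vxe());   // 1 if built with VXE/VXE2
    printf("NNPA: %d\n", ggml_cpu_has_nnpa());  // 1 if built with GGML_NNPA
    return 0;
}
```
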
8 changes: 8 additions & 0 deletions ggml/src/ggml-cpu/CMakeLists.txt
@@ -448,6 +448,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)

# TODO: Separation to determine activation of VX/VXE/VXE2
if (${S390X_M} MATCHES "8561|8562")
set(GGML_NNPA OFF)
message(STATUS "z15 target")
list(APPEND ARCH_FLAGS -march=z15)
elseif (${S390X_M} MATCHES "3931")
@@ -464,7 +465,14 @@
endif()

if (GGML_VXE)
message(STATUS "VX/VXE/VXE2 enabled")
list(APPEND ARCH_FLAGS -mvx -mzvector)
list(APPEND ARCH_DEFINITIONS GGML_VXE)
endif()

if (GGML_NNPA)
message(STATUS "NNPA enabled")
list(APPEND ARCH_DEFINITIONS GGML_NNPA)
endif()
elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
message(STATUS "Wasm detected")
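
Stepping back from the diff: a quick sanity check of the new `GGML_NNPA` option (a sketch, assuming an s390x host) is to configure and look for the `NNPA enabled` status line added above:

```bash
# Sketch: configure with the new option explicitly on, then build.
cmake -S . -B build -DGGML_NNPA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"
```
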
19 changes: 10 additions & 9 deletions ggml/src/ggml-cpu/amx/mmq.cpp
@@ -8,6 +8,7 @@
#include "mmq.h"
#include "ggml-impl.h"
#include "ggml-cpu-impl.h"
#include "simd-mappings.h"
#include "quants.h"
#include "ggml-quants.h"
#include <algorithm>
@@ -453,7 +454,7 @@ void quantize_row_q8_K_vnni(const float * RESTRICT x, void * RESTRICT vy, int64_t …

// Quantize these floats
const float iscale = 127.f / amax;
y[i].d = GGML_FP32_TO_FP16(1 / iscale);
y[i].d = GGML_CPU_FP32_TO_FP16(1 / iscale);
const float id = ( amax != 0.0f ) ? iscale : 0.f;
const __m512 vscale = _mm512_set1_ps(id);

@@ -1090,7 +1091,7 @@ struct acc_C<block_q8_0, block_q4_0, is_acc> {
const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

for (int m = 0; m < nr; ++m) {
const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

__m512 vsum;
@@ -1113,8 +1114,8 @@ struct acc_C<block_q8_1, block_q4_1, is_acc> {
const __m512 vm0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset + TILE_N * sizeof(ggml_half))));

for (int m = 0; m < nr; ++m) {
const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
const __m512 vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].s));
const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
const __m512 vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].s));
const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

__m512 vsum;
@@ -1137,7 +1138,7 @@ struct acc_C<block_q8_0, block_q8_0, is_acc> {
const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

for (int m = 0; m < nr; ++m) {
const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

__m512 vsum;
@@ -1437,7 +1438,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q4_0, float, BLOCK_M, BLOCK_N, BLOCK_K>
va[k] = _mm512_set1_epi32(a_ptr[k]);
vcomp = _mm512_dpbusd_epi32(vcomp, off, va[k]);
}
vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
}

// load b
@@ -1498,8 +1499,8 @@ struct tinygemm_kernel_vnni<block_q8_1, block_q4_1, float, 1, BLOCK_N, BLOCK_K>
for (int k = 0; k < 8; ++k) {
va[k] = _mm512_set1_epi32(a_ptr[k]);
}
vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].s));
vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].s));
}

// load b
@@ -1571,7 +1572,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q8_0, float, BLOCK_M, BLOCK_N, BLOCK_K>
va[k] = _mm512_set1_epi32(a_ptr[k]);
va[k] = _mm512_add_epi8(va[k], off);
}
vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
}

// load b