
ggml-cpu: enable IBM NNPA Vector Intrinsics #14317


Open
wants to merge 91 commits into master

Conversation

@taronaeo (Contributor) commented Jun 21, 2025

This pull request enables the IBM NNPA instruction set on the s390x platform for IBM z16 and later mainframes. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.

Note: This PR supersedes #14303 because that implementation was wrong.
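For context, the kind of conversion NNPA accelerates looks roughly like the sketch below. This is my illustration rather than the actual patch: the intrinsics (vec_convert_from_fp16, vec_extend_to_fp32_hi/lo) follow the GCC/Clang vecintrin.h NNPA (arch14) builtins, and the helper name fp16_to_fp32_x8 is hypothetical.

// Sketch: widen 8 IEEE FP16 values to FP32 on s390x with NNPA (z16/arch14).
// Assumes GCC or Clang with -march=z16 -mzvector and <vecintrin.h>.
#include <vecintrin.h>
#include <stdint.h>

static inline void fp16_to_fp32_x8(const uint16_t *src, float *dst) {
    vector unsigned short v16 = vec_xl(0, src);                // load 8 IEEE fp16 values
    vector unsigned short nnp = vec_convert_from_fp16(v16, 0); // IEEE fp16 -> NNP fp16 (VCNF)
    vector float hi = vec_extend_to_fp32_hi(nnp, 0);           // elements 0..3 -> fp32 (VCLFNH)
    vector float lo = vec_extend_to_fp32_lo(nnp, 0);           // elements 4..7 -> fp32 (VCLFNL)
    vec_xst(hi, 0, dst);                                       // store the fp32 results
    vec_xst(lo, 0, dst + 4);
}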

Verification

To ensure that this implementation did not break anything, the NNPA instruction set has been tested on the following models:

  • Tested IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q3_K, Q4_K, Q5_K)
  • Kindly request additional models for testing in this PR

Performance Results

I used IBM Granite 3.3 for the performance tests. We see an improvement of roughly 0.7% for F16 prompt processing and about 34% for F16 token generation (1.46 -> 1.96 t/s), which is the expected outcome.

Before NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.56 ± 0.23 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.75 ± 0.01  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 30.94 ± 0.20 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.46 ± 0.01  |

After NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.85 ± 0.10 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.70 ± 0.01  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.16 ± 0.04 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.96 ± 0.02  |

Note

Tests were conducted on an IBM z16 Mainframe with 2 IFLs (4 vCores) and 64 GB Memory on z/VM (Type-2)
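The pp512/tg128 columns are llama-bench's default tests, so the numbers above were presumably collected with an invocation along these lines (the model filename here is a placeholder):

$ ./build/bin/llama-bench -m granite-3.3-f16.gguf -t 4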

SIMD activations for ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 are ready. However, I was unable to find a way to make the s390x platform-detection macros usable in ggml-impl.h, so the correct implementation is left in place for now until we can fix the detection.
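For illustration, the guard I would have liked to use in ggml-impl.h is sketched below. Treat it as hypothetical: GGML_NNPA as a build-system-defined macro is my assumption, since the compiler does not define an NNPA feature macro on its own.

// Hypothetical platform guard (sketch), following the existing
// per-architecture pattern in ggml-impl.h (e.g. __F16C__, __ARM_NEON):
#if defined(__s390x__) && defined(GGML_NNPA)
    #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
    #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
#endif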

Edit 1: This PR now also contains the ggml-base and ggml-cpu refactor for FP16<->FP32 SIMD, as requested in #14317 (comment).

Please review this pull request and consider merging it into the main repository. Thank you!

taronaeo added 30 commits June 21, 2025 14:46
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 4a9f60c)
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 8d4a798)
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0ff0d65)
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 2f58bbc)
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 01b9294)
for some reason, the function is not getting a hit when debugged with
    gdb. we will need to investigate further

Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
there are some conversion failures in nnpa that requires the eyes of an
ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
@taronaeo requested a review from slaren June 24, 2025 16:07
taronaeo added 2 commits June 25, 2025 01:07
fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo <[email protected]>
@taronaeo (Contributor Author)

Refactored ggml-cpu from GGML_FP16_TO_FP32 to GGML_CPU_FP16_TO_FP32 and GGML_FP32_TO_FP16 to GGML_CPU_FP32_TO_FP16 in the latest push :)
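For readers following along, the shape of that rename is sketched below; the exact placement and fallback wiring are my assumptions, not a quote of the patch.

// Sketch: per-arch SIMD paths define GGML_CPU_FP16_TO_FP32 first;
// anything left undefined falls back to the shared scalar macros.
#if !defined(GGML_CPU_FP16_TO_FP32)
    #define GGML_CPU_FP16_TO_FP32(x) GGML_FP16_TO_FP32(x)
    #define GGML_CPU_FP32_TO_FP16(x) GGML_FP32_TO_FP16(x)
#endif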

@slaren PTAL again

@slaren (Member) commented Jun 24, 2025

I believe these includes can be removed now, or moved to the CPU backend if necessary:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif // __ARM_FEATURE_SVE
#if defined(__ARM_NEON) && !defined(__CUDACC__) && !defined(__MUSACC__)
// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
//
// $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
//
#include <arm_neon.h>
#endif
#if defined(__F16C__)
#include <immintrin.h>
#endif

@slaren requested a review from Copilot June 24, 2025 19:43

Comment on lines +144 to +148
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return ggml_table_f32_f16[s];
}
Member

The lookup table ggml_table_f32_f16 is still in ggml-base; it should be moved to ggml-cpu as well, since it is now only used in the CPU backend. The initialization can be done in ggml_cpu_init.

@taronaeo (Contributor Author) Jun 25, 2025

The initialization can be done in ggml_cpu_init.

So move this line

ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);

into this code block?

for (int i = 0; i < (1 << 16); ++i) {
    union {
        uint16_t u16;
        ggml_fp16_t fp16;
    } u = {i};
    float f = GGML_FP16_TO_FP32(u.fp16);
    ggml_table_gelu_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_f32(f));
    ggml_table_gelu_quick_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_quick_f32(f));
}

I feel like I'm stepping into dangerous territory 😅

@taronaeo (Contributor Author) Jun 25, 2025

Fixed in the latest push, but I'm unsure whether any problems will arise from my code change.

Edit: As expected, the change had problems with the Windows, Vulkan, and Server builds. I will revert the patches until I can get a better direction on this 😔

The patch in question:
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index 70f32801..ce296898 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -3479,6 +3479,7 @@ void ggml_cpu_init(void) {
                     ggml_fp16_t fp16;
                 } u = {i};
                 float f = GGML_CPU_FP16_TO_FP32(u.fp16);
+                ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
                 ggml_table_gelu_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_f32(f));
                 ggml_table_gelu_quick_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_quick_f32(f));
             }
diff --git a/ggml/src/ggml-cpu/simd-mappings.h b/ggml/src/ggml-cpu/simd-mappings.h
index 655ab3c6..2f65ccd1 100644
--- a/ggml/src/ggml-cpu/simd-mappings.h
+++ b/ggml/src/ggml-cpu/simd-mappings.h
@@ -137,6 +137,10 @@
     }
 #endif
 
+// precomputed f32 table for f16 (256 KB)
+// defined in ggml.c, initialized in ggml_init()
+GGML_API float ggml_table_f32_f16[1 << 16];
+
 // On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
 // so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
 // This is also true for POWER9.
diff --git a/ggml/src/ggml-impl.h b/ggml/src/ggml-impl.h
index 8d9bdc74..57761644 100644
--- a/ggml/src/ggml-impl.h
+++ b/ggml/src/ggml-impl.h
@@ -393,10 +393,6 @@ static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
 #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
 #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
 
-// precomputed f32 table for f16 (256 KB)
-// defined in ggml.c, initialized in ggml_init()
-GGML_API float ggml_table_f32_f16[1 << 16];
-
 /**
  * Converts brain16 to float32.
  *
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index f8e7c595..e0e46288 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -1414,27 +1414,6 @@ static inline bool ggml_can_repeat_rows(const struct ggml_tensor * t0, const str
 ////////////////////////////////////////////////////////////////////////////////
 
 struct ggml_context * ggml_init(struct ggml_init_params params) {
-    static bool is_first_call = true;
-
-    ggml_critical_section_start();
-
-    if (is_first_call) {
-        // initialize time system (required on Windows)
-        ggml_time_init();
-
-        for (int i = 0; i < (1 << 16); ++i) {
-            union {
-                uint16_t u16;
-                ggml_fp16_t fp16;
-            } u = {i};
-            ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
-        }
-
-        is_first_call = false;
-    }
-
-    ggml_critical_section_end();
-
     struct ggml_context * ctx = GGML_MALLOC(sizeof(struct ggml_context));
 
     // allow to call ggml_init with 0 size
-- 
2.39.5 (Apple Git-154)

Member

You would have to move the definition of ggml_table_f32_f16 from ggml.c to somewhere in the CPU backend as well.
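Concretely, that means moving the definition itself, not just the extern declaration; a sketch, with ggml-cpu.c as the assumed destination:

// ggml/src/ggml.c: remove the definition there ...
// ggml/src/ggml-cpu/ggml-cpu.c: ... and define the table in the CPU backend
float ggml_table_f32_f16[1 << 16];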

@taronaeo (Contributor Author) commented Jun 25, 2025

I believe these includes can be removed now, or moved to the CPU backend if necessary:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif // __ARM_FEATURE_SVE
#if defined(__ARM_NEON) && !defined(__CUDACC__) && !defined(__MUSACC__)
// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
//
// $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
//
#include <arm_neon.h>
#endif
#if defined(__F16C__)
#include <immintrin.h>
#endif

It is already moved to the CPU backend, but I left the headers there because more SIMD code within ggml-base still depends on those headers. E.g.,

#elif defined(__ARM_NEON)
    // FP16 path: check 8 values at a time. A half-precision value is
    // Inf/NaN iff its exponent bits (0x7c00) are all set; on a hit, the
    // scalar validate_fp16 fallback reports the offending index.
    for (; i + 7 < nb; i += 8) {
        uint16x8_t v = vld1q_u16(f + i);
        uint16x8_t vexp = vandq_u16(v, vdupq_n_u16(0x7c00));
        uint16x8_t cmp = vceqq_u16(vexp, vdupq_n_u16(0x7c00));
        uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(cmp, 4)), 0);
        if (mask) {
            for (size_t j = 0; j < 8; ++j) {
                if (!validate_fp16(f[i + j], i + j)) {
                    return false;
                }
            }
            GGML_UNREACHABLE();
        }
    }

#elif defined(__ARM_NEON)
    // FP32 path: same idea, 4 values at a time; a single-precision value
    // is Inf/NaN iff its exponent bits (0x7f800000) are all set.
    for (; i + 3 < nb; i += 4) {
        uint32x4_t v = vld1q_u32((const uint32_t *)f + i);
        uint32x4_t vexp = vandq_u32(v, vdupq_n_u32(0x7f800000));
        uint32x4_t cmp = vceqq_u32(vexp, vdupq_n_u32(0x7f800000));
        uint64_t mask = vget_lane_u64(vreinterpret_u64_u16(vshrn_n_u32(cmp, 8)), 0);
        if (mask) {
            for (size_t j = 0; j < 4; ++j) {
                if (!validate_float(f[i + j], i + j)) {
                    return false;
                }
            }
            GGML_UNREACHABLE();
        }
    }

Was wondering if you have a proper place in ggml-cpu that I could move these into?

@taronaeo (Contributor Author)

Okay, I've been trying to move the ggml_validate_row_data SIMD operations from ggml-base/ggml-quants.c into the respective ggml-cpu/arch/quants.c, but I'm not confident doing this refactor. Can we move this specific code change to another PR, or have someone else own the change instead?

@slaren (Member) commented Jun 25, 2025

Was wondering if you have a proper place in ggml-cpu that I could move these into?

It's not great, but this code is not important; you can ignore it.

@slaren (Member) commented Jun 25, 2025

This code can be removed now:

// needed to initialize f16 tables
{
    struct ggml_init_params params = { 0, NULL, false };
    struct ggml_context * ctx = ggml_init(params);
    ggml_free(ctx);
}

@taronaeo (Contributor Author) commented Jun 25, 2025

I am seeing consistent failures in the Windows and Server CIs.

Windows CI:

The following tests FAILED:
	 19 - test-thread-safety (NUMERICAL)                    main
	 21 - test-backend-ops (NUMERICAL)                      main

Server CI:

FAILED unit/test_rerank.py::test_rerank - assert 3 == 2

Or it simply fails to start on Windows.

I don't know the ggml codebase well enough to understand why the shift from ggml-base to ggml-cpu causes these problems, but feel free to edit my branch, as my work on the NNPA vector intrinsics is already complete. Otherwise, I'd be happy to keep following directions, but I'll be going in blind 😆

Edit: Alternatively, if you're okay with it, we revert all ggml_table_f32_f16 changes back to ggml-base and handle them in a separate PR, because (correct me if I'm wrong) that specific refactor affects quite a lot of things.

taronaeo added 4 commits June 26, 2025 00:09
we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo <[email protected]>
This reverts commit 2dce119.

Signed-off-by: Aaron Teo <[email protected]>
@taronaeo (Contributor Author) commented Jun 25, 2025

Please let me know how we should proceed with this PR; I hope to close it soon :)

We can either:

  1. Continue refactoring ggml-base to ggml-cpu via this PR (I'll be hands-off for this option; full write access to you); or
  2. Continue refactoring ggml-base to ggml-cpu under your direction (I'll try, but I'll be following blindly due to my limited knowledge of the codebase); or
  3. Revert all ggml_table_f32_f16 changes back to ggml-base, where all CIs are green, and open another PR specifically for that refactor, which someone else will own. (I prefer this option, imo.)

I'll be logging off for the night now; feel free to share your thoughts and let's work this out.

Labels: documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning)