ggml-cpu: enable IBM NNPA Vector Intrinsics #14317
Conversation
Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 4a9f60c)
Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 8d4a798)
Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 0ff0d65)
Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 2f58bbc)
Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 01b9294)
For some reason, the function is not getting a hit when debugged with gdb; we will need to investigate further. Signed-off-by: Aaron Teo <[email protected]>
There are some conversion failures in NNPA that require the eyes of an IBM STSM. Will create a separate PR to introduce the FP32->FP16 change. Signed-off-by: Aaron Teo <[email protected]>
ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]>
Fallback logic was already implemented, but I was too sleepy to realise. Signed-off-by: Aaron Teo <[email protected]>
Refactored. @slaren PTAL again.
I believe these includes can be removed now, or moved to the CPU backend if necessary: llama.cpp/ggml/src/ggml-impl.h, lines 15 to 29 in 1b23fec.
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return ggml_table_f32_f16[s];
}
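For context, this lookup path only matters on targets without a native FP16 conversion: there, the generic conversion macro falls back to the 64K-entry table. A minimal sketch of how the pieces fit together (the guard and macro shown here illustrate the pre-existing ggml-impl.h pattern, not code from this PR):

// Table-based fallback, used only when no hardware FP16 -> FP32 conversion macro is defined.
#if !defined(GGML_FP16_TO_FP32)
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));  // reinterpret the fp16 value as its raw 16-bit pattern
    return ggml_table_f32_f16[s];      // index the 65536-entry precomputed table
}
#define GGML_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
#endif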
The lookup table ggml_table_f32_f16 is still in ggml-base; it should be moved to ggml-cpu as well, since it is only used in the CPU backend now. The initialization can be done in ggml_cpu_init.
The initialization can be done in ggml_cpu_init.
So move this line (ggml.c, line 1430 in 73e53dc):

ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);

into this code block (llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c, lines 3437 to 3445 in 73e53dc)?
for (int i = 0; i < (1 << 16); ++i) {
    union {
        uint16_t u16;
        ggml_fp16_t fp16;
    } u = {i};
    float f = GGML_FP16_TO_FP32(u.fp16);
    ggml_table_gelu_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_f32(f));
    ggml_table_gelu_quick_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_quick_f32(f));
}
I feel like I'm stepping into dangerous territory 😅
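If it helps, a rough sketch of the merged loop inside ggml_cpu_init could look like this (abbreviated; the table is filled via GGML_COMPUTE_FP16_TO_FP32 before f is derived, so the loop stays correct even on targets where GGML_FP16_TO_FP32 itself resolves to the table lookup):

for (int i = 0; i < (1 << 16); ++i) {
    union {
        uint16_t u16;
        ggml_fp16_t fp16;
    } u = {i};
    // fill the FP16 -> FP32 lookup table first (moved here from ggml_init)
    ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
    // then derive f and fill the GELU tables as before
    float f = GGML_FP16_TO_FP32(u.fp16);
    ggml_table_gelu_f16[i]       = GGML_FP32_TO_FP16(ggml_gelu_f32(f));
    ggml_table_gelu_quick_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_quick_f32(f));
}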
Fixed in the latest push, but I'm unsure if there will be any problems arising from my code change.
Edit: As expected, the following change had problems with the Windows, Vulkan and Server builds. Will revert the patches until I can get a better direction for this 😔
Patch change:
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index 70f32801..ce296898 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -3479,6 +3479,7 @@ void ggml_cpu_init(void) {
ggml_fp16_t fp16;
} u = {i};
float f = GGML_CPU_FP16_TO_FP32(u.fp16);
+ ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
ggml_table_gelu_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_f32(f));
ggml_table_gelu_quick_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_quick_f32(f));
}
diff --git a/ggml/src/ggml-cpu/simd-mappings.h b/ggml/src/ggml-cpu/simd-mappings.h
index 655ab3c6..2f65ccd1 100644
--- a/ggml/src/ggml-cpu/simd-mappings.h
+++ b/ggml/src/ggml-cpu/simd-mappings.h
@@ -137,6 +137,10 @@
}
#endif
+// precomputed f32 table for f16 (256 KB)
+// defined in ggml.c, initialized in ggml_init()
+GGML_API float ggml_table_f32_f16[1 << 16];
+
// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
// so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
// This is also true for POWER9.
diff --git a/ggml/src/ggml-impl.h b/ggml/src/ggml-impl.h
index 8d9bdc74..57761644 100644
--- a/ggml/src/ggml-impl.h
+++ b/ggml/src/ggml-impl.h
@@ -393,10 +393,6 @@ static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
#define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
-// precomputed f32 table for f16 (256 KB)
-// defined in ggml.c, initialized in ggml_init()
-GGML_API float ggml_table_f32_f16[1 << 16];
-
/**
* Converts brain16 to float32.
*
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index f8e7c595..e0e46288 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -1414,27 +1414,6 @@ static inline bool ggml_can_repeat_rows(const struct ggml_tensor * t0, const str
////////////////////////////////////////////////////////////////////////////////
struct ggml_context * ggml_init(struct ggml_init_params params) {
- static bool is_first_call = true;
-
- ggml_critical_section_start();
-
- if (is_first_call) {
- // initialize time system (required on Windows)
- ggml_time_init();
-
- for (int i = 0; i < (1 << 16); ++i) {
- union {
- uint16_t u16;
- ggml_fp16_t fp16;
- } u = {i};
- ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
- }
-
- is_first_call = false;
- }
-
- ggml_critical_section_end();
-
struct ggml_context * ctx = GGML_MALLOC(sizeof(struct ggml_context));
// allow to call ggml_init with 0 size
--
2.39.5 (Apple Git-154)
You would have to move the definition of ggml_table_f32_f16 from ggml.c to somewhere in the CPU backend as well.
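Concretely, that could look something like the sketch below (placement is only a suggestion; ggml-cpu-impl.h as the header is an assumption):

// In a CPU-backend header such as ggml-cpu-impl.h: declaration only.
// Precomputed f32 table for f16 (256 KB), filled once in ggml_cpu_init().
extern float ggml_table_f32_f16[1 << 16];

// In ggml-cpu.c: the single definition of the table.
float ggml_table_f32_f16[1 << 16];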
It is already moved to the CPU backend, but I left the headers there because there is more SIMD code within llama.cpp/ggml/src/ggml-quants.c (lines 5041 to 5055 and lines 5082 to 5096 in 73e53dc). I was wondering if you have a proper place for me to move these into.
Okay, I've been trying to move the
ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
… failures" This reverts commit 32a3533. Signed-off-by: Aaron Teo <[email protected]>
This reverts commit 9e40d98. Signed-off-by: Aaron Teo <[email protected]>
It's not great, but this code is not important, you can ignore it.
ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 9e40d98)
Signed-off-by: Aaron Teo <[email protected]>
This code can be removed now: llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c, lines 3463 to 3468 in 6cebee2.
Signed-off-by: Aaron Teo <[email protected]>
I am seeing a consistent failure in the Windows and Server CIs.
Windows CI:
Server CI:
Or it simply fails to start on Windows. I don't know the ggml codebase well enough to know why the shift from ggml.c to ggml-cpu.c causes this.
Edit: Alternatively, if you're okay with it, we revert all of the ggml_table_f32_f16 changes.
We rely on the variable declaration in ggml-cpu.c instead. Signed-off-by: Aaron Teo <[email protected]>
This reverts commit f71b21d. Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
This reverts commit 2dce119. Signed-off-by: Aaron Teo <[email protected]>
Please let me know how we should proceed with this PR; I hope to close it soon :) We can either,
I'll be logging off for the night now - feel free to share your thoughts and let's work this out.
This pull request aims to enable the IBM NNPA instruction set for IBM z16 and later mainframes on the s390x platform. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.
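For reviewers unfamiliar with NNPA: the facility adds vector instructions on z16 that convert between IEEE FP16, FP32 and the internal DLFLOAT16 format. A rough sketch of what scalar conversion helpers built on those intrinsics can look like follows; the feature macro and the vecintrin.h intrinsic names used here are assumptions from memory, so please check them against the actual diff rather than treating this as the PR's code:

#include <stdint.h>
#if defined(__NNPA__)            // assumed feature-detection macro for the NNPA facility
#include <vecintrin.h>

static inline float nnpa_fp16_to_fp32(uint16_t h) {
    vector unsigned short v_h  = vec_splats(h);                  // broadcast the fp16 bit pattern
    vector unsigned short v_hd = vec_convert_from_fp16(v_h, 0);  // IEEE fp16 -> dlfloat16
    return vec_extend_to_fp32_hi(v_hd, 0)[0];                    // dlfloat16 -> fp32, take lane 0
}

static inline uint16_t nnpa_fp32_to_fp16(float f) {
    vector float v_f    = vec_splats(f);
    vector float v_zero = vec_splats(0.0f);
    vector unsigned short v_hd = vec_round_from_fp32(v_f, v_zero, 0);  // fp32 -> dlfloat16
    vector unsigned short v_h  = vec_convert_to_fp16(v_hd, 0);         // dlfloat16 -> IEEE fp16
    return vec_extract(v_h, 0);
}
#endif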
Note: This PR supersedes #14303 because that implementation was wrong.
Verification
To ensure that this implementation did not break anything, the NNPA instruction set has been tested on the following models:
Performance Results
I used IBM Granite 3.3 for the performance tests. We see a performance improvement of roughly 0.70% for F16 prompt processing and 29.23% for F16 token generation, which is the expected outcome.
Before NNPA Instruction Set
After NNPA Instruction Set
Note
Tests were conducted on an IBM z16 Mainframe with 2 IFLs (4 vCores) and 64 GB Memory on z/VM (Type-2)
The ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 SIMD activations are ready. However, I was unable to find a way to make the s390x platform detection macros usable in ggml-impl.h, so the existing implementation is left in place for now until we can correct it.

Edit 1: Note: This PR contains the ggml-base and ggml-cpu refactor for FP16<->FP32 SIMD as requested in #14317 (comment).

Please review this pull request and consider merging into the main repository. Thank you!