ggml-cpu: Build variant targeting Neoverse-V2 #14380
Open
+85
−55
As a first improvement on the recently added generic ARM support for GGML_CPU_ALL_VARIANTS, this builds a variant targeting Neoverse-V2 specifically (e.g. Graviton4 or NVIDIA Grace).
The variant is built with `-mcpu=` rather than a generic `-march=`. `GGML_ARM_MCPU` is passed on to the scoring function, which consults `/proc/cpuinfo` on Linux (Graviton4 is Linux-only and I'd guess NVIDIA Grace, too) and uses it in scoring.

In the scoring function, I shifted features to the 9th bit and beyond. The idea is that features are more important than microarchitecture, platform, or anything else, which can use bits 2-8 to rank themselves. Nuances such as the microarchitecture of two variants therefore become relevant in scoring only if the variants otherwise have equal features; in every other case, features win. I thought this might be a useful convention.
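To make the bit layout concrete, here is a minimal sketch of the convention, not the code from this PR. The names `Feature`, `parse_cpu_part`, and `score_variant` are hypothetical, as is the feature list. It also assumes that bits are counted from 1 (so "the 9th bit" is `1 << 8`), that bit 1 is a baseline "usable" flag, that a matching microarchitecture sets bit 2, and that the `CPU part` field of `/proc/cpuinfo` is what identifies the core (Neoverse-V2 reports `0xd4f` there).

```cpp
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical feature flags; the real variant list may differ.
enum Feature : std::uint32_t {
    FEAT_DOTPROD = 1u << 0,
    FEAT_FP16    = 1u << 1,
    FEAT_SVE     = 1u << 2,
    FEAT_I8MM    = 1u << 3,
};

// Assumption: bits are 1-indexed, so "the 9th bit" corresponds to 1 << 8.
constexpr int FEATURE_SHIFT = 8;

// Returns the first "CPU part" value found in /proc/cpuinfo-style text,
// or -1 if no such field exists (e.g. on a non-ARM host).
inline long parse_cpu_part(const std::string & cpuinfo) {
    std::istringstream in(cpuinfo);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("CPU part", 0) != 0) {
            continue; // line does not start with the field name
        }
        const std::string::size_type colon = line.find(':');
        if (colon != std::string::npos) {
            // Base 16 accepts leading whitespace and an optional "0x" prefix.
            return std::strtol(line.c_str() + colon + 1, nullptr, 16);
        }
    }
    return -1;
}

// Scores a variant against the host. A variant that needs a feature the host
// lacks scores 0. Otherwise features occupy bit 9 and up, so any extra feature
// outweighs every tie-breaker stored in bits 2-8.
inline std::uint32_t score_variant(std::uint32_t host_features,
                                   std::uint32_t variant_features,
                                   bool          mcpu_matches) {
    if ((variant_features & ~host_features) != 0) {
        return 0; // variant cannot run on this host
    }
    std::uint32_t score = 1u;                      // bit 1: variant is usable
    if (mcpu_matches) {
        score |= 1u << 1;                          // bit 2: microarchitecture tie-breaker
    }
    score |= variant_features << FEATURE_SHIFT;    // bit 9 and beyond: features
    return score;
}
```

With this layout, two variants with identical features differ only in the low bits, so a matching `-mcpu=` target wins the tie, while a variant with one more supported feature always scores higher regardless of microarchitecture.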
I tested this on Graviton4, where the `neoverse-v2` variant indeed received a higher score than the `armv8.6-a` variant, which would also work for Neoverse-V2 as it is `armv8.6-a`. `neoverse-v2` is also what the `GGML_NATIVE=ON` build targets.

I did not see meaningful benchmark improvements over generic `armv8.6-a`, but I tested only limited models, and only with 4 vCPUs. Some tests ran with 2-3% improvement, but this wasn't always reproducible. I hope to get more AWS resources in July where I can properly test this on a dedicated box.

In any case, I think this would at least serve as an easy-to-copy template for other variants where this might matter more.
This supersedes #14332.