add GGML_USE_NUMA_MIGRATE feature to optimize cross NUMA op computation #13649
Conversation
load model error: "not enough space in the buffer"
The issue is likely caused by the GGML_PAD page-size alignment: the overall size then exceeds the size originally allocated in ggml_backend_cpu_buffer_type_alloc_buffer().
model: Qwen3-30B-A3B-Q4_K_M
OK, it seems the issue is also platform related; could you please retry the latest commit? I added num_tensors * page_size to the allocated size, which should address the issue here. It should not have a big impact on memory consumption; for example, Qwen3-235B-A22B Q8_0 has 1131 tensors, so the increased memory would be 1131 * 4 KiB ≈ 4.4 MiB.
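As an illustration of the accounting above, here's a minimal standalone sketch (not the PR's code; `pad_to()` is a stand-in for GGML_PAD-style rounding): each tensor can grow by at most one page when rounded up to a page boundary, so reserving an extra num_tensors * page_size bytes always covers the padding.

```c
#include <stddef.h>
#include <stdio.h>

// Round size up to the next multiple of align (align must be a power of two),
// mirroring what page-size padding does to each tensor's allocation.
static size_t pad_to(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}

int main(void) {
    const size_t page_size = 4096;
    size_t sizes[] = { 5000, 4096, 12289 };   // example tensor sizes in bytes
    size_t n_tensors = sizeof(sizes) / sizeof(sizes[0]);

    size_t unpadded = 0, padded = 0;
    for (size_t i = 0; i < n_tensors; i++) {
        unpadded += sizes[i];
        padded   += pad_to(sizes[i], page_size);
    }

    // Worst case each tensor gains just under one page, so n_tensors * page_size
    // is a safe upper bound on the extra space, e.g. 1131 * 4 KiB ~= 4.4 MiB.
    printf("unpadded = %zu, padded = %zu, reserved headroom = %zu\n",
           unpadded, padded, n_tensors * page_size);
    return 0;
}
```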
The good news is that it seems to have passed buffer allocation. What's the backtrace for the segfault?
debug log info
Do you have a core dump generated? If so, you could open it with gdb to see the backtrace.
It would be better if you could make a debug build and then attach gdb to it.
BTW, what's the command you are running for your test?
llama-bench -m Qwen3-30B-A3B-Q4_K_M.gguf -t 50 -p 512 -n 128 --numa migrate --verbose
So the cgraph when running llama-bench is null; I've fixed it in the latest commit. Could you please give it a try?
OK, so we can see an obvious performance uplift in the pp512 test, though there's no improvement for tg128 (my guess is that cross-NUMA traffic isn't the bottleneck in that case).
…rier as the barrier only happens in few cores
I saw there were 2 runs with the feature: the first is at 29.51 t/s and the second at 34.63 t/s, while the base data is stable at 29.72 t/s. Could you rerun pp512 with the feature to see whether it's stable at ~34 t/s? If so, we'd observe a ~17% uplift. In any case, x86 has better cross-NUMA traffic latency than aarch64.
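As a quick check of those figures: (34.63 − 29.72) / 29.72 ≈ 0.165, i.e. roughly a 16.5% uplift over the 29.72 t/s baseline, which is where the ~17% number comes from, assuming the 34.63 t/s run is reproducible.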
Hi @Warshmao, will you do more testing with pp512? Or do you have an aarch64 system to verify it?
And for aarch64, there's a 16.8% uplift for pp512 and a 25.9% uplift for tg128 when using 64 threads as the baseline (the platform has 64 cores per NUMA node, and on aarch64 the baseline reaches its best performance with 64 threads). Comparing against 128 threads as the baseline instead, that's a 27.9% and 140% uplift for pp512 and tg128, respectively:
So the feature improves performance significantly on aarch64 platforms.
@Nor7th thanks for your review.
No, it's not 100% achieved; it depends on the tensor data. In some models like Llama 3, the tensor data of src0 and dst are basically multiples of two page sizes, so each NUMA node can handle its data locally (for example, node 0 handles page 0 while node 1 handles page 1; with 64 cores on node 0, each core handles 4096 / 64 = 64 bytes). But tensor data whose size is not a multiple of two pages is not handled in this case (its pages won't be moved).
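A minimal sketch of that even-page-distribution idea (hypothetical helper names, not the PR's code; assumes 2 NUMA nodes with 64 cores each and 4 KiB pages, as in the example above):

```c
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE      4096
#define NUM_NUMA_NODES 2
#define CORES_PER_NODE 64

// Round-robin page ownership: node 0 gets page 0, node 1 gets page 1, ...
static int page_owner_node(size_t page_idx) {
    return (int)(page_idx % NUM_NUMA_NODES);
}

// Byte range inside a page that one core of the owning node processes
// (4096 / 64 = 64 bytes per core).
static void core_slice(size_t page_idx, int core_idx, size_t *offset, size_t *len) {
    const size_t bytes_per_core = PAGE_SIZE / CORES_PER_NODE;
    *offset = page_idx * PAGE_SIZE + (size_t)core_idx * bytes_per_core;
    *len    = bytes_per_core;
}

int main(void) {
    size_t off, len;
    core_slice(1, 3, &off, &len);
    printf("page 1 -> node %d; core 3 handles bytes [%zu, %zu)\n",
           page_owner_node(1), off, off + len);
    return 0;
}
```

Tensors that don't split evenly this way skip the path entirely (their pages are not moved), as described above.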
src1 has to be read in its entirety as a buffer during gemv/gemm, but the quantization is not the bottleneck (it's much faster than gemm/gemv), so introducing a NUMA-local wdata speeds up the gemm/gemv. With the src1 optimization:
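A rough sketch of the NUMA-local wdata idea (numa_available(), numa_alloc_local(), and numa_free() are standard libnuma calls, link with -lnuma; the surrounding structure is illustrative, not the PR's implementation):

```c
#include <numa.h>      // libnuma: numa_available(), numa_alloc_local(), numa_free()
#include <stddef.h>

// Per-thread scratch buffer: the quantized copy of src1 is placed on the
// calling thread's own NUMA node, so the following gemm/gemv reads local memory.
void *alloc_local_wdata(size_t wsize) {
    if (numa_available() < 0) {
        return NULL;   // no NUMA support: caller falls back to the shared wdata
    }
    return numa_alloc_local(wsize);   // pages allocated on the current node
}

void free_local_wdata(void *wdata, size_t wsize) {
    if (wdata) {
        numa_free(wdata, wsize);
    }
}
```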
forward_mul_mat_id is not the hotspot, so it's not touched in this case (it wouldn't show an obvious improvement from the change). BTW, I'll create a new PR for the change, rebase the changes into a single commit, and merge with the latest llama.cpp, so I'll close this PR.
Closing this PR; filed a new one with all commits rebased into one and merged with the latest upstream commit:
@wenlujon Thank you for clarifying most of my doubts. Really appreciate that! The only point you might have missed is the "fake" even distribution of tensor data pages:
However, the cases above should make up only a small proportion; the overall improvement is still good enough.
Please note that not only are the sizes of the tensor data multiples of the page size, but the start address of the tensor data is also guaranteed to be allocated page-aligned (see the get-alignment change). So if there are 4 pages in this case, each node would process 2 pages; why would there be padding bytes in the last page?
@Nor7th That's a good catch. With the model I'm using (Llama 3), the sizes for the mul_mat op coming from ggml_backend_buft_get_alloc_size() are page aligned, so migrate_pages_with_cache() migrates their pages; other ops are not intended to be moved, so migrate_pages_with_cache() just bails out:
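A sketch of that migrate-or-bail-out behaviour (hypothetical code, the real migrate_pages_with_cache() is in the PR diff; move_pages(2) is the standard Linux call from <numaif.h>, link with -lnuma):

```c
#include <numaif.h>    // move_pages(), MPOL_MF_MOVE
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

// Migrate the pages of a page-aligned buffer round-robin across num_nodes
// NUMA nodes; anything not page-aligned is left untouched (bail out).
static int maybe_migrate_pages(void *data, size_t size, int num_nodes) {
    const size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

    if ((uintptr_t)data % page_size != 0 || size % page_size != 0) {
        return 0;   // not page-aligned: other ops' tensors are not moved
    }

    size_t n_pages = size / page_size;
    void **pages  = malloc(n_pages * sizeof(*pages));
    int   *nodes  = malloc(n_pages * sizeof(*nodes));
    int   *status = malloc(n_pages * sizeof(*status));
    if (!pages || !nodes || !status) {
        free(pages); free(nodes); free(status);
        return -1;
    }

    for (size_t i = 0; i < n_pages; i++) {
        pages[i] = (char *)data + i * page_size;
        nodes[i] = (int)(i % (size_t)num_nodes);   // round-robin target node
    }

    // pid 0 = current process; MPOL_MF_MOVE only moves pages owned by it.
    long ret = move_pages(0, n_pages, pages, nodes, status, MPOL_MF_MOVE);

    free(pages); free(nodes); free(status);
    return (int)ret;
}
```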
This PR adds the GGML_USE_NUMA_MIGRATE feature to optimize cross-NUMA op computation, since cross-NUMA memory access becomes the bottleneck when spawning threads across NUMA nodes:
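To make the locality argument concrete, here's a simplified illustration (numa_available(), numa_num_configured_nodes(), and numa_run_on_node() are standard libnuma calls; the thread-to-node mapping is an assumption for the example, not necessarily the PR's exact scheme): if worker threads stay on one node and the pages they touch have been migrated to that node, the hot gemm/gemv loops no longer pay cross-NUMA latency.

```c
#include <numa.h>      // libnuma: numa_available(), numa_run_on_node(), ...
#include <stdio.h>

// Illustrative mapping: threads 0..63 run on node 0, threads 64..127 on node 1,
// matching a 64-cores-per-node system like the one used in the tests below.
static int bind_worker_to_node(int thread_idx, int threads_per_node) {
    int node = thread_idx / threads_per_node;
    return numa_run_on_node(node);   // 0 on success, -1 on error
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available, nothing to do\n");
        return 0;
    }
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    return bind_worker_to_node(0, 64) == 0 ? 0 : 1;
}
```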
Test Results
With the feature enabled, tested on Neoverse N2 with multiple NUMA nodes (64 cores per NUMA node), there are performance improvements when running a Llama 3 model.
Base data
running command:
results:
Results with the feature
running command:
results:
For example, there's a 22% S_TG t/s performance improvement with B=1 and a slight S_PP t/s drop (1.8%), for an overall 20.7% S t/s improvement.
Perplexity was also tested; there's no regression with the feature:
Enablement
The feature is disabled by default; here is the guidance for enabling it: