Add option not to abort on cuda malloc errors #1083

WilliamTambellini · 2025-01-23T22:02:17Z

As today ggml force aborts the process whenever there is a cuda malloc failure: eg:

#2  0x00007f99c75ca66e in ggml_abort.cold () 
#3  0x00007f99c7b57882 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () 
#4  0x00007f99c7b5ae80 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#5  0x00007f99c7b648a6 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#6  0x00007f99c7b6a2e8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  0x00007f99c7b21715 in ggml_backend_sched_graph_compute_async () from /home/nbuild/pub/xmt/latest/lib/libsdl-xnn-ggml.so

This is not ideal for some production context in which we need to have a controlled way to return an OOM error and exit/reload/resume/skip gracefully.
Would you mind if I:

add an option (eg GGML_NO_ABORT_ON_OOM) to skip abort if malloc failures
return a GGML_STATUS_ALLOC_FAILED to upper calls in the stack (ggml_cuda_mul_mat, ...) if cuda_malloc failed

?

Note:

ggml would still have same behavior as today: abort in all cases
this would be just for malloc failures: would still abort in all other cases.

Best
W.

The text was updated successfully, but these errors were encountered:

slaren · 2025-01-24T23:19:47Z

The goal of returning an error instead of crashing is good, we definitely need to do that. The problem is that much of the existing code does not check the error code returned by the graph_compute functions, so making this change could cause existing code to start failing silently in strange ways. I imagine that's where the idea of gating this behind a compile time switch comes from. Ideally we would at least fix the existing code in llama.cpp to make sure that it always check for errors, but if the code is good and ensures that the CUDA backend never ends in a inconsistent state due to this, the change as proposed here would still be good enough to merge.

WilliamTambellini · 2025-01-28T17:20:16Z

Tks Diego.
Preparing a minimalist PR for early review if you dont mind.

WilliamTambellini · 2025-02-11T21:05:54Z

Ok @slaren Diego, ready for a 1st review please:
#1110
Tks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option not to abort on cuda malloc errors #1083

Add option not to abort on cuda malloc errors #1083

WilliamTambellini commented Jan 23, 2025 •

edited

Loading

slaren commented Jan 24, 2025 •

edited

Loading

WilliamTambellini commented Jan 28, 2025

WilliamTambellini commented Feb 11, 2025

Add option not to abort on cuda malloc errors #1083

Add option not to abort on cuda malloc errors #1083

Comments

WilliamTambellini commented Jan 23, 2025 • edited Loading

slaren commented Jan 24, 2025 • edited Loading

WilliamTambellini commented Jan 28, 2025

WilliamTambellini commented Feb 11, 2025

WilliamTambellini commented Jan 23, 2025 •

edited

Loading

slaren commented Jan 24, 2025 •

edited

Loading