You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As today ggml force aborts the process whenever there is a cuda malloc failure: eg:
#2 0x00007f99c75ca66e in ggml_abort.cold ()
#3 0x00007f99c7b57882 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#4 0x00007f99c7b5ae80 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#5 0x00007f99c7b648a6 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#6 0x00007f99c7b6a2e8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7 0x00007f99c7b21715 in ggml_backend_sched_graph_compute_async () from /home/nbuild/pub/xmt/latest/lib/libsdl-xnn-ggml.so
This is not ideal for some production context in which we need to have a controlled way to return an OOM error and exit/reload/resume/skip gracefully.
Would you mind if I:
add an option (eg GGML_NO_ABORT_ON_OOM) to skip abort if malloc failures
return a GGML_STATUS_ALLOC_FAILED to upper calls in the stack (ggml_cuda_mul_mat, ...) if cuda_malloc failed
?
Note:
ggml would still have same behavior as today: abort in all cases
this would be just for malloc failures: would still abort in all other cases.
Best
W.
The text was updated successfully, but these errors were encountered:
The goal of returning an error instead of crashing is good, we definitely need to do that. The problem is that much of the existing code does not check the error code returned by the graph_compute functions, so making this change could cause existing code to start failing silently in strange ways. I imagine that's where the idea of gating this behind a compile time switch comes from. Ideally we would at least fix the existing code in llama.cpp to make sure that it always check for errors, but if the code is good and ensures that the CUDA backend never ends in a inconsistent state due to this, the change as proposed here would still be good enough to merge.
As today ggml force aborts the process whenever there is a cuda malloc failure: eg:
This is not ideal for some production context in which we need to have a controlled way to return an OOM error and exit/reload/resume/skip gracefully.
Would you mind if I:
?
Note:
Best
W.
The text was updated successfully, but these errors were encountered: