Add an option not to abort on cuda OOM #1110

WilliamTambellini · 2025-02-11T20:43:32Z

Warning: Not ready for merge.
Add option not to abort on cuda OOM but return a ggml_status.
The goal is NOT to be able to continue decoding when OOM but just to do a clean controlled exit at higher level.
Needs cmake GGML_NO_ABORT_ON_OOM=ON (default OFF)
Retouch ggml_tallocr_alloc to return a ggml_status. Retouch init_tensor to return a ggml_status.
Add a bool option for ggml_cuda_error() to abort or not, default true. Add a new macro CUDA_CHECK_NO_ABORT()
Ass a new unit test to check the GGML_NO_ABORT_ON_OOM flow.

slaren

I am not convinced about the approach in the CUDA backend. It will require a lot of changes to change every single CUDA_CHECK. I would consider changing CUDA_CHECK to throw an exception instead, and catching them in the ggml-backend functions. The ggml-backend functions must never leak exceptions, so consider adding a noexcept to all the ggml-backend interface functions when building from C++. This will also require ensuring that every resource is allocated via RAII in an exception-safe manner.

slaren · 2025-02-12T00:42:03Z

include/ggml-alloc.h

@@ -19,7 +19,7 @@ struct ggml_tallocr {
 };

 GGML_API struct ggml_tallocr ggml_tallocr_new(ggml_backend_buffer_t buffer);
-GGML_API void                ggml_tallocr_alloc(struct ggml_tallocr * talloc, struct ggml_tensor * tensor);
+GGML_API enum ggml_status    ggml_tallocr_alloc(struct ggml_tallocr * talloc, struct ggml_tensor * tensor);


I don't think it is necessary to change this function, since it does not allocate any memory itself. All errors from this function can be prevented by ensuring that the buffer has enough space.

ok, reverted

slaren · 2025-02-12T00:43:06Z

src/ggml-alloc.c

@@ -150,6 +155,7 @@ static void remove_allocated_tensor(struct ggml_dyn_tallocr * alloc, size_t offs
 }
 #endif

+// Check with reviewer: could that function returns a ggm_status (offset being an arg)


This function also does not allocate any (physical) memory, it is just calculating offsets within a buffer. If it fails, it means there is a bug somewhere else.

ok, but see it can still abort.

The abort is mostly a sanity check, it cannot happen if everything is working as expect. If it fails, it means there is a serious bug in ggml.

slaren · 2025-02-12T00:45:48Z

src/ggml-alloc.c

+// Returns true on success, false otherwise
+// Check with reviewers: any cons to return a ggml_status?


It would be ok to change the gallocr functions to return a ggml_status.

ok, retouching that PR.

slaren · 2025-02-12T00:47:04Z

src/ggml-backend-impl.h

@@ -44,7 +44,7 @@ extern "C" {
        // base address of the buffer
        void *       (*get_base)     (ggml_backend_buffer_t buffer);
        // (optional) initialize a tensor in the buffer (eg. add tensor extras)
-        void         (*init_tensor)  (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
+        enum ggml_status  (*init_tensor)  (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);


All backends that use this function will need to be updated. It would be preferable to open the PR in llama.cpp since it has much better CI.

ok. To move forward step after step, would you accept a PR in llamacpp with just that init_tensor change?

slaren · 2025-02-12T00:48:59Z

src/ggml-cuda/common.cuh

@@ -79,18 +79,19 @@

 #define GGML_CUDA_MAX_STREAMS 8

-[[noreturn]]
-void ggml_cuda_error(const char * stmt, const char * func, const char * file, int line, const char * msg);
+// Print the error. Will also abort if abort true


I am not sure that the abort parameter is necessary. The cuBLAS functions may also allocate memory and fail (CUBLAS_CHECK).

I would propose to keep the abort bool option since it s up to the developper to decide to allow abort or not.
For cublas, I could add a CUBLAS_CHECK_NO_ABORT() if you d like me too.

As I mentioned below, I do not think there is any case where aborting on a CUDA call failure is acceptable. We must allow applications to deal with these errors, we can't just make their applications disappear without explanation when something unexpected happens.

Hum, at least we agree that the abort is/was too brutal.
I have introduced the abort bool in order to make a difference between cuda failures that are today aborting and oom failures that are aborting too (as today) but for which we dont want to.
At the moment our goal is just to catch ooms, not to handle and forward upward all cuda failures (oom or not).
So you propose to extend the scope of that PR to all cuda failures, right?

It's not necessary to extend the scope of the PR, you can leave the aborts on functions that don't have a way to return an error, like the buffer functions. However you will still need to catch the exceptions and turn them into a GGML_ABORT. In the future we can extended the ggml API to return errors in more conditions. Adding an abort parameter is just going to add a lot of changes that will need to be reverted in the future anyway.

hum, so something like

try { CUDA_CHECK(dosomething()); } catch(std::exception) { GGML_ABORT(); }

would be a nightmare as there are hundred of CUDA_CHECK calls in ggml-cuda.cu.

Would nt it be simpler to add the throw in CUDA_CHECK_GEN

#define CUDA_CHECK_GEN(err, success, error_fn) \ do { \ auto err_ = (err); \ if (err_ != (success)) { \ ggml_cuda_error(#err, __func__, __FILE__, __LINE__, error_fn(err_)); \ } \ throw (err == oom ? std::bad_alloc(...) : std::runtime_error(...)); } while (0)

?

You don't need a try..catch block for every CUDA_CHECK, only one for each ggml-backend interface function. For example:

static void ggml_backend_cuda_buffer_set_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const void * data, size_t offset, size_t size) try { ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context; ggml_cuda_set_device(ctx->device); CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, cudaStreamPerThread)); CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread)); } catch (const std::exception & e) { GGML_ABORT("%s", e.what()); }

If it was an easy refactor, we would have already done it. If you add an abort parameter to every CUDA_CHECK, you will be adding to the work that will need to be done in the future.

src/ggml.c

leok7v · 2025-02-12T05:38:22Z

Same goes for Metal on MacOS. At the moment metal is usable on some pre Apple Silicon Macs (amazingly actually) and on some it just crashes or hangs allocating kernels. Maybe not worth an effort though

WilliamTambellini · 2025-02-12T17:33:21Z

Tks @slaren
Please reply to these questions before I retouch this PR and/or prepare the PR in llama.cpp.

I am not convinced about the approach in the CUDA backend. It will require a lot of changes to change every single CUDA_CHECK. I would consider changing CUDA_CHECK to throw an exception instead, and catching them in the ggml-backend functions. The ggml-backend functions must never leak exceptions, so consider adding a noexcept to all the ggml-backend interface functions when building from C++. This will also require ensuring that every resource is allocated via RAII in an exception-safe manner.

I m ok to retouch the way you prefer it but please be precise in order to save time for everybody.
Do you propose something like:

void ggml_cuda_error(...) {
  ...
  if (THROW_ON_CUDA_ERROR)
    throw std::runtime_error(...);
  else
    GGML_ABORT(...);
}

?
Tks

graehl · 2025-02-12T18:03:57Z

Since a backend cannot allow exceptions to escape (to allow dynamic linking at the backend API boundary), it seems we only have the question of which backend functions need modification to allow non-abort handling of allocation failures. Is it really only ggml_tallocr_alloc and init_tensor, @WilliamTambellini?
I suggest documenting in the new API that the backend is still usable after such a return (i.e. immediately retrying with smaller input is permissible).
Clearly there are many aborts in ggml generally (and the cuda backend arguably) that should stay as aborts in this PR.
@slaren is correct that provisions to clean up on OOM return (formerly abort) would need to be verified and/or created before the cuda backend can correctly return an OOM error instead of aborting.
slaren's idea that exceptions could be used internal to the cuda backend makes sense (and that this implies that cuda backend internal cleanup provisions are RAII+exception safe).
I'm not sure about slaren's suggestion of making every CUDA_CHECK throw an exception, as this increases the number of code paths that need to be made exception safe, considering the overall purpose of this PR is just to support orderly error return for OOM conditions. But if ggml veterans think that there will be additional recoverable-error conditions then this more general approach could be a decent investment.

slaren · 2025-02-12T18:04:09Z

Yes, that's what I am proposing. Throw an exception in the CHECK macros in case of failure, and catch them in the ggml-backend functions that can fail to return an error to the caller.

slaren · 2025-02-12T18:11:34Z

I'm not sure about slaren's suggestion of making every CUDA_CHECK throw an exception, as this increases the number of code paths that need to be made exception safe, considering the overall purpose of this PR is just to support orderly error return for OOM conditions. But if ggml veterans think that there will be additional recoverable-error conditions then this more general approach could be a decent investment.

My opinion is that we should only abort when some pre-condition that is expected to be met by the caller is not. These are programming errors that indicate that ggml is not being used correctly, and usually can be fixed easily. However we should never crash the application just because a CUDA function returns an error - we must always provide applications some way to recover from this, or at least give it a chance to shut down cleanly.

WilliamTambellini · 2025-02-12T18:19:46Z

Yes, that's what I am proposing. Throw an exception in the CHECK macros in case of failure, and catch them in the ggml-backend functions that can fail to return an error to the caller.

hum, your reply is confusing: my question was about retouching the ggml_cuda_error() fn but you are speaking about the CHECK macro.
Again, please lets be precise: are you speaking about:
CUDA_CHECK_GEN(...) ?
CUDA_CHECK(...) ?
Tks

slaren · 2025-02-12T18:22:46Z

I mentioned the CHECK macros because that's what the code uses to check CUDA calls, ggml_cuda_error is just an implementation detail of the CHECK macros. All the CHECK macros call ggml_cuda_error, it's ok to throw the exception from it.

graehl · 2025-02-12T19:41:15Z

Ok, the scope of "all CUDA errors" makes sense, and @slaren is of course correct that inserting a throw in the CUDA_CHECK macros to be caught at or before the backend API level (along with making all users of the macro and their callers exception-safe) would achieve this.

WilliamTambellini · 2025-02-12T20:25:17Z

@slaren we are perhaps moving forward although I would prefer not to extend the scope of that PR to all cuda errors. Now, if you tell me precisely what you prefer, perhaps I could do it.
Do you propose something like:

void ggml_cuda_error(const char * stmt, const char * func, const char * file, int line, const char * msg) {
    int id = -1; // in case cudaGetDevice fails
    (void)cudaGetDevice(&id);

    GGML_LOG_ERROR(GGML_CUDA_NAME " error: %s\n", msg);
    GGML_LOG_ERROR("  current device: %d, in function %s at %s:%d\n", id, func, file, line);
    GGML_LOG_ERROR("  %s\n", stmt);
#ifndef __CUDA_ARCH__
    throw std::runtime_error(msg);
#endif
}

?

slaren · 2025-02-12T20:29:09Z

Yes. To summarize:

Throw exceptions on CUDA error
Catch them in all the ggml-backend interface functions that call CUDA functions that may fail
In cases where returning an error from the function is currently not possible without significant changes to the ggml-backend interface, make it abort anyway after catching the exception

WilliamTambellini · 2025-02-13T17:24:49Z

src/ggml-alloc.c

@@ -865,6 +876,7 @@ static bool ggml_gallocr_needs_realloc(ggml_gallocr_t galloc, struct ggml_cgraph
    return false;
 }

+// Check with reviewers: any cons to return a ggml_status here?


@slaren and here?

WilliamTambellini · 2025-02-13T18:16:38Z

@slaren voila, just pushed pass 2. Atm keeping the legacy behavior to abort on cuda errors but added an envvar to throw instead. Tks

WilliamTambellini · 2025-03-18T01:30:00Z

rebased.
@slaren ?

WilliamTambellini · 2025-03-24T16:58:49Z

@ggerganov ?

Add option not to abort on cuda OOM but throw/return a ggml_status. The goal in this ticket is NOT to be able to continue inference when OOM, but just to do a clean controlled exit at higher level. No change to default behavior (abort). Retouch ggml_tallocr_alloc to return a ggml_status. Add a new (cuda build only) unit test to check the no abort flow (skiped if the envvar GGML_CUDA_NO_ABORT is not set).

WilliamTambellini mentioned this pull request Feb 11, 2025

Add option not to abort on cuda malloc errors #1083

Open

slaren reviewed Feb 12, 2025

View reviewed changes

WilliamTambellini commented Feb 13, 2025

View reviewed changes

WilliamTambellini force-pushed the noabort branch from db9a2af to 5c4d9b6 Compare February 13, 2025 18:14

WilliamTambellini force-pushed the noabort branch from 5c4d9b6 to 8397ae4 Compare February 13, 2025 18:39

WilliamTambellini force-pushed the noabort branch from 8397ae4 to 6d92803 Compare March 10, 2025 22:11

WilliamTambellini requested a review from slaren March 11, 2025 18:04

WilliamTambellini force-pushed the noabort branch from 6d92803 to 81f43a6 Compare April 15, 2025 15:51

		// Returns true on success, false otherwise
		// Check with reviewers: any cons to return a ggml_status?

Add an option not to abort on cuda OOM #1110

Are you sure you want to change the base?

Add an option not to abort on cuda OOM #1110

Uh oh!

Conversation

WilliamTambellini commented Feb 11, 2025

Uh oh!

slaren left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

slaren Feb 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

leok7v commented Feb 12, 2025

Uh oh!

WilliamTambellini commented Feb 12, 2025

Uh oh!

graehl commented Feb 12, 2025

Uh oh!

slaren commented Feb 12, 2025

Uh oh!

slaren commented Feb 12, 2025

Uh oh!

WilliamTambellini commented Feb 12, 2025

Uh oh!

slaren commented Feb 12, 2025

Uh oh!

graehl commented Feb 12, 2025

Uh oh!

WilliamTambellini commented Feb 12, 2025

Uh oh!

slaren commented Feb 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

WilliamTambellini commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WilliamTambellini commented Mar 18, 2025

Uh oh!

WilliamTambellini commented Mar 24, 2025

Uh oh!

Uh oh!

slaren Feb 12, 2025 •

edited

Loading

slaren commented Feb 12, 2025 •

edited

Loading

WilliamTambellini commented Feb 13, 2025 •

edited

Loading