RuntimeError: CUDA error: out of memory #2524

Closed
DiegoRRR opened this issue Dec 31, 2024 · 6 comments

Comments

@DiegoRRR

DiegoRRR commented Dec 31, 2024

Sometimes it works for hours without this issue, and sometimes I get this crash every few minutes...
I don't run in batch, only one image at a time; the resolution is lower than 1024x1024, and my GPU has 12 GB of VRAM, which should be more than enough for SDXL. I also tried adding --medvram to COMMANDLINE_ARGS, but that did not solve the issue.
Can someone help, please?
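For reference, here is a minimal check of how much VRAM is actually free, run from the Forge Python environment. It is just a sketch using torch.cuda.mem_get_info, the same call the MemMon thread crashes on in the log below:

import torch

def report_vram(device_index: int = 0) -> None:
    # mem_get_info returns (free_bytes, total_bytes) for the given CUDA device;
    # this is the same call that raises "CUDA error: out of memory" in memmon.py.
    free, total = torch.cuda.mem_get_info(device_index)
    print(f"cuda:{device_index}  free: {free / 1024**2:.0f} MB  "
          f"used: {(total - free) / 1024**2:.0f} MB  total: {total / 1024**2:.0f} MB")

report_vram()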

(...)
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 3516.46 MB ... Done.
100%|██████████████████████████████████████████| 26/26 [00:15<00:00,  1.66it/s]
[Unload] Trying to free 3369.09 MB for cuda:0 with 1 models keep loaded ... Current free memory is 3506.33 MB ... Done.
Exception in thread MemMon:
Traceback (most recent call last):
  File "threading.py", line 1016, in _bootstrap_inner
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\memmon.py", line 53, in run
    free, total = self.cuda_mem_get_info()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\memmon.py", line 34, in cuda_mem_get_info
    return torch.cuda.mem_get_info(index)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 663, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules_forge\main_thread.py", line 30, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\txt2img.py", line 123, in txt2img_function
    processed = processing.process_images(p)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 817, in process_images
    res = process_images_inner(p)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 960, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 1353, in sample
    return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 1462, in sample_hr_pass
    decoded_samples = decode_latent_batch(self.sd_model, samples, target_device=devices.cpu, check_for_nans=True)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 627, in decode_latent_batch
    samples_pytorch = decode_first_stage(model, batch).to(target_device)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\sd_samplers_common.py", line 82, in decode_first_stage
    return samples_to_images_tensor(x, approx_index, model)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\sd_samplers_common.py", line 65, in samples_to_images_tensor
    x_sample = model.decode_first_stage(sample)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\diffusion_engine\sdxl.py", line 132, in decode_first_stage
    sample = self.forge_objects.vae.decode(sample).movedim(-1, 1) * 2.0 - 1.0
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\patcher\vae.py", line 153, in decode
    return self.decode_inner(samples_in)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\patcher\vae.py", line 142, in decode_inner
    pixel_samples[x:x + batch_number] = torch.clamp((self.first_stage_model.decode(samples).to(self.output_device).float() + 1.0) / 2.0, min=0.0, max=1.0)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 309, in decode
    x = self.decoder(z)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 261, in forward
    h = self.up[i_level].upsample(h)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 56, in forward
    x = self.conv(x)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\operations.py", line 170, in forward
    return super()._conv_forward(x, weight, bias)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\route_utils.py", line 285, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\utils.py", line 818, in wrapper
    response = f(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\call_queue.py", line 90, in f
    devices.torch_gc()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\devices.py", line 39, in torch_gc
    memory_management.soft_empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\memory_management.py", line 1197, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 159, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\route_utils.py", line 285, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\utils.py", line 818, in wrapper
    response = f(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\call_queue.py", line 90, in f
    devices.torch_gc()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\devices.py", line 39, in torch_gc
    memory_management.soft_empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\memory_management.py", line 1197, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 159, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@MisterChief95

Can you share your startup logs, up to the part where you supply command-line args? Like so:

Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-633-ge073e4ec
Commit hash: e073e4ec581c803cbc71003f6d3261d37ec43840
Total VRAM 24564 MB, total RAM 65298 MB
pytorch version: 2.5.1+cu124
xformers version: 0.0.28.post3
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using xformers cross attention
Using xformers attention for VAE
Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run `pip install insightface` manually.
Launching Web UI with arguments: --cuda-malloc --cuda-stream --skip-python-version-check --skip-version-check --skip-google-blockly --pin-shared-memory

@DiegoRRR
Author

DiegoRRR commented Jan 5, 2025

Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-535-gb20cb4bf
Commit hash: b20cb4bf0e526f890fcd40a4d039da581cfebafa
Launching Web UI with arguments: --ckpt-dir 'E:\StableDiffusion\'
You are using PyTorch below version 2.3. Some optimizations will be disabled.
Total VRAM 12288 MB, total RAM 65445 MB
pytorch version: 2.1.2+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\transformers\utils\hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\onnxruntime\capi\onnxruntime_validation.py:26: UserWarning: Unsupported Windows version (7). ONNX Runtime supports Windows 10 and above, only.
  warnings.warn(
Using pytorch cross attention
Using pytorch attention for VAE
==============================================================================
You are running torch 2.1.2+cu118.
The program is tested to work with torch 2.3.1.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
ControlNet preprocessor location: D:\apps\stable-diffusion\Forge_2024\webui\models\ControlNetPreprocessor
[-] ADetailer initialized. version: 24.11.1, num models: 12
2025-01-05 18:36:58,643 - ControlNet - INFO - ControlNet UI callback registered.

Model selected: {'checkpoint_info': {'filename': 'E:\\StableDiffusion\\xl\\toon_illustrious__zukiCuteILL_v25.safetensors', 'hash': '22c26081'}, 'additional_modules': [], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 34.2s (prepare environment: 6.1s, import torch: 10.7s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 6.7s, create ui: 7.6s, gradio launch: 2.1s).

@MisterChief95

A couple more questions:

  1. Are you trying to run Flux or some other model?
  2. Could you tell me what value the GPU Weights slider at the top of the screen is set to? It looks like this:
    [screenshot of the GPU Weights slider]

That said, there are a few dependency/software issues on your end that could be contributing:

  1. If possible, upgrade your Torch version to at least 2.3.1.
  2. That ONNX warning says you're running Windows 7. It could be a false positive, but if it isn't:
    • Python 3.10 is not officially supported on that OS
    • The latest available Nvidia driver is v475.14

A combination of these could be causing some problems.

@DiegoRRR
Author

DiegoRRR commented Jan 8, 2025

  1. I use XL Illustrious models.

  2. GPU Weights says "11264". I was in "sd" mode though, so I could not find the slider at first; I had to switch to "xl" mode.

  3. I can't upgrade Torch any further, nor CUDA or the driver; they have all dropped support for Windows 7. (I can't upgrade Windows or switch to Linux for now due to compatibility issues with other software and hardware on this system.)

  4. On their repository they say they fixed the incompatibility, but the warning still shows as a false positive.

I remember I used to get a similar error when the page file was disabled; it was fixed when I enabled the page file. I checked, and it is still enabled, so I don't know why the error came back.

@MisterChief95

My suggestion is to drop GPU Weights to ~8,000. The reason is that you currently allow the model to use 11,264 MB of VRAM, which leaves almost nothing for inference (which needs ~1.5 GB, and roughly double that for HiResFix). SDXL models only need ~7 GB of VRAM, plus whatever your LoRAs take. On top of that, Windows itself reserves ~0.5-1 GB of VRAM, which doesn't help, so it's going to be a bit of a balancing game.
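Rough math behind that number (just a back-of-the-envelope sketch; the figures are estimates and vary with the model, LoRAs, and whatever else is using the GPU):

# Approximate VRAM budget for a 12 GB card running SDXL in Forge.
# All numbers below are rough estimates, not measurements.
total_vram_mb      = 12288   # RTX 3060 12GB, from the startup log
windows_reserve_mb = 1024    # Windows/desktop typically holds ~0.5-1 GB
inference_mb       = 1536    # ~1.5 GB working memory for inference...
hires_factor       = 2       # ...roughly doubled when HiResFix runs

gpu_weights_mb = total_vram_mb - windows_reserve_mb - inference_mb * hires_factor
print(f"Suggested GPU Weights: ~{gpu_weights_mb} MB")   # ~8192 MB, i.e. about 8,000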

@DiegoRRR
Author

DiegoRRR commented Jan 8, 2025

That seems to have fixed the issue; I've been running it for two hours with no crashes.
Thanks for your help, you are awesome.

@DiegoRRR DiegoRRR closed this as completed Jan 8, 2025