RuntimeError: CUDA error: out of memory #2524

Closed
DiegoRRR opened this issue Dec 31, 2024 · 6 comments

Comments

@DiegoRRR

DiegoRRR commented Dec 31, 2024

Sometimes it works for hours without this issue, and sometimes I get this crash every few minutes...
I don't run in batch, only one image at a time; the resolution is lower than 1024x1024, and my GPU has 12 GB of VRAM, which should be more than enough for SDXL. I also tried adding --medvram to COMMANDLINE_ARGS, but that did not solve the issue.
Can someone help, please?
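For reference, here is a minimal check of how much VRAM is actually free, run from the Forge Python environment. It is just a sketch using torch.cuda.mem_get_info, the same call the MemMon thread crashes on in the log below:

import torch

def report_vram(device_index: int = 0) -> None:
    # mem_get_info returns (free_bytes, total_bytes) for the given CUDA device;
    # this is the same call that raises "CUDA error: out of memory" in memmon.py.
    free, total = torch.cuda.mem_get_info(device_index)
    print(f"cuda:{device_index}  free: {free / 1024**2:.0f} MB  "
          f"used: {(total - free) / 1024**2:.0f} MB  total: {total / 1024**2:.0f} MB")

report_vram()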

(...)
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 3516.46 MB ... Done.
100%|██████████████████████████████████████████| 26/26 [00:15<00:00,  1.66it/s]
[Unload] Trying to free 3369.09 MB for cuda:0 with 1 models keep loaded ... Current free memory is 3506.33 MB ... Done.
Exception in thread MemMon:
Traceback (most recent call last):
  File "threading.py", line 1016, in _bootstrap_inner
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\memmon.py", line 53, in run
    free, total = self.cuda_mem_get_info()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\memmon.py", line 34, in cuda_mem_get_info
    return torch.cuda.mem_get_info(index)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 663, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules_forge\main_thread.py", line 30, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\txt2img.py", line 123, in txt2img_function
    processed = processing.process_images(p)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 817, in process_images
    res = process_images_inner(p)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 960, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 1353, in sample
    return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 1462, in sample_hr_pass
    decoded_samples = decode_latent_batch(self.sd_model, samples, target_device=devices.cpu, check_for_nans=True)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\processing.py", line 627, in decode_latent_batch
    samples_pytorch = decode_first_stage(model, batch).to(target_device)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\sd_samplers_common.py", line 82, in decode_first_stage
    return samples_to_images_tensor(x, approx_index, model)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\sd_samplers_common.py", line 65, in samples_to_images_tensor
    x_sample = model.decode_first_stage(sample)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\diffusion_engine\sdxl.py", line 132, in decode_first_stage
    sample = self.forge_objects.vae.decode(sample).movedim(-1, 1) * 2.0 - 1.0
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\patcher\vae.py", line 153, in decode
    return self.decode_inner(samples_in)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\patcher\vae.py", line 142, in decode_inner
    pixel_samples[x:x + batch_number] = torch.clamp((self.first_stage_model.decode(samples).to(self.output_device).float() + 1.0) / 2.0, min=0.0, max=1.0)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 309, in decode
    x = self.decoder(z)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 261, in forward
    h = self.up[i_level].upsample(h)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\nn\vae.py", line 56, in forward
    x = self.conv(x)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\operations.py", line 170, in forward
    return super()._conv_forward(x, weight, bias)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\route_utils.py", line 285, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\utils.py", line 818, in wrapper
    response = f(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\call_queue.py", line 90, in f
    devices.torch_gc()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\devices.py", line 39, in torch_gc
    memory_management.soft_empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\memory_management.py", line 1197, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 159, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\route_utils.py", line 285, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\gradio\utils.py", line 818, in wrapper
    response = f(*args, **kwargs)
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\call_queue.py", line 90, in f
    devices.torch_gc()
  File "D:\apps\stable-diffusion\Forge_2024\webui\modules\devices.py", line 39, in torch_gc
    memory_management.soft_empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\webui\backend\memory_management.py", line 1197, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\torch\cuda\memory.py", line 159, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@MisterChief95

Can you share your startup logs, up to the part where you supply command-line args? Like so:

Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-633-ge073e4ec
Commit hash: e073e4ec581c803cbc71003f6d3261d37ec43840
Total VRAM 24564 MB, total RAM 65298 MB
pytorch version: 2.5.1+cu124
xformers version: 0.0.28.post3
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using xformers cross attention
Using xformers attention for VAE
Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run `pip install insightface` manually.
Launching Web UI with arguments: --cuda-malloc --cuda-stream --skip-python-version-check --skip-version-check --skip-google-blockly --pin-shared-memory

@DiegoRRR
Author

DiegoRRR commented Jan 5, 2025

Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-535-gb20cb4bf
Commit hash: b20cb4bf0e526f890fcd40a4d039da581cfebafa
Launching Web UI with arguments: --ckpt-dir 'E:\StableDiffusion\'
You are using PyTorch below version 2.3. Some optimizations will be disabled.
Total VRAM 12288 MB, total RAM 65445 MB
pytorch version: 2.1.2+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\transformers\utils\hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
D:\apps\stable-diffusion\Forge_2024\system\python\lib\site-packages\onnxruntime\capi\onnxruntime_validation.py:26: UserWarning: Unsupported Windows version (7). ONNX Runtime supports Windows 10 and above, only.
  warnings.warn(
Using pytorch cross attention
Using pytorch attention for VAE
==============================================================================
You are running torch 2.1.2+cu118.
The program is tested to work with torch 2.3.1.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
ControlNet preprocessor location: D:\apps\stable-diffusion\Forge_2024\webui\models\ControlNetPreprocessor
[-] ADetailer initialized. version: 24.11.1, num models: 12
2025-01-05 18:36:58,643 - ControlNet - INFO - ControlNet UI callback registered.

Model selected: {'checkpoint_info': {'filename': 'E:\\StableDiffusion\\xl\\toon_illustrious__zukiCuteILL_v25.safetensors', 'hash': '22c26081'}, 'additional_modules': [], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 34.2s (prepare environment: 6.1s, import torch: 10.7s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 6.7s, create ui: 7.6s, gradio launch: 2.1s).

@MisterChief95

A couple more questions:

  1. Are you trying to run Flux or some other model?
  2. Could you tell me what value the GPU Weights slider at the top of the screen is set to? It looks like this:
    [screenshot of the GPU Weights slider]

That said, there are a few dependency/software issues on your end that could be contributing:

  1. If possible, upgrade your Torch version to at least 2.3.1.
  2. That ONNX warning says you're running Windows 7. It could be a false positive, but if it isn't:
    • Python 3.10 is not officially supported on that OS
    • The latest available Nvidia driver is v475.14

A combination of these could be causing some problems.

@DiegoRRR
Author

DiegoRRR commented Jan 8, 2025

  1. I use XL Illustrious models.

  2. GPU Weights says "11264". I was in "sd" mode though, so I could not find the slider at first; I had to switch to "xl" mode.

  3. I can't upgrade Torch any further, nor CUDA or the driver; they have all dropped support for Windows 7. (I can't upgrade Windows or switch to Linux for now due to compatibility issues with other software and hardware on this system.)

  4. On their repository they say they fixed the incompatibility, but the warning still shows as a false positive.

I remember I used to get a similar error when the page file was disabled; it was fixed when I enabled the page file. I checked, and it is still enabled, so I don't know why the error came back.

@MisterChief95

My suggestion is to drop GPU Weights to ~8,000. The reason is that you currently allow the model to use 11,264 MB of VRAM, which leaves almost nothing for inference (which needs ~1.5 GB, and roughly double that for HiResFix). SDXL models only need ~7 GB of VRAM, plus whatever your LoRAs take. On top of that, Windows itself reserves ~0.5-1 GB of VRAM, which doesn't help, so it's going to be a bit of a balancing game.
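Rough math behind that number (just a back-of-the-envelope sketch; the figures are estimates and vary with the model, LoRAs, and whatever else is using the GPU):

# Approximate VRAM budget for a 12 GB card running SDXL in Forge.
# All numbers below are rough estimates, not measurements.
total_vram_mb      = 12288   # RTX 3060 12GB, from the startup log
windows_reserve_mb = 1024    # Windows/desktop typically holds ~0.5-1 GB
inference_mb       = 1536    # ~1.5 GB working memory for inference...
hires_factor       = 2       # ...roughly doubled when HiResFix runs

gpu_weights_mb = total_vram_mb - windows_reserve_mb - inference_mb * hires_factor
print(f"Suggested GPU Weights: ~{gpu_weights_mb} MB")   # ~8192 MB, i.e. about 8,000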

@DiegoRRR
Author

DiegoRRR commented Jan 8, 2025

That seems to have fixed the issue; I've been running it for two hours with no crashes.
Thanks for your help, you are awesome.

@DiegoRRR DiegoRRR closed this as completed Jan 8, 2025