
"model is only supported on GPU" since v1.8.0 #468

Closed
alexkramer98 opened this issue Dec 12, 2024 · 4 comments

alexkramer98 commented Dec 12, 2024

Since v1.8.0, models are not loading and I am seeing this in the logs:

-- 402 -- /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
-- 402 --   return torch._C._cuda_getDeviceCount() > 0
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
-- 402 --   def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
-- 402 --   def backward(ctx, grad_output):
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
-- 402 --   @custom_fwd(cast_inputs=torch.float16)
-- 402 -- 20241212 09:38:10 MODEL STATUS loading model
-- 402 -- Traceback (most recent call last):
-- 402 --   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
-- 402 --     return _run_code(code, main_globals, None,
-- 402 --   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
-- 402 --     exec(code, run_globals)
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_worker.py", line 154, in <module>
-- 402 --     worker_loop(args.model, models_mini_db, supported_models.config, compile=args.compile)
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_worker.py", line 51, in worker_loop
-- 402 --     inference_model = InferenceHF(
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_hf.py", line 148, in __init__
-- 402 --     assert torch.cuda.is_available(), "model is only supported on GPU"
-- 402 -- AssertionError: model is only supported on GPU
-- 294 -- 20241212 09:38:10 WEBUI 172.17.0.1:53830 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 294 -- 20241212 09:38:10 WEBUI 172.17.0.1:53846 - "GET /tab-finetune-config-and-runs HTTP/1.1" 200
20241212 09:38:11 402 finished python -m self_hosting_machinery.inference.inference_worker --model qwen2.5/coder/1.5b/base @:gpu00, retcode 1
/finished compiling -- failed, probably unrecoverable, will not retry

I have an RTX 4060 (Laptop), if that matters at all. v1.7.0 worked fine.
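
For anyone debugging this: the check that fails in inference_hf.py is just torch.cuda.is_available(), so the same probe can be run by hand inside the container. A minimal sketch (assumes you exec into the running Refact container, where python3 and torch are already installed):

# probe_cuda.py -- sketch reproducing the check that fails in inference_hf.py
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
else:
    # When initialization fails, the UserWarning from torch/cuda/__init__.py
    # (Error 804: forward compatibility was attempted on non supported HW)
    # is the real error; the assertion above is only the symptom.
    print("CUDA initialization failed; check driver / container runtime")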

mitya52 (Member) commented Dec 23, 2024

@alexkramer98 hi! It looks like you have a problem with CUDA. Try upgrading your driver to 525.147.05 or higher.
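
A quick way to see which driver version the container itself reports (a sketch, assuming nvidia-smi is on PATH inside the container):

# check_driver.py -- sketch: print the driver/GPU the container actually sees,
# to compare against the suggested 525.147.05 minimum
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
    text=True,
)
print(out.strip())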

mitya52 (Member) commented Jan 16, 2025

@alexkramer98 did you update your NVIDIA drivers? It should help with the issue.

alexkramer98 (Author) commented

Hi @mitya52,

I am on driver 535.216.03. When I downgrade to Refact v1.7.0, everything works as expected.
Further down in the log, I see this too:

-- 310 -- WARNING:root:output was:
-- 310 -- - no output -
-- 310 -- WARNING:root:nvidia-smi does not work, that's especially bad for initial setup.
-- 310 -- WARNING:root:Traceback (most recent call last):
-- 310 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/scripts/enum_gpus.py", line 17, in query_nvidia_smi
-- 310 --     nvidia_smi_output = subprocess.check_output([
-- 310 --   File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
-- 310 --     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
-- 310 --   File "/usr/lib/python3.10/subprocess.py", line 526, in run
-- 310 --     raise CalledProcessError(retcode, process.args,
-- 310 -- subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu', '--format=csv']' returned non-zero exit status 255.

However, when I run docker exec -it [container_id] nvidia-smi, I get the correct output:

Fri Jan 17 09:59:16 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P5               9W /  35W |    735MiB /  8188MiB |     14%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
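
The exact query that enum_gpus.py runs can be reproduced by hand to see the exit status 255 directly; a minimal sketch (the nvidia-smi arguments are copied from the traceback above):

# repro_enum_gpus.py -- sketch of the query enum_gpus.py performs; prints the
# exit status instead of raising, so the failure mode (exit 255) is visible
import subprocess

proc = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu",
     "--format=csv"],
    capture_output=True,
    text=True,
)
print("exit status:", proc.returncode)
print(proc.stdout or proc.stderr)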

alexkramer98 (Author) commented

Solved by upgrading my driver to 550.
