
"model is only supported on GPU" since v1.8.0 #468

Closed
alexkramer98 opened this issue Dec 12, 2024 · 4 comments

alexkramer98 commented Dec 12, 2024

Since v1.8.0, models are not loading and I am seeing this in the logs:

-- 402 -- /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
-- 402 --   return torch._C._cuda_getDeviceCount() > 0
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
-- 402 --   def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
-- 402 --   def backward(ctx, grad_output):
-- 402 -- /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
-- 402 --   @custom_fwd(cast_inputs=torch.float16)
-- 402 -- 20241212 09:38:10 MODEL STATUS loading model
-- 402 -- Traceback (most recent call last):
-- 402 --   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
-- 402 --     return _run_code(code, main_globals, None,
-- 402 --   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
-- 402 --     exec(code, run_globals)
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_worker.py", line 154, in <module>
-- 402 --     worker_loop(args.model, models_mini_db, supported_models.config, compile=args.compile)
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_worker.py", line 51, in worker_loop
-- 402 --     inference_model = InferenceHF(
-- 402 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_hf.py", line 148, in __init__
-- 402 --     assert torch.cuda.is_available(), "model is only supported on GPU"
-- 402 -- AssertionError: model is only supported on GPU
-- 294 -- 20241212 09:38:10 WEBUI 172.17.0.1:53830 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 294 -- 20241212 09:38:10 WEBUI 172.17.0.1:53846 - "GET /tab-finetune-config-and-runs HTTP/1.1" 200
20241212 09:38:11 402 finished python -m self_hosting_machinery.inference.inference_worker --model qwen2.5/coder/1.5b/base @:gpu00, retcode 1
/finished compiling -- failed, probably unrecoverable, will not retry

I have an RTX 4060 (Laptop), if that matters at all. v1.7.0 worked fine.
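
For anyone debugging this: the check that fails in inference_hf.py is just torch.cuda.is_available(), so the same probe can be run by hand inside the container. A minimal sketch (assumes you exec into the running Refact container, where python3 and torch are already installed):

# probe_cuda.py -- sketch reproducing the check that fails in inference_hf.py
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
else:
    # When initialization fails, the UserWarning from torch/cuda/__init__.py
    # (Error 804: forward compatibility was attempted on non supported HW)
    # is the real error; the assertion above is only the symptom.
    print("CUDA initialization failed; check driver / container runtime")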

mitya52 (Member) commented Dec 23, 2024

@alexkramer98 hi! It looks like you have a problem with CUDA. Try upgrading your driver to 525.147.05 or higher.
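
A quick way to see which driver version the container itself reports (a sketch, assuming nvidia-smi is on PATH inside the container):

# check_driver.py -- sketch: print the driver/GPU the container actually sees,
# to compare against the suggested 525.147.05 minimum
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
    text=True,
)
print(out.strip())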

mitya52 (Member) commented Jan 16, 2025

@alexkramer98 did you update your NVIDIA drivers? It should help with the issue.

alexkramer98 (Author) commented

Hi @mitya52,

I am on driver 535.216.03. When I downgrade to Refact v1.7.0, everything works as expected.
Further down in the log, I see this too:

-- 310 -- WARNING:root:output was:
-- 310 -- - no output -
-- 310 -- WARNING:root:nvidia-smi does not work, that's especially bad for initial setup.
-- 310 -- WARNING:root:Traceback (most recent call last):
-- 310 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/scripts/enum_gpus.py", line 17, in query_nvidia_smi
-- 310 --     nvidia_smi_output = subprocess.check_output([
-- 310 --   File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
-- 310 --     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
-- 310 --   File "/usr/lib/python3.10/subprocess.py", line 526, in run
-- 310 --     raise CalledProcessError(retcode, process.args,
-- 310 -- subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu', '--format=csv']' returned non-zero exit status 255.

However, when I run docker exec -it [container_id] nvidia-smi, I get the correct output:

Fri Jan 17 09:59:16 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P5               9W /  35W |    735MiB /  8188MiB |     14%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
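
The exact query that enum_gpus.py runs can be reproduced by hand to see the exit status 255 directly; a minimal sketch (the nvidia-smi arguments are copied from the traceback above):

# repro_enum_gpus.py -- sketch of the query enum_gpus.py performs; prints the
# exit status instead of raising, so the failure mode (exit 255) is visible
import subprocess

proc = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu",
     "--format=csv"],
    capture_output=True,
    text=True,
)
print("exit status:", proc.returncode)
print(proc.stdout or proc.stderr)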

alexkramer98 (Author) commented

Solved by upgrading my driver to 550.
