Deepseek 671B unable to run locally (Flatpak) #510

Open
privacyadmin opened this issue Feb 1, 2025 · 6 comments

@privacyadmin

Hi,

I encountered the following error when trying to run DeepSeek 671B on my system.


user@fedora:~$ flatpak run com.jeffser.Alpaca
INFO [main.py | main] Alpaca version: 4.0.0
INFO [connection_handler.py | start] Starting Alpaca's Ollama instance...
INFO [connection_handler.py | start] Started Alpaca's Ollama instance
INFO [connection_handler.py | start] client version is 0.5.7
ERROR [window.py | run_message] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Exception in thread Thread-5 (run_message):
Traceback (most recent call last):
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
ERROR [window.py | generate_chat_title] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
response.begin()
File "/usr/lib/python3.12/http/client.py", line 331, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
response.begin()
File "/usr/lib/python3.12/http/client.py", line 331, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/share/Alpaca/alpaca/window.py", line 670, in run_message
response = self.ollama_instance.request("POST", "api/chat", json.dumps(data), lambda data, message_element=message_element: message_element.update_message(data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/share/Alpaca/alpaca/connection_handler.py", line 82, in request
response = requests.post(connection_url, headers=self.get_headers(True), data=data, stream=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/app/share/Alpaca/alpaca/window.py", line 675, in run_message
raise Exception(e)
Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


I am using the integrated Ollama instance, which is shown as Running. No changes or modifications were made in the Ollama Instance section.

System Specifications:
GPU: 4090
RAM: 768GB
OS: Fedora 41 Gnome

I tested with a smaller model (Qwen2 72B) and it had no issue generating a response. This may be because it fits entirely in my 4090 (99% utilization) without spilling over to system RAM, whereas DeepSeek 671B cannot.

Is there a way to disable loading models into VRAM and load them into system RAM only, so I can test this?
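If it helps clarify what I mean, something like the following might work as a test: sending one request straight to the embedded instance with the num_gpu option set to 0, so no layers are offloaded to the GPU and everything stays in system RAM. This is only a sketch, assuming the embedded Ollama is reachable on 127.0.0.1:11435 (the port it uses on my setup), that no extra headers are needed for local requests, and using a placeholder model tag:

```python
# Sketch only: ask the embedded Ollama instance to load the model with zero
# GPU layers so the weights stay in system RAM. Assumes the instance is
# reachable on 127.0.0.1:11435 and accepts plain local requests.
import requests

payload = {
    "model": "deepseek-r1:671b",   # placeholder -- use the tag you actually pulled
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 0},     # 0 offloaded layers -> CPU / system RAM only
}

resp = requests.post("http://127.0.0.1:11435/api/generate", json=payload)
resp.raise_for_status()
print(resp.json().get("response", ""))
```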

@privacyadmin
Author

Sorry, I just noticed there is a debug function. Please refer to the output below.

Couldn't find '/home/user/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIObIEWaCEq49QSa3EgMEFudE9WqAhyBh9rfrPK6Zt/XX

2025/02/01 15:35:36 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:432 msg="total blobs: 11"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

time=2025-02-01T15:35:36.462+08:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11435 (version 0.5.7)"
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-01T15:35:36.463+08:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-02-01T15:35:36.463+08:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-01T15:35:36.798+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-aba95439-9f10-0dc7-c0e8-0c959db9b0a5 library=cuda variant=v11 compute=8.9 driver=0.0 name="" total="23.5 GiB" available="22.7 GiB"
[GIN] 2025/02/01 - 15:35:36 | 200 | 408.876µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/01 - 15:35:36 | 200 | 20.412271ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/02/01 - 15:35:36 | 200 | 21.982926ms | 127.0.0.1 | POST "/api/show"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=server.go:104 msg="system memory" total="754.9 GiB" free="746.7 GiB" free_swap="8.0 GiB"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=5 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="415.7 GiB" memory.required.partial="17.6 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[17.6 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="654.0 MiB" memory.graph.partial="1019.5 MiB"
time=2025-02-01T15:35:48.952+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 --ctx-size 2048 --batch-size 512 --n-gpu-layers 5 --threads 96 --parallel 1 --port 41315"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-02-01T15:35:48.991+08:00 level=INFO source=runner.go:936 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-02-01T15:35:49.040+08:00 level=INFO source=runner.go:937 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=96
time=2025-02-01T15:35:49.040+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:41315"
time=2025-02-01T15:35:49.212+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22986 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 256x20B
llama_model_loader: - kv 3: deepseek2.block_count u32 = 61
llama_model_loader: - kv 4: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 5: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 6: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 7: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 8: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 9: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 10: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 11: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 19: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 22: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 23: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 24: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 25: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 26: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 27: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 28: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = deepseek-v3

@olumolu
Contributor

olumolu commented Feb 1, 2025

Can you share the system configuration that you have?

@privacyadmin
Author

Sure.

System Specifications:
CPU: 7995wx
GPU: 4090
RAM: 768GB
OS: Fedora 41 Gnome

I also tested with DeepSeek 70B, which is about 40GB, and it runs successfully: it uses 20GB to 22GB of my VRAM and the remainder overflows properly into my system RAM (~22GB).

@privacyadmin
Author

[Image attached]

This is with Llama3.3 70B, which is about 75GB.

DeepSeek 671B is only about 400GB, which should still be manageable within my RAM capacity.

@olumolu
Contributor

olumolu commented Feb 1, 2025

Have you tried with ollama directly?

@CodingKoalaGeneral

Have you tried with ollama directly?

That would interest me as well.

A feature to disable CPU fallback (GPU only) or to force CPU-only usage (globally or per model) would be handy. Occasionally, the app partially loads a model into VRAM, then fails (the VRAM is not freed until the app is restarted) and falls back to CPU, requiring repeated manual termination. I have not tested the latest releases in that regard.
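For completeness, here is roughly what testing against a standalone ollama serve (bypassing Alpaca entirely) could look like. This is only a sketch: it assumes a system-wide instance listening on the default 127.0.0.1:11434, started with OLLAMA_MODELS pointed at the Flatpak's model directory from the debug log above so the blobs are not downloaded twice, and it uses a placeholder model tag.

```python
# Sketch for exercising a standalone `ollama serve` directly, without Alpaca.
# Assumptions: the server listens on the default 127.0.0.1:11434 and was
# started with OLLAMA_MODELS pointing at the Flatpak's model directory
# (/home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models), so the same
# blobs are visible. The model tag below is a placeholder.
import requests

BASE = "http://127.0.0.1:11434"

# 1. Confirm the standalone server sees the existing models.
print(requests.get(f"{BASE}/api/tags", timeout=10).json())

# 2. Send one non-streaming chat request.
resp = requests.post(
    f"{BASE}/api/chat",
    json={
        "model": "deepseek-r1:671b",  # placeholder -- use the tag you actually pulled
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["message"]["content"])

# 3. /api/ps reports what is currently loaded and how much of it ended up
#    in VRAM, which helps when checking GPU-only vs. CPU-only behaviour.
print(requests.get(f"{BASE}/api/ps", timeout=10).json())
```

If the same chat request fails here too, that would point at Ollama itself rather than Alpaca's wrapper.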
