Deepseek 671B unable to run locally (Flatpak) #510

Open
privacyadmin opened this issue Feb 1, 2025 · 6 comments

@privacyadmin

Hi,

I encountered the following error when trying to run DeepSeek 671B on my system.


user@fedora:~$ flatpak run com.jeffser.Alpaca
INFO [main.py | main] Alpaca version: 4.0.0
INFO [connection_handler.py | start] Starting Alpaca's Ollama instance...
INFO [connection_handler.py | start] Started Alpaca's Ollama instance
INFO [connection_handler.py | start] client version is 0.5.7
ERROR [window.py | run_message] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Exception in thread Thread-5 (run_message):
Traceback (most recent call last):
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
ERROR [window.py | generate_chat_title] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
response.begin()
File "/usr/lib/python3.12/http/client.py", line 331, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
response.begin()
File "/usr/lib/python3.12/http/client.py", line 331, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/share/Alpaca/alpaca/window.py", line 670, in run_message
response = self.ollama_instance.request("POST", "api/chat", json.dumps(data), lambda data, message_element=message_element: message_element.update_message(data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/share/Alpaca/alpaca/connection_handler.py", line 82, in request
response = requests.post(connection_url, headers=self.get_headers(True), data=data, stream=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/python3.12/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/app/share/Alpaca/alpaca/window.py", line 675, in run_message
raise Exception(e)
Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


I am using the integrated Ollama instance, which is shown as Running. No changes or modifications were made in the Ollama Instance section.

System Specifications:
GPU: 4090
RAM: 768GB
OS: Fedora 41 Gnome

I tested with a smaller model (Qwen2 72B) and it had no issue generating a response. This may be because it fits entirely in my 4090 (99% utilization) without spilling over to system RAM, whereas DeepSeek 671B cannot.

Is there a way to disable loading models into VRAM and load them into system RAM only, so I can test this?
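If it helps clarify what I mean, something like the following might work as a test: sending one request straight to the embedded instance with the num_gpu option set to 0, so no layers are offloaded to the GPU and everything stays in system RAM. This is only a sketch, assuming the embedded Ollama is reachable on 127.0.0.1:11435 (the port it uses on my setup), that no extra headers are needed for local requests, and using a placeholder model tag:

```python
# Sketch only: ask the embedded Ollama instance to load the model with zero
# GPU layers so the weights stay in system RAM. Assumes the instance is
# reachable on 127.0.0.1:11435 and accepts plain local requests.
import requests

payload = {
    "model": "deepseek-r1:671b",   # placeholder -- use the tag you actually pulled
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 0},     # 0 offloaded layers -> CPU / system RAM only
}

resp = requests.post("http://127.0.0.1:11435/api/generate", json=payload)
resp.raise_for_status()
print(resp.json().get("response", ""))
```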

@privacyadmin
Author

Sorry, I just noticed there is a debug function. Please refer to the output below.

Couldn't find '/home/user/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIObIEWaCEq49QSa3EgMEFudE9WqAhyBh9rfrPK6Zt/XX

2025/02/01 15:35:36 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:432 msg="total blobs: 11"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

time=2025-02-01T15:35:36.462+08:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11435 (version 0.5.7)"
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-01T15:35:36.463+08:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-02-01T15:35:36.463+08:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-01T15:35:36.798+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-aba95439-9f10-0dc7-c0e8-0c959db9b0a5 library=cuda variant=v11 compute=8.9 driver=0.0 name="" total="23.5 GiB" available="22.7 GiB"
[GIN] 2025/02/01 - 15:35:36 | 200 | 408.876µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/01 - 15:35:36 | 200 | 20.412271ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/02/01 - 15:35:36 | 200 | 21.982926ms | 127.0.0.1 | POST "/api/show"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=server.go:104 msg="system memory" total="754.9 GiB" free="746.7 GiB" free_swap="8.0 GiB"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=5 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="415.7 GiB" memory.required.partial="17.6 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[17.6 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="654.0 MiB" memory.graph.partial="1019.5 MiB"
time=2025-02-01T15:35:48.952+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 --ctx-size 2048 --batch-size 512 --n-gpu-layers 5 --threads 96 --parallel 1 --port 41315"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-02-01T15:35:48.991+08:00 level=INFO source=runner.go:936 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-02-01T15:35:49.040+08:00 level=INFO source=runner.go:937 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=96
time=2025-02-01T15:35:49.040+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:41315"
time=2025-02-01T15:35:49.212+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22986 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 256x20B
llama_model_loader: - kv 3: deepseek2.block_count u32 = 61
llama_model_loader: - kv 4: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 5: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 6: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 7: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 8: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 9: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 10: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 11: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 19: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 22: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 23: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 24: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 25: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 26: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 27: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 28: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = deepseek-v3

@olumolu
Contributor

olumolu commented Feb 1, 2025

Can you share the system configuration that you have?

@privacyadmin
Author

Sure.

System Specifications:
CPU: 7995wx
GPU: 4090
RAM: 768GB
OS: Fedora 41 Gnome

I also tested with DeepSeek 70B, which is about 40GB, and it runs successfully: it uses 20GB to 22GB of my VRAM and the remainder overflows properly into my system RAM (~22GB).

@privacyadmin
Author

[Image attached]

This is with Llama3.3 70B, which is about 75GB.

DeepSeek 671B is only about 400GB, which should still be manageable within my RAM capacity.

@olumolu
Contributor

olumolu commented Feb 1, 2025

Have you tried with ollama directly?

@CodingKoalaGeneral

Have you tried with ollama directly?

That would interest me as well.

A feature to disable CPU fallback (GPU only) or to force CPU-only usage (globally or per model) would be handy. Occasionally, the app partially loads a model into VRAM, then fails (the VRAM is not freed until the app is restarted) and falls back to CPU, requiring repeated manual termination. I have not tested the latest releases in that regard.
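For completeness, here is roughly what testing against a standalone ollama serve (bypassing Alpaca entirely) could look like. This is only a sketch: it assumes a system-wide instance listening on the default 127.0.0.1:11434, started with OLLAMA_MODELS pointed at the Flatpak's model directory from the debug log above so the blobs are not downloaded twice, and it uses a placeholder model tag.

```python
# Sketch for exercising a standalone `ollama serve` directly, without Alpaca.
# Assumptions: the server listens on the default 127.0.0.1:11434 and was
# started with OLLAMA_MODELS pointing at the Flatpak's model directory
# (/home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models), so the same
# blobs are visible. The model tag below is a placeholder.
import requests

BASE = "http://127.0.0.1:11434"

# 1. Confirm the standalone server sees the existing models.
print(requests.get(f"{BASE}/api/tags", timeout=10).json())

# 2. Send one non-streaming chat request.
resp = requests.post(
    f"{BASE}/api/chat",
    json={
        "model": "deepseek-r1:671b",  # placeholder -- use the tag you actually pulled
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["message"]["content"])

# 3. /api/ps reports what is currently loaded and how much of it ended up
#    in VRAM, which helps when checking GPU-only vs. CPU-only behaviour.
print(requests.get(f"{BASE}/api/ps", timeout=10).json())
```

If the same chat request fails here too, that would point at Ollama itself rather than Alpaca's wrapper.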
