Exo loses nodes during LLM download and does not distribute workload #775

tk-86 · 2025-03-10T15:11:30Z

I have four Mac Mini M4 Pro devices, and I want to set up a cluster to run a large language model (LLM) using Exo. I installed Exo in a Conda virtual environment on all my Mac Minis, following this video. I can connect to all nodes via SSH from the main Mac Mini using SSH keys.

At the beginning, all four nodes are detected, and a cluster is successfully created on the main Mac Mini. However, when I start downloading a small LLM, the nodes gradually disappear:

First, all four nodes are listed.
Then, one disappears, leaving three.
Then, another disappears, leaving two.
In the end, only the main Mac Mini remains, and the LLM is downloaded only there.

My Mac Minis are connected to a local network via Ethernet using a switch, while they access the internet over Wi-Fi. They all have IP addresses.
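Since each machine has two active interfaces (Ethernet and Wi-Fi), one thing worth checking is which local address each node actually uses to reach its peers. The sketch below is a generic Python diagnostic and does not use any Exo API; the peer addresses are placeholders, not values from this setup. It relies on the fact that connecting a UDP socket sends no packets but makes the OS pick a route, so `getsockname()` reveals the outgoing interface address.

```python
import socket
from typing import Dict, List, Optional


def local_ip_for(peer_ip: str, port: int = 9) -> Optional[str]:
    """Return the local address the OS would use to reach peer_ip.

    Connecting a UDP socket does not send any traffic; it only asks the
    kernel to select a route. Returns None if no route is available.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect((peer_ip, port))
            return s.getsockname()[0]
    except OSError:
        return None


def route_report(peers: List[str]) -> Dict[str, Optional[str]]:
    """Map each peer address to the local address used to reach it."""
    return {peer: local_ip_for(peer) for peer in peers}


if __name__ == "__main__":
    # Placeholder addresses (assumption): replace with the Ethernet IPs
    # of the other Mac Minis.
    peers = ["192.168.1.11", "192.168.1.12", "192.168.1.13"]
    for peer, local in route_report(peers).items():
        print(f"{peer} is reached from local address {local}")
```

If the reported local address belongs to the Wi-Fi interface rather than the Ethernet one, traffic between nodes may not be going over the switch as intended.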

I have three main questions:

Why does Exo lose nodes during LLM downloading?
Why does Exo install the large LLM only on one Mac Mini instead of distributing it across all four? Is there a specific configuration needed for multi-node LLM deployment?
Are these two issues related?

I would appreciate any guidance on setting up Exo for distributed LLM inference across multiple nodes. Thanks in advance!
