I have four Mac Mini M4 Pro devices, and I want to set up a cluster to run a large language model (LLM) using Exo. I installed Exo in a Conda virtual environment on all my Mac Minis, following this video. I can connect to all nodes via SSH from the main Mac Mini using SSH keys.
At the beginning, all four nodes are detected, and a cluster is successfully created on the main Mac Mini. However, when I start downloading a small LLM, the nodes gradually disappear:
First, all four nodes are listed.
Then, one disappears, leaving three.
Then, another disappears, leaving two.
In the end, only the main Mac Mini remains, and the LLM is downloaded only there.
My Mac Minis are connected to a local network via Ethernet using a switch, while they access the internet over Wi-Fi. They all have IP addresses.
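To check whether the nodes stay reachable over Ethernet while the download is running, a small probe like the one below could be run from the main Mac Mini. This is only a rough sketch under a few assumptions: the IP addresses are placeholders for the nodes' Ethernet addresses, the function names are made up for illustration, and it tests the SSH port (22) simply because SSH between the nodes is known to work. It does not call any Exo API, so it can only show whether the network path drops, not why Exo removes a node.

```python
import socket
import time


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_nodes(hosts, port=22, timeout=2.0):
    """Map each host to its current reachability (True or False)."""
    return {host: is_reachable(host, port, timeout) for host in hosts}


def monitor(hosts, rounds=3, interval=5.0, port=22):
    """Print a timestamped reachability snapshot a fixed number of times."""
    for _ in range(rounds):
        print(time.strftime("%H:%M:%S"), probe_nodes(hosts, port))
        time.sleep(interval)


# Example call (placeholder addresses, not taken from the actual setup):
# monitor(["192.168.1.11", "192.168.1.12", "192.168.1.13"], rounds=60)
```

If a node shows as unreachable at the same moment it disappears from the Exo cluster, the problem is more likely on the network side (for example, traffic switching between the Ethernet and Wi-Fi interfaces) than in Exo itself.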
I have three main questions:
Why does Exo lose nodes during LLM downloading?
Why does Exo install the LLM only on one Mac Mini instead of distributing it across all four? Is there a specific configuration needed for multi-node LLM deployment?
Are these two issues related?
I would appreciate any guidance on setting up Exo for distributed LLM inference across multiple nodes. Thanks in advance!