Exo loses nodes during LLM download and does not distribute workload #775

Open
tk-86 opened this issue Mar 10, 2025 · 0 comments

I have four Mac Mini M4 Pro devices and want to set up a cluster to run a large language model (LLM) using Exo. I installed Exo in a Conda virtual environment on each Mac Mini, following this video. I can connect to all nodes via SSH from the main Mac Mini using SSH keys.

Initially, all four nodes are detected and a cluster is successfully created on the main Mac Mini. However, once I start downloading a small LLM, the nodes gradually disappear:

  • First, all four nodes are listed.
  • Then, one disappears, leaving three.
  • Then, another disappears, leaving two.
  • In the end, only the main Mac Mini remains, and the LLM is downloaded only there.

My Mac Minis are connected to each other on a local network over Ethernet through a switch, and they access the internet over Wi-Fi. Each of them has an IP address.
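
To tell whether the nodes stay reachable over the Ethernet link while the download runs, a small polling check run from the main Mac Mini might help. The following is only a minimal sketch: the function names (`is_reachable`, `check_peers`, `watch_peers`) are hypothetical, it does not use or inspect Exo's own discovery mechanism, and it only tests plain TCP reachability on a port chosen by the caller.

```python
import socket
import time
from typing import Dict, Iterable, Tuple

Peer = Tuple[str, int]  # (host, port); values are supplied by the caller


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers: Iterable[Peer], timeout: float = 1.0) -> Dict[Peer, bool]:
    """Check every peer once and map each (host, port) to its reachability."""
    return {(host, port): is_reachable(host, port, timeout) for host, port in peers}


def watch_peers(peers: Iterable[Peer], rounds: int = 10,
                interval: float = 2.0, timeout: float = 1.0) -> None:
    """Poll the peers repeatedly and print one timestamped status line per round."""
    peer_list = list(peers)
    for _ in range(rounds):
        status = check_peers(peer_list, timeout)
        stamp = time.strftime("%H:%M:%S")
        line = ", ".join(
            f"{host}:{port}={'up' if ok else 'DOWN'}"
            for (host, port), ok in status.items()
        )
        print(f"{stamp} {line}")
        time.sleep(interval)
```

For example, since SSH already works between the machines, `watch_peers([("192.168.1.11", 22), ("192.168.1.12", 22), ("192.168.1.13", 22)])` could be left running during the download; the addresses here are placeholders for the Ethernet-side IPs of the other three Mac Minis. If the peers stay "up" while Exo drops them, the problem is more likely in node discovery or interface selection than in basic connectivity.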

I have three main questions:

  1. Why does Exo lose nodes during LLM downloading?
  2. Why does Exo download the LLM only to one Mac Mini instead of distributing it across all four? Is there a specific configuration needed for multi-node LLM deployment?
  3. Are these two issues related?

I would appreciate any guidance on setting up Exo for distributed LLM inference across multiple nodes. Thanks in advance!
