Can someone help this noob (2 node cluster unresponsive) #788

Open
NotReallyADeveloper opened this issue Mar 17, 2025 · 4 comments
@NotReallyADeveloper

NotReallyADeveloper commented Mar 17, 2025

I am trying to cluster a Mac Mini Pro and a Mac Mini Studio Pro Max, connected via an Apple Thunderbolt 5 cable.

Connectivity between the two machines is fine, and Llama 3.2 3B runs without issues on a single computer. But as soon as I add a second node, the LLM stops responding completely.

MLX is installed and configured on both.

What am I missing here?

@Sunchy389

I have the same question. How can this be solved?

@hefish

hefish commented Mar 20, 2025

I have the same problem. The logs show a gRPC error, so communication between the nodes may be failing.

Traceback (most recent call last):
  File "/Users/hefish/works/exo/exo/orchestration/node.py", line 606, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/works/exo/exo/networking/grpc/grpc_peer_handle.py", line 204, in send_opaque_status
    await asyncio.wait_for(self.stub.SendOpaqueStatus(request), timeout=10.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Received RST_STREAM with error code 7"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-03-20T08:29:02.634141+08:00"}"
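
The numeric codes in a trace like this can be decoded without touching the cluster. The following is a minimal sketch, not part of exo or gRPC: `decode_grpc_error` is a hypothetical helper that pulls the two numbers out of the log text and looks them up in small tables taken from the HTTP/2 specification and the gRPC status code list. Only a subset of each table is included.

```python
import re

# Subset of the HTTP/2 error codes (RFC 9113, section 7).
HTTP2_ERROR_CODES = {
    0: "NO_ERROR",
    1: "PROTOCOL_ERROR",
    2: "INTERNAL_ERROR",
    7: "REFUSED_STREAM",
    8: "CANCEL",
    11: "ENHANCE_YOUR_CALM",
}

# Subset of the gRPC status codes.
GRPC_STATUS_CODES = {
    0: "OK",
    1: "CANCELLED",
    2: "UNKNOWN",
    4: "DEADLINE_EXCEEDED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def decode_grpc_error(log_text):
    """Hypothetical helper: name the RST_STREAM and grpc_status codes in a log."""
    # \s+ tolerates the line wrap that can split "error" and "code" in pasted logs.
    rst = re.search(r"RST_STREAM with error\s+code (\d+)", log_text)
    status = re.search(r"grpc_status:(\d+)", log_text)
    return {
        "http2_error": HTTP2_ERROR_CODES.get(int(rst.group(1)), "UNKNOWN") if rst else None,
        "grpc_status": GRPC_STATUS_CODES.get(int(status.group(1)), "UNKNOWN") if status else None,
    }


if __name__ == "__main__":
    sample = (
        'debug_error_string = "UNKNOWN:Error received from peer '
        '{grpc_message:"Received RST_STREAM with error code 7", grpc_status:14}"'
    )
    print(decode_grpc_error(sample))
```

For this trace the codes map to `REFUSED_STREAM` and `UNAVAILABLE`, meaning the peer reset the stream rather than the call timing out locally. That is consistent with a node-to-node communication problem, though it does not by itself identify the cause.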

@hefish

hefish commented Mar 20, 2025

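I found that downgrading the two modules grpc and grpc-tools to 1.70 solves the problem. Checking the latest grpc release, it was updated to 1.71 on March 11, 2025, which matches my recollection that everything still worked in February. It looks like the grpc component upgrade caused this.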
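
To see whether a given environment is on the affected side of that version boundary, a check along these lines may help. It is a minimal sketch under the assumption, taken only from this thread, that 1.71 and later fail while 1.70 works. The helper names are hypothetical, and the distribution names `grpcio` and `grpcio-tools` are the PyPI names of the gRPC Python packages.

```python
import re
from importlib import metadata

# Assumption from this thread: releases from 1.71 onward show the failure.
FIRST_AFFECTED = (1, 71)


def parse_version(text):
    """Return (major, minor) from a version string such as '1.70.0'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version_text):
    """True if the version is at or above the first release reported as failing."""
    return parse_version(version_text) >= FIRST_AFFECTED


def check_installed(packages=("grpcio", "grpcio-tools")):
    """Map each package name to (version, affected), or None if not installed."""
    report = {}
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = (version, is_affected(version))
    return report


if __name__ == "__main__":
    for name, info in check_installed().items():
        if info is None:
            print(f"{name}: not installed")
        else:
            version, affected = info
            note = "may be affected, consider pinning to 1.70" if affected else "below 1.71"
            print(f"{name}: {version} ({note})")
```

Run it inside the same environment that launches exo on each node, since the two machines may have resolved different versions.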

@Sunchy389

Is this the right statement for the downgrade? "grpcio==1.67.0",
"grpcio-tools==1.67.0",
