Replies: 1 comment
-
This is how split points are calculated: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-model.cpp#L1528-L1552 and this is how layers are assigned to devices: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-model.cpp#L1560-L1583
-
I went through ggml-rpc.h, ggml-backend.h, ggml-backend-impl.h, and ggml-rpc.cpp, but I wasn't able to find the code that shards the model. Could someone explain where and how the model is split on the client end and offloaded to the rpc-servers?
I assume it uses gguf-split to some extent, but where exactly?