Replies: 1 comment
-
This is how split points are calculated: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-model.cpp#L1528-L1552 and this is how layers are assigned to devices: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-model.cpp#L1560-L1583
-
I went through ggml-rpc.h, ggml-backend.h, ggml-backend-impl.h, and ggml-rpc.cpp, but I wasn't able to find the code that shards the model. Could someone explain where and how the model is split on the client end and offloaded to the rpc-servers?
I assume it uses gguf-split to some extent, but where exactly?