Replies: 7 comments · 4 replies
-
Where is the answer? Edit: it may be -ngl; it says that offloads some of the layers.
-
I have the same question.
-
@HuangLED were you able to build/run it in hybrid mode?
-
It is the -ngl N option. Edit: I think -ngl 0 means everything runs on the CPU; I'm not sure how, or at what values, it starts using the GPU.
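(For reference: the -ngl value corresponds to the n_gpu_layers field of llama_model_params in llama.cpp's C API. Below is a minimal sketch of that mapping; exact function names differ between llama.cpp versions and "model.gguf" is a placeholder path, so treat it as an illustration rather than copy-paste code.)

```cpp
#include "llama.h"

int main() {
    // Start from the default model parameters, then choose how many
    // transformer layers to offload to the GPU backend.
    llama_model_params mparams = llama_model_default_params();

    // Roughly what "-ngl 20" does on the command line: ask for up to 20
    // layers to be kept in VRAM and run on the GPU, while the remaining
    // layers stay in system RAM and run on the CPU.
    mparams.n_gpu_layers = 20;

    // Function names have changed across llama.cpp versions
    // (llama_load_model_from_file vs llama_model_load_from_file);
    // "model.gguf" is a placeholder path.
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // ... create a context, run inference ...

    llama_free_model(model);
    return 0;
}
```

With -ngl 0 nothing is offloaded, and values larger than the model's layer count simply offload all layers (which will fail or spill if they don't fit in VRAM).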
-
I have the same question. Is there any more detailed technical information available? So far, I've only found some information in the PowerInfer paper.
-
Some layers are run on the CPU and others on the GPU, sequentially. You can set the number of layers to offload to the GPU with the -ngl option.
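(To make the "sequentially" part concrete, here is a rough conceptual sketch of a forward pass split by n_gpu_layers, i.e. the -ngl value. The helper functions, the direction of the split, and the single boundary transfer are assumptions for illustration only; this is not llama.cpp's actual code.)

```cpp
#include <cstdio>

// Purely illustrative stand-ins, NOT llama.cpp APIs.
struct Tensor { float dummy = 0.0f; };

static Tensor run_layer_on_cpu(int il, Tensor x) { std::printf("layer %2d on CPU\n", il); return x; }
static Tensor run_layer_on_gpu(int il, Tensor x) { std::printf("layer %2d on GPU\n", il); return x; }
static Tensor copy_to_gpu(Tensor x) { std::printf("copy activations host -> device\n"); return x; }
static Tensor copy_to_cpu(Tensor x) { std::printf("copy activations device -> host\n"); return x; }

// One forward pass with the split controlled by n_gpu_layers (the -ngl value).
// Assumption for illustration: the first (n_layers - n_gpu_layers) layers stay
// on the CPU and the rest run on the GPU; the real assignment is decided by
// llama.cpp when the model is loaded.
static Tensor forward(Tensor x, int n_layers, int n_gpu_layers) {
    const int first_gpu_layer = n_layers - n_gpu_layers;
    for (int il = 0; il < n_layers; ++il) {
        if (il < first_gpu_layer) {
            x = run_layer_on_cpu(il, x);      // weights in system RAM
        } else {
            if (il == first_gpu_layer) {
                x = copy_to_gpu(x);           // single transfer at the boundary
            }
            x = run_layer_on_gpu(il, x);      // weights in VRAM
        }
    }
    return n_gpu_layers > 0 ? copy_to_cpu(x) : x;
}

int main() {
    forward(Tensor{}, /*n_layers=*/32, /*n_gpu_layers=*/20); // like "-ngl 20" on a 32-layer model
    return 0;
}
```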
-
Setting -ngl to a large number only offloads layers to VRAM; during inference the GPU is still not fully utilized while CPU usage is very high. I don't know how to improve the GPU utilization. https://github.com/ggml-org/llama.cpp/discussions/11881
-
The home page says "CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity".
I'd like to learn more about it. Are there any design notes or code pointers to how it is actually done?
Much appreciated.