Replies: 7 comments · 4 replies
-
Where is the answer? Edit: it may be -ngl; it says that offloads some of the layers.
-
I have the same question.
-
@HuangLED were you able to build/run it in hybrid mode?
-
It is the -ngl N option. Edit: I think -ngl 0 means everything runs on the CPU; I'm not sure how, or at what values, it starts using the GPU.
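(For reference: the -ngl value corresponds to the n_gpu_layers field of llama_model_params in llama.cpp's C API. Below is a minimal sketch of that mapping; exact function names differ between llama.cpp versions and "model.gguf" is a placeholder path, so treat it as an illustration rather than copy-paste code.)

```cpp
#include "llama.h"

int main() {
    // Start from the default model parameters, then choose how many
    // transformer layers to offload to the GPU backend.
    llama_model_params mparams = llama_model_default_params();

    // Roughly what "-ngl 20" does on the command line: ask for up to 20
    // layers to be kept in VRAM and run on the GPU, while the remaining
    // layers stay in system RAM and run on the CPU.
    mparams.n_gpu_layers = 20;

    // Function names have changed across llama.cpp versions
    // (llama_load_model_from_file vs llama_model_load_from_file);
    // "model.gguf" is a placeholder path.
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // ... create a context, run inference ...

    llama_free_model(model);
    return 0;
}
```

With -ngl 0 nothing is offloaded, and values larger than the model's layer count simply offload all layers (which will fail or spill if they don't fit in VRAM).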
-
I have the same question. Is there any more detailed technical information available? So far, I've only found some information in the PowerInfer paper.
-
Some layers are run on the CPU and others on the GPU, sequentially. You can set the number of layers to offload to the GPU with the -ngl option.
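(To make the "sequentially" part concrete, here is a rough conceptual sketch of a forward pass split by n_gpu_layers, i.e. the -ngl value. The helper functions, the direction of the split, and the single boundary transfer are assumptions for illustration only; this is not llama.cpp's actual code.)

```cpp
#include <cstdio>

// Purely illustrative stand-ins, NOT llama.cpp APIs.
struct Tensor { float dummy = 0.0f; };

static Tensor run_layer_on_cpu(int il, Tensor x) { std::printf("layer %2d on CPU\n", il); return x; }
static Tensor run_layer_on_gpu(int il, Tensor x) { std::printf("layer %2d on GPU\n", il); return x; }
static Tensor copy_to_gpu(Tensor x) { std::printf("copy activations host -> device\n"); return x; }
static Tensor copy_to_cpu(Tensor x) { std::printf("copy activations device -> host\n"); return x; }

// One forward pass with the split controlled by n_gpu_layers (the -ngl value).
// Assumption for illustration: the first (n_layers - n_gpu_layers) layers stay
// on the CPU and the rest run on the GPU; the real assignment is decided by
// llama.cpp when the model is loaded.
static Tensor forward(Tensor x, int n_layers, int n_gpu_layers) {
    const int first_gpu_layer = n_layers - n_gpu_layers;
    for (int il = 0; il < n_layers; ++il) {
        if (il < first_gpu_layer) {
            x = run_layer_on_cpu(il, x);      // weights in system RAM
        } else {
            if (il == first_gpu_layer) {
                x = copy_to_gpu(x);           // single transfer at the boundary
            }
            x = run_layer_on_gpu(il, x);      // weights in VRAM
        }
    }
    return n_gpu_layers > 0 ? copy_to_cpu(x) : x;
}

int main() {
    forward(Tensor{}, /*n_layers=*/32, /*n_gpu_layers=*/20); // like "-ngl 20" on a 32-layer model
    return 0;
}
```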
-
Setting -ngl to a large number only offloads layers to VRAM; during inference the GPU is still not fully utilized while CPU usage is very high. I don't know how to improve the GPU utilization. https://github.com/ggml-org/llama.cpp/discussions/11881
-
The home page says "CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity".
I'd like to learn more about it. Are there any design notes or code pointers to how it is actually done?
Much appreciated.