Limit default context size in the node template #435
Comments
@giladgd why does an 8B Q8 model require 21 GB of VRAM? This wasn't the case with the llama.cpp server before.
By default, the context is created with the largest context size that can fit in the available memory. If you don't need such a long context size, you can configure it when creating a context:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});

// limit the context to at most 4096 tokens to reduce memory usage
const context = await model.createContext({
    contextSize: {
        max: 4096
    }
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);
```

I'm working on incremental allocation during runtime to optimize memory consumption while still supporting advanced use cases without having to configure anything, but it's not ready yet.
@giladgd That makes a lot of sense. Still, this might leave a very bad impression of the library; I thought it was broken. Perhaps you should add a warning or go with a smaller default context size? It saturated my memory, my whole computer basically froze, and it took forever to load.
@CrossPr0duct You're right. I'll add a default limit to the max context size in the node template for now, until the incremental allocation is ready.
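As a rough illustration of that plan, the template's context creation could clamp the default along the lines of the sketch below. This is only a sketch of the idea, not the actual template change: the 4096 cap is an assumed placeholder, and the use of model.trainContextSize (the context size the model was trained with) as an upper bound is an assumption.

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    // same model file as in the example above
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});

// Cap the default context size so the KV cache doesn't try to fill all available VRAM.
// The 4096 value is a placeholder; never ask for more than the model's training context size.
const context = await model.createContext({
    contextSize: {
        max: Math.min(model.trainContextSize, 4096)
    }
});
```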
Issue description
When loading an 8B model in an app created with npm create node-llama-cpp@latest, memory usage saturates at 24GB.
Expected Behavior
It should only use ~8GB of VRAM.
Actual Behavior
Shouldn't this only use about 8GB of VRAM? I am using Q8.
My GPU memory usage starts at around 3GB, then jumps to 24GB.
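One way to confirm that the jump comes from the context (KV cache) allocation rather than the model weights is to log VRAM usage at each step. A minimal sketch, assuming llama.getVramState() is available in the installed version; the model path and logging helper are illustrative:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

// hypothetical helper that prints the current VRAM usage in GB
async function logVram(label: string) {
    const {used, total} = await llama.getVramState();
    console.log(`${label}: ${(used / 1024 ** 3).toFixed(2)}GB / ${(total / 1024 ** 3).toFixed(2)}GB`);
}

await logVram("before loading the model");

const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q8_0.gguf")
});
await logVram("after loading the model weights");

// creating the context is where the KV cache gets allocated
const context = await model.createContext();
await logVram("after creating the context");
```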
Steps to reproduce
Create an app with the latest npm create node-llama-cpp@latest, run npm install, then npm start, and load the 8B Llama model.
My Environment
OS: Windows 10.0.26100 (x64) (reported as Windows 10, but it's actually Windows 11)
Node: 22.13.0 (x64)
TypeScript: 5.7.3
node-llama-cpp: 3.6.0
CUDA: available
Vulkan: available
CUDA device: NVIDIA GeForce RTX 4090
CUDA used VRAM: 6.38% (1.53GB/23.99GB)
CUDA free VRAM: 93.61% (22.46GB/23.99GB)
Vulkan device: NVIDIA GeForce RTX 4090
Vulkan used VRAM: 6.38% (1.53GB/23.99GB)
Vulkan free VRAM: 93.61% (22.46GB/23.99GB)
Vulkan unified memory: 512MB (2.08%)
CPU model: AMD Ryzen 9 7900X 12-Core Processor
Math cores: 12
Used RAM: 50.15% (63.75GB/127.12GB)
Free RAM: 49.84% (63.37GB/127.12GB)
Used swap: 51.24% (76.41GB/149.12GB)
Max swap size: 149.12GB
mmap: supported
Additional Context
No response
Relevant Features Used
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.