Limit default context size in the node template #435

Open
CrossPr0duct opened this issue Feb 23, 2025 · 4 comments · May be fixed by #444
Labels
bug Something isn't working

Comments

@CrossPr0duct

Issue description

When loading an 8B model in the project created by npm create node-llama-cpp@latest, it saturates GPU memory up to 24GB.

Expected Behavior

It should only use ~8GB of VRAM.

Actual Behavior

Shouldn't this only use ~8GB of VRAM? I am using Q8.
My GPU memory usage starts at around 3GB, then jumps to 24GB.

Steps to reproduce

Just run the latest npm create node-llama-cpp@latest to create an app, run npm install, then npm start, and load the 8GB Llama model, as shown below.
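
For reference, the reproduction boils down to roughly these commands (the project directory name is whatever you pick during the interactive setup; "my-app" below is just a placeholder):

npm create node-llama-cpp@latest
cd my-app
npm install
npm start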

My Environment

OS: Windows 10.0.26100 (x64) <-- reported as Windows 10, but it's actually Windows 11
Node: 22.13.0 (x64)
TypeScript: 5.7.3
node-llama-cpp: 3.6.0

CUDA: available
Vulkan: available

CUDA device: NVIDIA GeForce RTX 4090
CUDA used VRAM: 6.38% (1.53GB/23.99GB)
CUDA free VRAM: 93.61% (22.46GB/23.99GB)

Vulkan device: NVIDIA GeForce RTX 4090
Vulkan used VRAM: 6.38% (1.53GB/23.99GB)
Vulkan free VRAM: 93.61% (22.46GB/23.99GB)
Vulkan unified memory: 512MB (2.08%)

CPU model: AMD Ryzen 9 7900X 12-Core Processor
Math cores: 12
Used RAM: 50.15% (63.75GB/127.12GB)
Free RAM: 49.84% (63.37GB/127.12GB)
Used swap: 51.24% (76.41GB/149.12GB)
Max swap size: 149.12GB
mmap: supported

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

@CrossPr0duct added the bug (Something isn't working) and requires triage (Requires triaging) labels on Feb 23, 2025
@CrossPr0duct
Author

@giladgd why does an 8B Q8 model require 21GB of VRAM? This wasn't the case for the llama.cpp server before.

@giladgd
Contributor

giladgd commented Feb 23, 2025

By default, node-llama-cpp uses the largest context size that can fit in your GPU’s VRAM (up to the model's training context size), to allow the model to ingest as much information as possible before a context shift happens; this makes it produce significantly higher quality responses when using long inputs.
The llama.cpp server uses a significantly shorter default context size (4096).

If you don’t need such a long context size, you can configure it when creating a context.
Here's an example of how you can do that:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: {
        // cap the context size to reduce VRAM usage
        max: 4096
    }
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

I'm working on incremental allocation at runtime to optimize memory consumption while still supporting advanced use cases without having to configure anything, but it's not ready yet.

@CrossPr0duct
Copy link
Author

@giladgd that makes a lot of sense. Still, this might leave a very bad impression of the library; I had thought it was broken. Perhaps you should add a warning or go with a default context size? When it did that, my whole computer basically froze and it took forever to load.

@giladgd
Copy link
Contributor

giladgd commented Feb 25, 2025

@CrossPr0duct You're right. I'll add a default limit to the max context size in the node template for now, until the incremental allocation is ready.
Thanks for reporting this :)
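
As a rough sketch, the template's context creation would then look something like this (the 8192 cap below is only an illustrative value, not necessarily the one the template will end up using):

const context = await model.createContext({
    contextSize: {
        // illustrative default cap; the actual template value may differ
        max: 8192
    }
});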

@giladgd self-assigned this on Feb 25, 2025
@giladgd removed the requires triage (Requires triaging) label on Feb 25, 2025
@giladgd changed the title from "npm create node-llama-cpp memory issue." to "limit default context size in the node template" on Feb 25, 2025
@giladgd changed the title from "limit default context size in the node template" to "Limit default context size in the node template" on Feb 25, 2025
@giladgd linked a pull request on Mar 20, 2025 that will close this issue