How to run a llama server with fixed prompt cache without caching each of my upcoming queries? #14282
Unanswered · NIKHILDUGAR asked this question in Q&A
One reply suggests:
After each request, send a dummy request with the original "fixed" prompt and …
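A minimal sketch of that idea, assuming the server's native `/completion` endpoint on the default port and a shell variable `FIXED_PROMPT` holding the fixed prefix (both are assumptions, not details from the thread): after each real query, re-send only the fixed prompt so the slot's cache ends up holding just that prefix again.

```bash
# Hypothetical cache-reset step after each real request.
# FIXED_PROMPT, the port, and the endpoint choice are assumptions, not from the thread.
FIXED_PROMPT="<the fixed system/prefix prompt>"

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$FIXED_PROMPT" \
        '{prompt: $p, cache_prompt: true, n_predict: 0}')" \
  > /dev/null   # n_predict: 0 requests prompt evaluation with no new tokens (assumption)
```

Whether this actually keeps the previous query's tokens from lingering depends on how the server reuses the slot's cache, so it is worth verifying against the server logs.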
I am currently running my server as follows:
CUDA_VISIBLE_DEVICES="0" ./llama.cpp/build/bin/llama-server --model llama.cpp/34B/km.gguf --gpu-layers 99999 --no-context-shift --ctx-size 8000 --keep -1
and pass my prompt in the request with `cache_prompt: true` set.
The problem I am facing with this is that `cache_prompt: true` causes each upcoming query to be cached and kept as history, which I believe leads to inconsistent results across multiple runs (I am aware that caching does not change the probabilities, but I believe it keeps the queries in its history or something that influences the outputs). I would appreciate any and all help and advice. Thanks.
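For context, a request of the kind described above might look like the following; the endpoint path is llama-server's native completion API, but the prompt text, port, and generation settings are placeholders rather than the poster's actual values:

```bash
# Hypothetical request: the fixed prompt plus the current query, with prompt
# caching enabled. Prompt text, port, and n_predict are placeholder values.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<fixed instructions>\n\n<current user query>",
        "cache_prompt": true,
        "n_predict": 256
      }'
```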