Scrappy Llama Server model swapping proxy middleware #11286
lukestanley started this conversation in Show and tell
I wanted to quickly switch models, save GPU memory and power when idle, and still use speculative decoding and the latest Llama.cpp Server goodness. I'm not very experienced with C++, so I wrote this middleware for myself.
When a request specifies a different model from the one currently loaded, the middleware loads the new model and then proxies the request through, streaming the response as normal.
It's just over 200 lines of code; you can ask an LLM what it depends on and how to use it, if that's of use to you!
https://gist.github.com/lukestanley/2577d0b8fcb02e678b202fe0fd924b15
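To illustrate the general idea (this is a minimal sketch, not the actual gist), here's roughly how such a swapping proxy can look in Python using only the standard library: it launches `llama-server` with the requested GGUF file, restarts it when a request names a different model, and streams the backend response back to the client. The binary path, model map, ports, and the crude start-up wait are all illustrative placeholders, and the idle unloading my version aims for is omitted here for brevity.

```python
# Minimal, illustrative sketch of a model-swapping proxy for llama-server.
# LLAMA_BIN, MODELS, ports and the fixed start-up wait are placeholder assumptions.
import json
import subprocess
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

LLAMA_BIN = "./llama-server"            # path to the llama.cpp server binary
BACKEND = "http://127.0.0.1:8081"       # where llama-server listens
MODELS = {                               # model name -> GGUF path (illustrative)
    "llama-3.1-8b": "/models/llama-3.1-8b-q4_k_m.gguf",
}

_proc = None
_current = None
_lock = threading.Lock()

def _ensure_model(name: str) -> None:
    """Start (or restart) llama-server so the requested model is loaded."""
    global _proc, _current
    with _lock:
        if name == _current and _proc and _proc.poll() is None:
            return
        if _proc and _proc.poll() is None:
            _proc.terminate()
            _proc.wait()
        _proc = subprocess.Popen([LLAMA_BIN, "-m", MODELS[name], "--port", "8081"])
        _current = name
        time.sleep(5)  # crude wait for the backend to come up

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")
        # Swap the backend model if the request asks for a different one.
        if payload.get("model") in MODELS:
            _ensure_model(payload["model"])
        upstream = urlopen(Request(BACKEND + self.path, data=body,
                                   headers={"Content-Type": "application/json"}))
        self.send_response(upstream.status)
        self.send_header("Content-Type",
                         upstream.headers.get("Content-Type", "application/json"))
        self.end_headers()
        # Relay the backend response in small chunks so streaming/SSE still works.
        while chunk := upstream.read(4096):
            self.wfile.write(chunk)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

With something like this listening on port 8080, an OpenAI-style client just points at the proxy and picks a model per request; an idle timeout that terminates the backend to free GPU memory could be added around the same lock.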
It's a hot mess of a GitHub gist, but it works. I will tidy it up and think about how to do it a bit better.
I have been using Ollama and wanted the latest features and the speed that speculative decoding brings.