A "reverse proxy" for multiple ollama servers running various models.
This is a lowest-effort implementation of a reverse proxy for ollama: it accepts chat and generation requests and, depending on the model named in each request, dispatches the request to the server that has been assigned to run that model.
go run ./*.go --level=trace --address 0.0.0.0:11434 \
  --proxy=llama3.2-vision=http://server-02:11434 \
  --proxy=deepseek-r1:14b=http://server-01:11434
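Each `--proxy` value (and each comma-separated `GOLLAMAS_PROXIES` entry, shown below) maps one model name to an upstream server. As a sketch of how such a mapping could be parsed, making no assumption about the actual gollamas config code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseProxies turns entries of the form "model=http://host:port" into a
// model-to-upstream map. Splitting on the first '=' keeps model names such
// as "deepseek-r1:14b" intact even though they contain ':' themselves.
func parseProxies(entries []string) (map[string]string, error) {
	routes := make(map[string]string)
	for _, e := range entries {
		model, target, ok := strings.Cut(e, "=")
		if !ok || model == "" || target == "" {
			return nil, fmt.Errorf("invalid proxy entry %q, want model=url", e)
		}
		routes[model] = target
	}
	return routes, nil
}

func main() {
	// The same format is used by the comma-separated GOLLAMAS_PROXIES variable.
	env := "llama3.2-vision=http://server-02:11434,deepseek-r1:14b=http://server-01:11434"
	routes, err := parseProxies(strings.Split(env, ","))
	if err != nil {
		panic(err)
	}
	fmt.Println(routes)
}
```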
Official images are available on Docker Hub and ghcr.io. You can run the latest image from either:
- Docker Hub:
docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" slawoc/gollamas:latest
- ghcr.io:
docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" ghcr.io/slawo/gollamas:latest
- Manage models
  - Map model aliases to existing model names (some tools only allow a pre-defined set of models)
  - Return only the configured models by default when listing models
  - Add an option to allow requests to currently running models (i.e. when a server has an additional model loaded)
  - Keep models in memory
  - Preload models (ensure a model is loaded upon startup)
  - Ping models (keep a model loaded)
  - Add config to enforce model keep alive globally: `"keep_alive": -1`
  - Add config to override model keep alive per model/server: `"keep_alive": -1`
  - Set a fixed context size: `"options": { "num_ctx": 4096 }`
  - Add config to set a default context size (if missing) in each request: `"options": { "num_ctx": 4096 }` (see the sketch after this list)
  - Add config to set a default context size (if missing) per model/server: `"options": { "num_ctx": 4096 }`
  - Add config to enforce context size in each request: `"options": { "num_ctx": 4096 }`
  - Add config to enforce context size per model/server: `"options": { "num_ctx": 4096 }`
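For the default-context-size items above, a rough sketch of what injecting `options.num_ctx` only when the caller omitted it could look like; `applyDefaultNumCtx` is a hypothetical helper, not part of gollamas:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyDefaultNumCtx sets options.num_ctx on a decoded request body only when
// the caller did not provide one, matching the "default context size
// (if missing)" idea above. An "enforce" variant would overwrite it instead.
func applyDefaultNumCtx(req map[string]any, defaultCtx int) {
	opts, ok := req["options"].(map[string]any)
	if !ok {
		opts = map[string]any{}
		req["options"] = opts
	}
	if _, set := opts["num_ctx"]; !set {
		opts["num_ctx"] = defaultCtx
	}
}

func main() {
	raw := []byte(`{"model":"llama3.2-vision","prompt":"hi"}`)
	var req map[string]any
	if err := json.Unmarshal(raw, &req); err != nil {
		panic(err)
	}
	applyDefaultNumCtx(req, 4096)
	out, _ := json.Marshal(req)
	fmt.Println(string(out)) // {"model":"llama3.2-vision","options":{"num_ctx":4096},"prompt":"hi"}
}
```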
- Proxy API
  - DELETE /api/delete
  - GET /
  - GET /api/tags
  - GET /api/ps
  - GET /api/version
  - GET /v1/models
  - GET /v1/models/:model
  - HEAD /
  - HEAD /api/blobs/:digest
  - HEAD /api/tags
  - HEAD /api/version
  - POST /api/blobs/:digest
  - POST /api/chat
  - POST /api/copy
  - POST /api/create
  - POST /api/embed
  - POST /api/embeddings
  - POST /api/generate
  - POST /api/pull
  - POST /api/show
  - POST /api/push
  - POST /v1/chat/completions
  - POST /v1/completions
  - POST /v1/embeddings
The server reuses existing ollama models and middlewares to speed up development of the initial implementation.
Only requests that include a `model` (or the deprecated `name`) field are forwarded to the server assigned to that model.
Other endpoints hit all servers, and the responses are either reduced to a single answer (e.g. the lowest version available) or combined into one response.
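As an illustration of this dispatch, here is a minimal sketch of the pattern rather than the actual gollamas code: the `routes` table stands in for the `--proxy`/`GOLLAMAS_PROXIES` configuration, and only the two body-carrying endpoints are wired up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// routes maps a model name to the upstream ollama server assigned to it.
// The entries mirror the --proxy examples above.
var routes = map[string]string{
	"llama3.2-vision": "http://server-02:11434",
	"deepseek-r1:14b": "http://server-01:11434",
}

// dispatch reads the model (or deprecated name) field from the request body
// and reverse-proxies the request to the server assigned to that model.
func dispatch(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read request body", http.StatusBadRequest)
		return
	}
	var req struct {
		Model string `json:"model"`
		Name  string `json:"name"` // deprecated alias for model
	}
	if err := json.Unmarshal(body, &req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	model := req.Model
	if model == "" {
		model = req.Name
	}
	target, ok := routes[model]
	if !ok {
		http.Error(w, "no server configured for model "+model, http.StatusNotFound)
		return
	}
	u, err := url.Parse(target)
	if err != nil {
		http.Error(w, "bad upstream URL", http.StatusInternalServerError)
		return
	}
	// Restore the body so the upstream receives the original request unchanged.
	r.Body = io.NopCloser(bytes.NewReader(body))
	httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/api/chat", dispatch)
	http.HandleFunc("/api/generate", dispatch)
	log.Fatal(http.ListenAndServe("0.0.0.0:11434", nil))
}
```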