A "reverse proxy" for multiple ollama servers running various models.
This is a lowest-effort implementation of a reverse proxy for ollama: it accepts chat and generation requests and, depending on the model named in each request, dispatches the request to the server that has been assigned to run that model.
go run ./*.go --level=trace --address 0.0.0.0:11434 \
  --proxy=llama3.2-vision=http://server-02:11434 \
  --proxy=deepseek-r1:14b=http://server-01:11434
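Each `--proxy` value (and each comma-separated `GOLLAMAS_PROXIES` entry, shown below) maps one model name to an upstream server. As a sketch of how such a mapping could be parsed, making no assumption about the actual gollamas config code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseProxies turns entries of the form "model=http://host:port" into a
// model-to-upstream map. Splitting on the first '=' keeps model names such
// as "deepseek-r1:14b" intact even though they contain ':' themselves.
func parseProxies(entries []string) (map[string]string, error) {
	routes := make(map[string]string)
	for _, e := range entries {
		model, target, ok := strings.Cut(e, "=")
		if !ok || model == "" || target == "" {
			return nil, fmt.Errorf("invalid proxy entry %q, want model=url", e)
		}
		routes[model] = target
	}
	return routes, nil
}

func main() {
	// The same format is used by the comma-separated GOLLAMAS_PROXIES variable.
	env := "llama3.2-vision=http://server-02:11434,deepseek-r1:14b=http://server-01:11434"
	routes, err := parseProxies(strings.Split(env, ","))
	if err != nil {
		panic(err)
	}
	fmt.Println(routes)
}
```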
Official images are available on Docker Hub and ghcr.io. You can run the latest image from either:
- Docker Hub:
docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" slawoc/gollamas:latest
- ghcr.io:
docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" ghcr.io/slawo/gollamas:latest
- Manage models
  - Map model aliases to existing model names (some tools only allow a pre-defined set of models)
  - Return only the configured models by default when listing models
  - Add an option to allow requests to currently running models (i.e. when a server has an additional model loaded)
  - Keep models in memory
  - Preload models (ensure a model is loaded upon startup)
  - Ping models (keep a model loaded)
  - Add config to enforce model keep alive globally: `"keep_alive": -1`
  - Add config to override model keep alive per model/server: `"keep_alive": -1`
  - Set a fixed context size: `"options": { "num_ctx": 4096 }`
  - Add config to set a default context size (if missing) in each request: `"options": { "num_ctx": 4096 }` (see the sketch after this list)
  - Add config to set a default context size (if missing) per model/server: `"options": { "num_ctx": 4096 }`
  - Add config to enforce context size in each request: `"options": { "num_ctx": 4096 }`
  - Add config to enforce context size per model/server: `"options": { "num_ctx": 4096 }`
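For the default-context-size items above, a rough sketch of what injecting `options.num_ctx` only when the caller omitted it could look like; `applyDefaultNumCtx` is a hypothetical helper, not part of gollamas:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyDefaultNumCtx sets options.num_ctx on a decoded request body only when
// the caller did not provide one, matching the "default context size
// (if missing)" idea above. An "enforce" variant would overwrite it instead.
func applyDefaultNumCtx(req map[string]any, defaultCtx int) {
	opts, ok := req["options"].(map[string]any)
	if !ok {
		opts = map[string]any{}
		req["options"] = opts
	}
	if _, set := opts["num_ctx"]; !set {
		opts["num_ctx"] = defaultCtx
	}
}

func main() {
	raw := []byte(`{"model":"llama3.2-vision","prompt":"hi"}`)
	var req map[string]any
	if err := json.Unmarshal(raw, &req); err != nil {
		panic(err)
	}
	applyDefaultNumCtx(req, 4096)
	out, _ := json.Marshal(req)
	fmt.Println(string(out)) // {"model":"llama3.2-vision","options":{"num_ctx":4096},"prompt":"hi"}
}
```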
- Proxy API
  - DELETE /api/delete
  - GET /
  - GET /api/tags
  - GET /api/ps
  - GET /api/version
  - GET /v1/models
  - GET /v1/models/:model
  - HEAD /
  - HEAD /api/blobs/:digest
  - HEAD /api/tags
  - HEAD /api/version
  - POST /api/blobs/:digest
  - POST /api/chat
  - POST /api/copy
  - POST /api/create
  - POST /api/embed
  - POST /api/embeddings
  - POST /api/generate
  - POST /api/pull
  - POST /api/show
  - POST /api/push
  - POST /v1/chat/completions
  - POST /v1/completions
  - POST /v1/embeddings
The server reuses existing ollama models and middlewares to speed up development of the initial implementation.
Only requests that include a `model` (or the deprecated `name`) field are forwarded to the server assigned to that model.
Other endpoints hit all servers, and the responses are either reduced to a single answer (e.g. the lowest version available) or combined into one response.
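As an illustration of this dispatch, here is a minimal sketch of the pattern rather than the actual gollamas code: the `routes` table stands in for the `--proxy`/`GOLLAMAS_PROXIES` configuration, and only the two body-carrying endpoints are wired up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// routes maps a model name to the upstream ollama server assigned to it.
// The entries mirror the --proxy examples above.
var routes = map[string]string{
	"llama3.2-vision": "http://server-02:11434",
	"deepseek-r1:14b": "http://server-01:11434",
}

// dispatch reads the model (or deprecated name) field from the request body
// and reverse-proxies the request to the server assigned to that model.
func dispatch(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read request body", http.StatusBadRequest)
		return
	}
	var req struct {
		Model string `json:"model"`
		Name  string `json:"name"` // deprecated alias for model
	}
	if err := json.Unmarshal(body, &req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	model := req.Model
	if model == "" {
		model = req.Name
	}
	target, ok := routes[model]
	if !ok {
		http.Error(w, "no server configured for model "+model, http.StatusNotFound)
		return
	}
	u, err := url.Parse(target)
	if err != nil {
		http.Error(w, "bad upstream URL", http.StatusInternalServerError)
		return
	}
	// Restore the body so the upstream receives the original request unchanged.
	r.Body = io.NopCloser(bytes.NewReader(body))
	httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/api/chat", dispatch)
	http.HandleFunc("/api/generate", dispatch)
	log.Fatal(http.ListenAndServe("0.0.0.0:11434", nil))
}
```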