A minimalist open-source LLM inference stack: SGLang for structured outputs, fronted by a Bun web server with minimal API key auth for 3x higher throughput than FastAPI.
The stack is currently highly opinionated and is likely to change and become more general over time.
The following assumes you're running on a Linux machine (likely in the cloud) with a GPU.
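Conceptually, the Bun server sits in front of SGLang: it checks the API key and forwards OpenAI-compatible requests to the local SGLang process. The sketch below is illustrative only, not the repo's actual server code; the SGLANG_URL variable and the assumption that SGLang listens on port 30000 are placeholders.

```ts
// server-sketch.ts — a minimal sketch, not the repo's actual implementation.
// Assumes SGLang is running locally and serving its OpenAI-compatible API.
const SGLANG_URL = process.env.SGLANG_URL ?? "http://localhost:30000";

Bun.serve({
  port: 3000,
  async fetch(req) {
    // Reject requests that don't carry the expected bearer token.
    const auth = req.headers.get("authorization");
    if (auth !== `Bearer ${process.env.SERVER_API_KEY}`) {
      return new Response("Unauthorized", { status: 401 });
    }

    // Proxy the path, query, and body through to SGLang unchanged.
    const url = new URL(req.url);
    const body = req.method === "GET" ? undefined : await req.text();
    return fetch(`${SGLANG_URL}${url.pathname}${url.search}`, {
      method: req.method,
      headers: { "content-type": req.headers.get("content-type") ?? "application/json" },
      body,
    });
  },
});
```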
To build and run the Docker image:
docker build -t inference .
docker run -p 3000:3000 \
-e SERVER_API_KEY=your_api_key \
-e HF_TOKEN=your_hf_token \
inference
The pre-built Docker image is also available on GHCR as rubriclab/inference:latest. To run the pre-built image:
docker run -p 3000:3000 \
-e SERVER_API_KEY=your_api_key \
-e HF_TOKEN=your_hf_token \
ghcr.io/rubriclab/inference:latest
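Once the container is up, you can sanity-check it from any HTTP client. A minimal Bun/TypeScript check might look like the following; the /v1/models route and the Bearer auth header are assumptions based on the OpenAI-compatible API the server exposes.

```ts
// sanity-check.ts — run with: bun sanity-check.ts
// Assumes the server exposes the OpenAI-compatible GET /v1/models route
// and expects the API key as a Bearer token.
const res = await fetch("http://localhost:3000/v1/models", {
  headers: { authorization: `Bearer ${process.env.SERVER_API_KEY}` },
});

console.log(res.status, await res.json());
```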
Install SkyPilot and connect an infra provider. We recommend uv for fast setup, and either Vast for competitively priced GPUs or Runpod for a smoother experience.
First, grab an API key from your cloud provider (e.g. Vast or Runpod).
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.10
source .venv/bin/activate
uv pip install "skypilot[vast,runpod]"
# Vast
uv pip install "vastai-sdk>=0.1.12"
echo "<your_vast_api_key>" > ~/.vast_api_key
# Runpod
uv pip install "runpod>=1.6.1"
runpod config # then enter your API key
sky launch skypilot.yaml
Test the API from any OpenAI-compatible client:
cd test && bun i && touch .env
Populate your .env with:
BASE_URL=http://localhost:3000/v1
SERVER_API_KEY=your_api_key
Run the test:
bun index.ts
You should see a reasoning chain and a JSON payload conforming to the schema.
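For reference, here is a rough sketch of what a test client like test/index.ts could look like, using the openai package against the Bun server. The schema, prompt, and model name below are placeholders; the repo's actual test will differ.

```ts
// index.ts — illustrative only; the repo's actual test may differ.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.BASE_URL,      // e.g. http://localhost:3000/v1
  apiKey: process.env.SERVER_API_KEY, // checked by the Bun server
});

// A hypothetical JSON schema for the structured output.
const schema = {
  type: "object",
  properties: {
    answer: { type: "string" },
    confidence: { type: "number" },
  },
  required: ["answer", "confidence"],
};

const completion = await client.chat.completions.create({
  model: "default", // placeholder; use whatever model the server is serving
  messages: [
    { role: "user", content: "Is the sky blue? Answer with a confidence score." },
  ],
  response_format: {
    type: "json_schema",
    json_schema: { name: "answer", schema, strict: true },
  },
});

console.log(completion.choices[0].message.content);
```

Depending on the model and server configuration, the reasoning chain may be returned alongside (or streamed before) the JSON payload.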