Rubric Inference Stack

A minimalist open-source LLM inference stack: structured outputs served by SGLang, with minimal API-key auth handled by a Bun web server for roughly 3x higher throughput than FastAPI.

The stack is currently highly opinionated and likely to change as it generalizes.
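For context, the gateway's job is simple: check the API key, then hand the request off to SGLang. Below is a minimal sketch of that pattern in Bun; the upstream URL, port, and pass-through routing are illustrative assumptions, not the repo's actual code.

// gateway.ts — hedged sketch of an API-key-checking Bun proxy in front of SGLang.
// SGLANG_URL and the route handling are assumptions for illustration.
const SGLANG_URL = process.env.SGLANG_URL ?? "http://localhost:30000";

Bun.serve({
  port: 3000,
  async fetch(req) {
    // Reject any request whose bearer token doesn't match SERVER_API_KEY.
    if (req.headers.get("authorization") !== `Bearer ${process.env.SERVER_API_KEY}`) {
      return new Response("Unauthorized", { status: 401 });
    }
    // Otherwise forward the request to the SGLang server unchanged.
    const url = new URL(req.url);
    return fetch(`${SGLANG_URL}${url.pathname}${url.search}`, {
      method: req.method,
      headers: req.headers,
      body: req.body,
    });
  },
});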

Quickstart

The following assumes you're running on a Linux machine (likely in the cloud) with a GPU.

Docker

To build and run the Docker image:

docker build -t inference .
docker run -p 3000:3000 \
  -e SERVER_API_KEY=your_api_key \
  -e HF_TOKEN=your_hf_token \
  inference

A pre-built image is also available on GHCR as ghcr.io/rubriclab/inference:latest. To run it:

docker run -p 3000:3000 \
  -e SERVER_API_KEY=your_api_key \
  -e HF_TOKEN=your_hf_token \
  ghcr.io/rubriclab/inference:latest
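
Once the container is up, you can smoke-test the endpoint from any HTTP client. A quick sketch using Bun's built-in fetch follows; the model field is a placeholder that depends on whatever model the server is configured to load.

// smoke.ts — run with `bun smoke.ts` after exporting SERVER_API_KEY.
const res = await fetch("http://localhost:3000/v1/chat/completions", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    authorization: `Bearer ${process.env.SERVER_API_KEY}`,
  },
  body: JSON.stringify({
    model: "default", // placeholder: depends on the model the server loads
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(res.status, await res.json());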

SkyPilot

Install SkyPilot and connect an infra provider. We recommend uv for fast setup, and either Vast for competitively priced GPUs or RunPod for a smoother experience.

First, grab an API key from your cloud provider (e.g. Vast or RunPod).

curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.10
source .venv/bin/activate
uv pip install "skypilot[vast,runpod]"

# Vast
uv pip install "vastai-sdk>=0.1.12"
echo "<your_vast_api_key>" > ~/.vast_api_key

# Runpod
uv pip install "runpod>=1.6.1"
runpod config # then enter your API key

sky launch skypilot.yaml

Client Example

Test the API from any OpenAI-compatible client:

cd test && bun i && touch .env

Populate your .env with:

BASE_URL=http://localhost:3000/v1
SERVER_API_KEY=your_api_key

Run the test:

bun index.ts

You should see a reasoning chain and a JSON payload conforming to the schema.
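
If you'd rather write your own client than use the bundled test, something like the following should work against any OpenAI-compatible endpoint. This is a sketch, not the repo's test/index.ts: the openai and zod packages, the schema, and the model name are all assumptions.

// client.ts — hedged sketch of a structured-output request (bun add openai zod).
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const client = new OpenAI({
  baseURL: process.env.BASE_URL, // e.g. http://localhost:3000/v1
  apiKey: process.env.SERVER_API_KEY,
});

// Illustrative schema: a reasoning chain plus a final answer.
const Answer = z.object({
  reasoning: z.string(),
  answer: z.string(),
});

const completion = await client.chat.completions.create({
  model: "default", // placeholder: depends on the model the server loads
  messages: [{ role: "user", content: "What is 17 * 24?" }],
  response_format: zodResponseFormat(Answer, "answer"),
});

// The server returns JSON conforming to the schema; parse and validate it.
console.log(Answer.parse(JSON.parse(completion.choices[0].message.content ?? "{}")));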
