While the best examples of LLM functionality are eye-popping, putting them into production while ensuring reliability across all of your inputs and scenarios is a challenge. LLM Bench is a lightweight tool for quickly building up "integration tests" to ensure LLMs are doing what you want:
- Test that prompts perform well across a wide spectrum of inputs
- Compare models, and make informed trade-offs between performance, latency, and cost
- Understand the limitations of the models you use and where they fall short
- Learn to prompt better systematically, rather than with one-off tests
LLM Bench is designed to be easy to use and extensible. It's built with TypeScript and runs on Interval.
Features:
- Simple UI for building up custom benchmarks to evaluate and experiment across different models and prompts
- Compare results quickly and view past benchmark runs to detect regressions
- No coding required, but hacker friendly and easy to customize
- Automatically prompts for and validates type-checked JSON output from LLMs, for easier evaluation of tasks and completions (see the sketch after this list)
- Evals run locally by default
- Unopinionated and extensible
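
For a sense of what the type-checked JSON feature enables, here's a minimal sketch using a zod schema. The schema and function names are illustrative only, not the repo's actual API:

```ts
import { z } from "zod";

// Illustrative schema describing the JSON we ask the model to produce.
const Sentiment = z.object({
  label: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
});

// Parse and validate a raw completion. If the model's output doesn't
// match the schema, this throws, so malformed outputs fail the eval early.
function parseCompletion(raw: string): z.infer<typeof Sentiment> {
  return Sentiment.parse(JSON.parse(raw));
}
```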
Maybe coming soon:
- LLM-generated examples based on an existing benchmark + prompt
- More eval methods beyond string matching and custom functions
- Human-in-the-loop benchmark evals
- Integrate application specific context via embeddings
- Compare benchmark results side by side in a table
- Export benchmarks as CSVs
To run LLM Bench you'll need the following:
- An Interval account to host the dashboard
- A Postgres database (feel free to run it locally, or use a service like Supabase for a free, hosted db)
- Some API keys for LLM access (or you can run your own model)
Here's the step-by-step to get started. First, pull down the code to run locally:
```bash
git clone git@github.com:interval/llm-bench.git
cd llm-bench
yarn install
```
Next, set the following environment variables in your `.env` file. (You can use `.env.sample` as a template.)

```
INTERVAL_KEY=
DATABASE_URL=
OPENAI_KEY=
COHERE_KEY=
```
Create or find an existing Interval key in your Interval dashboard. `DATABASE_URL` should point to your Postgres database. Set the OpenAI or Cohere keys depending on which API-based LLMs you plan to use.
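
For example, a filled-in `.env` for a local setup might look something like this (all values below are placeholders, and the database name is just an example):

```
INTERVAL_KEY=<your Interval key>
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/llm_bench
OPENAI_KEY=sk-...
COHERE_KEY=<your Cohere key>
```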
Next, initialize your database with the required schema.
```bash
yarn prisma migrate dev --name init
```
You can now start the app.
```bash
yarn dev
```
Access LLM Bench in your Interval dashboard.
You can run LLM Bench against any language model; we just need an API endpoint to call. Feel free to add alternative models to `src/utils/models.ts` and look to the existing `createCompletion` definitions for example calls.
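
As a rough illustration, a custom model entry could wrap a plain HTTP call along these lines. The function shape here is assumed for the sketch; check the existing `createCompletion` definitions in `src/utils/models.ts` for the real signature and response handling:

```ts
// Hypothetical custom-model completion call; the endpoint URL, request body,
// and response shape are placeholders for whatever your model server expects.
async function createCompletion(prompt: string): Promise<string> {
  const res = await fetch("https://my-model.example.com/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, max_tokens: 256 }),
  });
  if (!res.ok) throw new Error(`Model API returned ${res.status}`);
  const data = await res.json();
  return data.completion; // adjust to match your endpoint's response format
}
```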
This repo includes a script to host models available on Hugging Face via Modal.
To run it, you'll need a Modal account and to install their client:
```bash
pip install modal
```
Start up your ephemeral app with `modal serve src/huggingface-llm.py`. This will boot up a GPU instance to host a language model and a webserver to access it. It's currently set to run `tiiuae/falcon-7b-instruct`. On first run, the instance will need to download the model, which may take a while.
To run against this model with LLM Bench, add the generated Modal API URL (this will be returned when you run `modal serve`) to your `.env` file as `LLM_API_ENDPOINT` and update the available `huggingfaceModels` in `src/utils/huggingface.ts` to whichever model you decide to run.
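
For instance, if you decide to serve a different model, the change might look roughly like this; the exact declaration of `huggingfaceModels` is assumed here, so check `src/utils/huggingface.ts` for how it's actually defined:

```ts
// Hypothetical shape of the model list in src/utils/huggingface.ts.
// List whichever Hugging Face model(s) your Modal app is serving.
export const huggingfaceModels = [
  "tiiuae/falcon-7b-instruct",
  // "mosaicml/mpt-7b-instruct",
] as const;
```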