A Python library for integrating crowd evaluation into your machine learning training loops. This library provides asynchronous, non-blocking evaluation of model outputs (currently supporting image generation) with automatic logging to Weights & Biases (wandb).
- Asynchronous Evaluation: Evaluations run in the background without blocking your training loop
- Wandb Integration: Results are automatically logged to your wandb runs with proper ordering
- Image Evaluation: Built-in support for evaluating generated images on multiple criteria
- Crowd-in-the-Loop: Uses Rapidata for high-quality crowd evaluation
- Easy Integration: Add evaluation to your training loop with just a few lines of code
```python
import wandb
from src.crowd_eval.checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator

# Initialize wandb
run = wandb.init(project="my-project")

# Create evaluator
evaluator = ImageEvaluator(wandb_run=run, model_name="my-model")

# In your training loop
for step in range(100):
    # ... your training code ...

    # Generate or load validation images (every N steps)
    if step % 10 == 0:
        validation_images = ["path/to/image_1.png", "path/to/image_2.png"]

        # Fire-and-forget evaluation - returns immediately!
        evaluator.evaluate(validation_images)

    # ... continue training ...

# Wait for all evaluations to complete before finishing
evaluator.wait_for_all_evaluations()
run.finish()
```
- Python 3.9+
- A Rapidata account with API credentials
- A Weights & Biases account
```bash
pip install crowd-eval
```
Install uv if you haven't already:
```bash
# For macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# For Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
- Create and activate a virtual environment:
```bash
uv venv

# On Unix/macOS
source .venv/bin/activate

# On Windows
.venv\Scripts\activate
```
- Install dependencies:
```bash
uv sync
```
Create a `.env` file in your project root:

```
OPENAI_API_KEY=your_openai_api_key                  # If running the example file
RAPIDATA_CLIENT_ID=your_rapidata_client_id          # If running on a server
RAPIDATA_CLIENT_SECRET=your_rapidata_client_secret  # If running on a server
```
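The example script later in this README loads these variables with `python-dotenv`. A minimal sketch of that pattern (assuming the `.env` file sits in the directory you run from):

```python
import os
from dotenv import load_dotenv

# Read .env and expose its entries as environment variables for this process.
load_dotenv()

# Credentials are now available via os.getenv, e.g. for the Rapidata client.
print(os.getenv("RAPIDATA_CLIENT_ID") is not None)
```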
The `ImageEvaluator` evaluates generated images on three key metrics:
- Preference: Overall crowd preference for the image
- Alignment: How well the image matches its text description
- Coherence: Visual quality and absence of artifacts
For the evaluator to work, your image filenames must end with `_{prompt_id}` (i.e. match `*_{prompt_id}.png`); the rest of the filename is not significant. Here `{prompt_id}` corresponds to a prompt ID from the evaluation dataset, and the evaluator automatically validates that your images match available prompts.
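For example, assuming the evaluation dataset contains prompt IDs `17` and `42` (hypothetical values), filenames like these would pass validation:

```python
# Only the trailing _{prompt_id} matters; the prefix can be anything.
validation_images = [
    "validation_images/checkpoint_500_17.png",     # prompt_id = 17
    "validation_images/any_name_you_like_42.png",  # prompt_id = 42
]
```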
```bash
uv venv
source .venv/bin/activate
uv sync
uv add openai python-dotenv
```
Then log in to wandb:

```bash
wandb login
```
```python
import os
import sys
import openai
import requests
import wandb
from src.crowd_eval.checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator
from dotenv import load_dotenv

load_dotenv()

# Setup
openai.api_key = os.getenv("OPENAI_API_KEY")
run = wandb.init(project="dalle-evaluation")
evaluator = ImageEvaluator(wandb_run=run, model_name="dalle-3")


def generate_and_save_image(prompt: str, file_location: str) -> str:
    """Generate an image using DALL-E and save it to disk."""
    os.makedirs(os.path.dirname(file_location), exist_ok=True)

    response = openai.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )

    # Download and save the image
    image_url = response.data[0].url
    image_data = requests.get(image_url).content
    with open(file_location, "wb") as f:
        f.write(image_data)

    return file_location


if __name__ == "__main__":
    # Training simulation
    for step in range(3):
        # Simulate training
        run.log({"Some training metric": step})

        # Generate images for evaluation (using the first 2 prompts)
        validation_images = [
            generate_and_save_image(prompt, f"validation_images/generated_image_run_{step}_{id}.png")
            for id, prompt in list(evaluator.prompts.items())[:2]
        ]

        # Evaluate asynchronously
        evaluator.evaluate(validation_images)
        print("This will run immediately, but the evaluations will run in the background.")

    # Wait for all evaluations
    evaluator.wait_for_all_evaluations()
    run.finish()
```
By default, the `ImageEvaluator` compares your generated images against a pre-defined set of baseline images from GPT-4o. However, you can define your own custom baseline images and prompts for more targeted evaluation scenarios.
Use the `define_baseline()` method to specify your own baseline images and prompts:
```python
# Define custom baseline with your own images and prompts
evaluator.define_baseline(
    image_paths=[
        "path/to/baseline_image_1.png",
        "path/to/baseline_image_2.png",
        "https://example.com/remote_baseline.jpg",  # URLs also supported
    ],
    prompts=[
        "A serene mountain landscape",
        "A futuristic city skyline",
        "An abstract geometric pattern",
    ],
)
```
When you define a custom baseline:
- Image Naming: Your generated images no longer need to follow the `*_{prompt_id}.png` naming convention
- Direct Comparison: Each generated image is compared directly against the corresponding baseline image at the same index
- Custom Prompts: The evaluation uses your provided prompts instead of the default dataset
- Matched Pairs: The number of generated images must match the number of baseline images
```python
import wandb
from src.crowd_eval.checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator

# Initialize
run = wandb.init(project="custom-baseline-eval")
evaluator = ImageEvaluator(wandb_run=run, model_name="my-model")

# Set up custom baseline
evaluator.define_baseline(
    image_paths=[
        "baselines/reference_1.png",
        "baselines/reference_2.png",
    ],
    prompts=[
        "A red sports car",
        "A sunset over the ocean",
    ],
)

# Training loop
for step in range(10):
    # Your training code here...

    if step % 5 == 0:
        # Generate images for your custom prompts
        generated_images = [
            f"outputs/step_{step}_car.png",     # Compares against baselines/reference_1.png
            f"outputs/step_{step}_sunset.png",  # Compares against baselines/reference_2.png
        ]

        # Evaluate against your custom baseline
        evaluator.evaluate(generated_images)

# Wait for evaluations and finish
evaluator.wait_for_all_evaluations()
run.finish()
```
- Domain-Specific Evaluation: Use baselines relevant to your specific use case
- Consistent Comparison: Compare against the same reference images across training runs
- Flexible Prompts: Use any prompts that make sense for your model's intended application
- Quality Control: Establish known-good reference images as quality benchmarks
"Invalid prompt ids" error:
- Ensure image filenames follow the pattern `*_{prompt_id}.png`
- Check that `{prompt_id}` exists in the evaluation dataset (see the snippet below)
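To see which prompt IDs are available, you can inspect the evaluator's prompt dictionary (the same `evaluator.prompts` used in the example above):

```python
# Print the prompt IDs that generated image filenames must end with.
print(list(evaluator.prompts.keys()))
```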
Evaluations not appearing in wandb:
- Call `evaluator.wait_for_all_evaluations()` before `run.finish()`
- Check your Rapidata API credentials
- Verify internet connectivity for API calls
"Module not found" error:
- Ensure you have the correct dependencies installed
- Ensure your example code is run from the root of the repository
Required:
- `RAPIDATA_CLIENT_ID`: Your Rapidata client ID (not required if running locally)
- `RAPIDATA_CLIENT_SECRET`: Your Rapidata client secret (not required if running locally)
Optional:
- `OPENAI_API_KEY`: For the image generation examples
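When running on a server, a quick sanity check along these lines (illustrative, not part of the library) can catch missing credentials before a long training run starts:

```python
import os

# Fail early if the Rapidata credentials required for server-side runs are missing.
missing = [k for k in ("RAPIDATA_CLIENT_ID", "RAPIDATA_CLIENT_SECRET") if not os.getenv(k)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```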