# Yarn

Yarn is a FastAPI backend service that provides text-to-video retrieval using GenAI technologies. The system indexes video files by extracting key frames and generating embeddings, then matches user text queries to the most relevant videos.

## Features
- Automatic video indexing on startup
- Frame extraction using scene detection
- Embedding generation using CLIP
- Query processing using GPT-4o
- Multiple image generation models:
  - Stable Diffusion via the Replicate API (default)
  - DALL-E via the OpenAI API
- Two frame generation modes:
  - Independent: each frame generated separately
  - Continuous: frames generated as a coherent sequence
- Dynamic Time Warping (DTW) for comparing query and video embeddings (see the sketch after this list)
- REST API for searching videos by text query
- Debug mode for logging queries, frame descriptions, and generated images
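As a rough illustration of how DTW can compare a sequence of query-frame embeddings against a video's frame embeddings, here is a minimal sketch; it is not necessarily Yarn's exact implementation, and the choice of cosine distance is an assumption:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dtw_distance(query_embs: np.ndarray, video_embs: np.ndarray) -> float:
    """Dynamic Time Warping distance between two embedding sequences.

    query_embs: (n, d) array of frame embeddings for the query.
    video_embs: (m, d) array of frame embeddings for a video.
    """
    n, m = len(query_embs), len(video_embs)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query_embs[i - 1], video_embs[j - 1])
            # Each cell extends the cheapest of the three allowed alignments.
            cost[i, j] = d + min(cost[i - 1, j],      # skip a query frame
                                 cost[i, j - 1],      # skip a video frame
                                 cost[i - 1, j - 1])  # match both frames
    return float(cost[n, m])
```

A lower DTW distance indicates a closer alignment between the two embedding sequences.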
## Setup

- Clone the repository
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with the following variables:
  ```env
  # OpenAI API key (required for GPT-4o frame descriptions and DALL-E if selected)
  OPENAI_API_KEY=your_openai_api_key_here

  # Replicate API token (required for Stable Diffusion if selected)
  REPLICATE_API_TOKEN=your_replicate_api_token_here

  # Video directory and search defaults
  VIDEO_DIRECTORY=./videos
  MAX_FRAMES_DEFAULT=5
  TOP_K_DEFAULT=3

  # Debug settings
  DEBUG_MODE=false
  DEBUG_LOG_DIR=./logs/debug
  DEBUG_FRAMES_DIR=./logs/frames
  ```
- Place your videos in the `videos` directory
- Start the server:

  ```bash
  python run.py
  ```

- The API will be available at http://localhost:8000
- API documentation will be available at http://localhost:8000/docs
- Test the API using the provided test client:

  ```bash
  # Using Stable Diffusion (default) and independent frame generation
  python test_client.py --query "A person walking on the beach"

  # Using DALL-E for image generation
  python test_client.py --query "A person walking on the beach" --image-model dalle

  # Using continuous frame generation mode for more coherent sequences
  python test_client.py --query "A person walking on the beach" --frame-mode continuous
  ```
## Foundation Models

Yarn implements a flexible foundation model framework that lets you choose different models for different tasks:

- Stable Diffusion (default): uses the Replicate API to generate images
  - Independent mode: standard image generation
  - Continuous mode: inpainting to maintain visual consistency
- DALL-E: uses the OpenAI API to generate images
  - Independent mode: DALL-E 3 for high-quality images
  - Continuous mode: DALL-E 2's edit capabilities for consistency
- CLIP: OpenAI's Contrastive Language-Image Pre-training model for generating embeddings from images (a minimal embedding sketch follows this list)
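For reference, here is a minimal sketch of producing a CLIP image embedding with the Hugging Face `transformers` library. The checkpoint name and helper function are assumptions for illustration, not Yarn's actual code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; Yarn may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> torch.Tensor:
    """Return a normalized CLIP embedding for one image file."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # L2-normalize so cosine similarity reduces to a dot product.
    return features / features.norm(dim=-1, keepdim=True)
```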
## API

### POST /api/search/

Request body:

```json
{
  "query": "A person running on the beach at sunset",
  "max_frames": 5,
  "top_k": 3,
  "frame_mode": "independent",
  "image_model": "sd",
  "embedding_model": "clip"
}
```

Response:

```json
[
  {
    "video_id": "beach_sunset.mp4",
    "distance": 0.28,
    "score": 0.78
  },
  {
    "video_id": "exercise.mp4",
    "distance": 0.45,
    "score": 0.69
  },
  {
    "video_id": "nature.mp4",
    "distance": 0.52,
    "score": 0.66
  }
]
```
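Outside the bundled test client, the endpoint can be exercised with a few lines of Python; the host and port below assume the default local server, and the payload values are illustrative:

```python
import requests

payload = {
    "query": "A person running on the beach at sunset",
    "max_frames": 5,
    "top_k": 3,
    "frame_mode": "independent",
    "image_model": "sd",
    "embedding_model": "clip",
}

# Assumes the server is running locally on the default port.
response = requests.post("http://localhost:8000/api/search/", json=payload)
response.raise_for_status()

for result in response.json():
    print(f'{result["video_id"]}: score={result["score"]:.2f}')
```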
## Frame Generation Modes

Yarn supports two modes for generating frames from a query.

### Independent Mode

Each frame is generated independently from its description. This gives the model maximum freedom to interpret each description without constraints from previous frames.

### Continuous Mode

Frames are generated sequentially, with each frame using the previous frame as a reference:

- Stable Diffusion: uses inpainting to maintain consistent elements while updating others
- DALL-E: uses DALL-E 2's image editing capabilities to maintain consistency

This produces a more coherent sequence of frames, with consistent characters, settings, colors, and visual style throughout. A rough sketch of both loops follows.
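To make the two modes concrete, here is a hypothetical sketch of both loops. `generate_image` and `generate_from_reference` are stand-ins for whichever backend (Replicate or OpenAI) is selected, not Yarn's actual function names:

```python
from typing import Callable, List

# Hypothetical backend hooks: text -> image, and (text, previous image) -> image.
GenerateFn = Callable[[str], bytes]
GenerateRefFn = Callable[[str, bytes], bytes]

def generate_independent(descriptions: List[str],
                         generate_image: GenerateFn) -> List[bytes]:
    """Independent mode: each description is rendered on its own."""
    return [generate_image(d) for d in descriptions]

def generate_continuous(descriptions: List[str],
                        generate_image: GenerateFn,
                        generate_from_reference: GenerateRefFn) -> List[bytes]:
    """Continuous mode: each frame conditions on the previous one."""
    frames: List[bytes] = []
    for description in descriptions:
        if not frames:
            frames.append(generate_image(description))  # first frame has no reference
        else:
            frames.append(generate_from_reference(description, frames[-1]))
    return frames
```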
## Debug Mode

Yarn includes a debug mode that logs detailed information about each search query, including:
- The original query text and parameters
- Generated frame descriptions
- Generated images for each frame description
- Search results
- Foundation models used
To enable debug mode, set the `DEBUG_MODE` environment variable to `true` in your `.env` file:

```env
DEBUG_MODE=true
```
Debug logs will be saved to:

- `./logs/debug/[session_id]/query.json` - Query details
- `./logs/debug/[session_id]/descriptions.json` - Generated frame descriptions
- `./logs/debug/[session_id]/results.json` - Search results
- `./logs/frames/[session_id]/frame_##_*.jpg` - Generated images
- `./logs/frames/[session_id]/metadata.json` - Image metadata
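If you want to inspect a session programmatically, a small helper might look like this; the session-ID layout follows the paths above, and the JSON structure is an assumption:

```python
import json
from pathlib import Path

def load_debug_session(session_id: str, log_root: str = "./logs/debug") -> dict:
    """Load the three JSON artifacts written for one debug session."""
    session_dir = Path(log_root) / session_id
    return {
        name: json.loads((session_dir / f"{name}.json").read_text())
        for name in ("query", "descriptions", "results")
    }
```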
## Extending with New Models

Yarn's architecture is designed to be easily extended with new foundation models:

- Add a new model enum value in `app/models/video.py`
- Implement the model class in `app/services/foundation_models.py` by extending the appropriate abstract base class
- Register the model in the `ModelFactory` class

A sketch of these three steps is shown below.
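The following sketch shows the rough shape of those three steps. The enum values, base-class interface, and factory layout here are illustrative assumptions, not Yarn's actual definitions:

```python
from abc import ABC, abstractmethod
from enum import Enum

# Step 1: add an enum value (illustrative; see app/models/video.py).
class ImageModel(str, Enum):
    SD = "sd"
    DALLE = "dalle"
    MY_MODEL = "my_model"  # the new model

# Step 2: extend the abstract base class (illustrative; see
# app/services/foundation_models.py for the real interface).
class ImageGenerationModel(ABC):
    @abstractmethod
    def generate(self, description: str) -> bytes: ...

class MyModel(ImageGenerationModel):
    def generate(self, description: str) -> bytes:
        raise NotImplementedError("call your model's API here")

# Step 3: register the new class with the factory.
class ModelFactory:
    _image_models = {
        ImageModel.MY_MODEL: MyModel,
    }

    @classmethod
    def create_image_model(cls, model: ImageModel) -> ImageGenerationModel:
        return cls._image_models[model]()
```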
## Requirements

- Python 3.8+
- FastAPI
- OpenCV
- PyTorch
- Transformers (CLIP)
- OpenAI API access
- Replicate API access
- CUDA-capable GPU (recommended but not required)