diff --git a/.cursor/rules/agent-development.mdc b/.cursor/rules/agent-development.mdc new file mode 100644 index 0000000..29d121f --- /dev/null +++ b/.cursor/rules/agent-development.mdc @@ -0,0 +1,171 @@ + +# LiveKit Agent Workflows + +## Agent Architecture Overview + +LiveKit Agents implement conversational AI workflows through a structured pipeline: +- **Speech-to-Text (STT)**: Convert audio input to text +- **Large Language Model (LLM)**: Process conversation and generate responses +- **Text-to-Speech (TTS)**: Convert text responses to audio +- **Turn Detection**: Determine when the user has finished speaking +- **Voice Activity Detection (VAD)**: Detect speech presence + +## Agent Implementation Patterns + +### Core Agent Class +```python +from livekit.agents import Agent, RunContext, function_tool + +class ConversationalAgent(Agent): +    def __init__(self): +        # Define agent behavior through instructions passed to the base class +        super().__init__(instructions=""" +            System prompt defining: +            - Agent personality and role +            - Available capabilities +            - Communication style +            - Behavioral boundaries +        """) + +    @function_tool +    async def custom_capability(self, context: RunContext, parameter: str): +        """Function tools extend agent capabilities beyond conversation. + +        Args: +            parameter: Clear description for LLM understanding +        """ +        # Implementation logic +        return "Tool result" +``` + +### Agent Lifecycle & Context + +#### RunContext Usage +- **Session Access**: `context.room` for room information +- **State Management**: Track conversation state across turns +- **Event Handling**: Respond to room events and participant actions +- **Resource Management**: Handle cleanup and resource disposal + +#### Conversation Flow +1. **Audio Reception**: Agent receives participant audio stream +2. **Speech Processing**: STT converts audio to text transcript +3. **LLM Processing**: Language model generates response using instructions and tools +4. **Audio Generation**: TTS converts response to audio +5. 
**Turn Management**: System detects conversation turns and manages interruptions + +## Pipeline Configuration Patterns + +### Session Setup +```python +async def entrypoint(ctx: JobContext): +    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY) + +    # Configure the conversational AI pipeline +    session = AgentSession( +        stt=provider.STT(),          # Speech recognition +        llm=provider.LLM(),          # Language understanding/generation +        tts=provider.TTS(),          # Speech synthesis +        turn_detector=provider.TD(), # End-of-turn detection +        vad=provider.VAD()           # Voice activity detection +    ) + +    # Start the agent workflow (session.start is a coroutine and must be awaited) +    await session.start(agent=YourAgent(), room=ctx.room) +``` + +### Pipeline Variations + +#### Traditional Multi-Provider Pipeline +- Separate providers for each component (STT, LLM, TTS) +- Maximum flexibility in provider selection +- Optimized for specific use cases (latency, quality, cost) + +#### Unified Provider Pipeline (e.g., OpenAI Realtime) +- Single provider handles entire conversation flow +- Reduced latency through integrated processing +- Built-in voice activity detection and turn management + +## Function Tool Patterns + +### Tool Design Principles +- **Clear Documentation**: LLM uses docstrings to understand tool purpose +- **Error Handling**: Graceful failure with meaningful user feedback +- **Async Implementation**: Non-blocking execution for real-time performance +- **Context Awareness**: Leverage RunContext for session-specific behavior + +### Tool Categories +- **Information Retrieval**: API calls, database queries, web searches +- **Actions**: External system integration, state changes +- **Computation**: Data processing, calculations, transformations +- **Media Processing**: Image analysis, file handling, content generation + +## Voice Pipeline Optimization + +### Turn Detection Strategies +- **VAD-Only**: Simple voice activity detection +- **Semantic Turn Detection**: Context-aware conversation boundaries +- **Hybrid Approach**: VAD + semantic analysis for optimal user experience + +### Latency Optimization +- **Model Selection**: Balance capability vs. 
response time +- **Streaming**: Real-time processing where supported +- **Caching**: Reduce repeated processing overhead +- **Connection Management**: Maintain persistent connections + +## Error Handling & Resilience + +### Common Failure Modes +- **Provider Outages**: Network issues, service unavailability +- **Audio Quality**: Poor input affecting transcription accuracy +- **Tool Failures**: External service errors, timeout conditions +- **Resource Limits**: Rate limiting, quota exhaustion + +### Resilience Patterns +- **Graceful Degradation**: Reduced functionality during partial failures +- **Retry Logic**: Intelligent retry with backoff strategies +- **Fallback Providers**: Alternative services for critical components +- **User Communication**: Clear error messages and recovery guidance + +## Testing Conversational Agents + +### LLM-Based Evaluation +```python +# Test conversational behavior with semantic evaluation +@pytest.mark.asyncio +async def test_agent_response(): +    async with AgentSession(llm=test_llm) as session: +        await session.start(YourAgent()) +        result = await session.run(user_input="test scenario") + +        # Evaluate response quality using LLM judgment +        await result.expect.next_event().is_message(role="assistant").judge( +            llm=judge_llm, +            intent="Expected behavior description" +        ) +``` + +### Tool Testing +```python +# Mock external dependencies for reliable testing +with mock_tools(YourAgent, {"external_api": lambda: "mocked API response"}): +    # Test tool behavior under controlled conditions +    result = await session.run(user_input="input that exercises the mocked tool") +``` + +## Monitoring & Observability + +### Built-in Metrics +- **Performance**: Latency, throughput, error rates +- **Usage**: Token consumption, API calls, session duration +- **Quality**: Turn accuracy, interruption handling, user satisfaction + +### Custom Metrics Collection +```python +@session.on("metrics_collected") +def handle_metrics(event: MetricsCollectedEvent): +    # Process and forward metrics to monitoring systems +    custom_analytics.track(event.metrics) +``` + +Metrics reported per pipeline stage: +- STT: Audio duration, transcript time, streaming mode +- LLM: Completion duration, token usage, TTFT +- TTS: Audio duration, character count, generation time diff --git a/.cursor/rules/ai-providers.mdc b/.cursor/rules/ai-providers.mdc new file mode 100644 index 0000000..45fdba5 --- /dev/null +++ b/.cursor/rules/ai-providers.mdc @@ -0,0 +1,267 @@ +--- +description: "LiveKit AI provider integrations and advanced configurations" +--- + +# AI Provider Integrations & Extensions + +Advanced patterns for integrating different AI providers and extending agent capabilities. 
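+ +Because every component follows a common interface, swapping providers is usually a one-line constructor change. A minimal sketch of the swap point (imports follow this document's `livekit.agents.integrations` convention; model names are illustrative): +```python +from livekit.agents import AgentSession +from livekit.agents.integrations import anthropic, cartesia, deepgram, openai + +# The pipeline shape stays fixed; only the provider constructors change. +session = AgentSession( +    stt=deepgram.STT(model="nova-3"), +    llm=openai.LLM(model="gpt-4o-mini"),          # swap point: +    # llm=anthropic.LLM(model="claude-3-haiku"),  # drop-in replacement +    tts=cartesia.TTS(model="sonic-english"), +) +```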
+ +## Provider Swapping Patterns + +### LLM Providers +All follow consistent interfaces, so providers are easy to swap: + +#### OpenAI +```python +from livekit.agents.integrations import openai + +llm = openai.LLM(model="gpt-4o-mini") +# Realtime API alternative +llm = openai.realtime.RealtimeModel( +    model="gpt-4o-realtime-preview", +    voice="alloy", +    temperature=0.8, +) +``` + +#### Anthropic Claude +```python +from livekit.agents.integrations import anthropic + +llm = anthropic.LLM(model="claude-3-haiku") +``` + +#### Google Gemini +```python +from livekit.agents.integrations import google + +llm = google.LLM(model="gemini-1.5-flash") +``` + +#### Azure OpenAI +```python +from livekit.agents.integrations import azure_openai + +llm = azure_openai.LLM( +    model="gpt-4o", +    azure_endpoint="https://your-resource.openai.azure.com/", +    api_version="2024-02-15-preview" +) +``` + +### STT Providers + +#### Deepgram (Recommended) +```python +from livekit.agents.integrations import deepgram + +stt = deepgram.STT( +    model="nova-3", +    language="multi"  # Multilingual support +) +``` + +#### AssemblyAI +```python +from livekit.agents.integrations import assemblyai + +stt = assemblyai.STT() +``` + +#### Azure AI Speech +```python +from livekit.agents.integrations import azure_ai_speech + +stt = azure_ai_speech.STT( +    speech_key="your-key", +    speech_region="your-region" +) +``` + +### TTS Providers + +#### Cartesia (Low Latency) +```python +from livekit.agents.integrations import cartesia + +tts = cartesia.TTS( +    model="sonic-english", +    voice="79a125e8-cd45-4c13-8a67-188112f4dd22" +) +``` + +#### ElevenLabs (High Quality) +```python +from livekit.agents.integrations import elevenlabs + +tts = elevenlabs.TTS( +    model="eleven_turbo_v2_5", +    voice="rachel" +) +``` + +#### Azure AI Speech +```python +from livekit.agents.integrations import azure_ai_speech + +tts = azure_ai_speech.TTS( +    speech_key="your-key", +    speech_region="your-region", +    voice="en-US-JennyNeural" +) +``` + +## Advanced Pipeline Configurations + +### OpenAI Realtime API Complete Pipeline +Replace the traditional STT-LLM-TTS pipeline with a single provider: +```python +async def entrypoint(ctx: JobContext): +    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY) + +    session = AgentSession( +        llm=openai.realtime.RealtimeModel( +            model="gpt-4o-realtime-preview", +            voice="alloy", +            temperature=0.8, +            instructions="Your agent instructions here" +        ) +    ) + +    await session.start(agent=Assistant(), room=ctx.room) +``` + +### Multimodal Vision Support +For agents that process visual input: +```python +# Configure for multimodal input +session = AgentSession( +    llm=openai.LLM( +        model="gpt-4o",  # Vision-capable model +        temperature=0.7 +    ), +    # ... other components +) + +# In your agent tools +@function_tool +async def analyze_image(self, context: RunContext, description: str): +    """Analyze images shared in the conversation. 
+ + Args: + description: Description of what to look for in the image + """ + # Access video frames or images from context + # Implement image analysis logic + return "Analysis result" +``` + +### Turn Detection Options + +#### LiveKit Turn Detector (Recommended) +```python +from livekit.agents import turn_detector as detect + +# English optimized (smaller, faster) +turn_detector = detect.EnglishModel() + +# Multilingual support (larger, more languages) +turn_detector = detect.MultilingualModel() +``` + +#### VAD-only Detection +```python +from livekit.agents.integrations import silero + +vad = silero.VAD( + min_speech_duration=0.1, + min_silence_duration=0.5 +) +``` + +## Environment Variables by Provider + +### OpenAI +```bash +OPENAI_API_KEY=sk-... +OPENAI_ORG_ID=org-... # Optional +``` + +### Anthropic +```bash +ANTHROPIC_API_KEY=sk-ant-... +``` + +### Google +```bash +GOOGLE_API_KEY=AIza... +# OR for service account +GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json +``` + +### Azure +```bash +AZURE_OPENAI_API_KEY=... +AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/ +AZURE_SPEECH_KEY=... +AZURE_SPEECH_REGION=... +``` + +### Other Providers +```bash +DEEPGRAM_API_KEY=... +CARTESIA_API_KEY=... +ASSEMBLYAI_API_KEY=... +ELEVENLABS_API_KEY=... +``` + +## Performance Considerations + +### Latency Optimization +- **Ultra-low latency**: OpenAI Realtime API +- **Low latency**: Cartesia TTS + Deepgram STT +- **Balanced**: Standard pipeline with optimized models +- **High quality**: ElevenLabs TTS with larger models + +### Cost Optimization +- **Budget**: Use smaller models (gpt-4o-mini, claude-haiku) +- **Balanced**: Mix providers based on capability needs +- **Premium**: Larger models for complex reasoning tasks + +### Model Selection Guidelines +- **Reasoning**: GPT-4o, Claude-3.5-Sonnet +- **Speed**: GPT-4o-mini, Claude-3-Haiku, Gemini Flash +- **Multimodal**: GPT-4o, Claude-3.5-Sonnet +- **Code**: GPT-4o, Claude-3.5-Sonnet +- **Conversations**: All models suitable + +## Integration Examples + +### Hybrid Provider Setup +Use different providers for different capabilities: +```python +session = AgentSession( + stt=deepgram.STT(model="nova-3"), # Best STT + llm=anthropic.LLM(model="claude-3-haiku"), # Fast LLM + tts=cartesia.TTS(model="sonic-english"), # Low-latency TTS + turn_detector=detect.MultilingualModel(), # Best turn detection + vad=silero.VAD() +) +``` + +### Provider Fallbacks +Implement graceful fallbacks between providers: +```python +try: + llm = openai.LLM(model="gpt-4o-mini") +except Exception: + logger.warning("OpenAI unavailable, falling back to Anthropic") + llm = anthropic.LLM(model="claude-3-haiku") +``` + +## Documentation Links +- **All Integrations**: https://docs.livekit.io/agents/integrations/ +- **LLM Providers**: https://docs.livekit.io/agents/integrations/llm/ +- **STT Providers**: https://docs.livekit.io/agents/integrations/stt/ +- **TTS Providers**: https://docs.livekit.io/agents/integrations/tts/ +- **OpenAI Realtime**: https://docs.livekit.io/agents/integrations/realtime/openai diff --git a/.cursor/rules/deployment-config.mdc b/.cursor/rules/deployment-config.mdc new file mode 100644 index 0000000..a01f272 --- /dev/null +++ b/.cursor/rules/deployment-config.mdc @@ -0,0 +1,147 @@ +--- +globs: "*.toml,*.env,*.env.example,Dockerfile,.dockerignore,*.yaml,*.yml" +description: "LiveKit Agent deployment and configuration guidance" +--- + +# Deployment & Configuration Guide + +Configuration patterns for deploying LiveKit agents using 
[pyproject.toml](mdc:pyproject.toml), [Dockerfile](mdc:Dockerfile), and [livekit.toml](mdc:livekit.toml). + +## Environment Configuration + +### Required Environment Variables +Essential API keys and configuration: +- `LIVEKIT_URL` - LiveKit server URL +- `LIVEKIT_API_KEY` - LiveKit API key +- `LIVEKIT_API_SECRET` - LiveKit API secret +- `OPENAI_API_KEY` - OpenAI API key +- `DEEPGRAM_API_KEY` - Deepgram STT API key +- `CARTESIA_API_KEY` - Cartesia TTS API key + +### Environment Setup Commands +- `lk app env -w .env` - Auto-load LiveKit environment using CLI +- Copy `.env.example` to `.env` and configure values +- Use `python-dotenv` for loading environment variables in code + +## Dependency Management with uv + +### Core Dependencies in pyproject.toml +```toml +[project] +dependencies = [ +    "livekit-agents", +    "livekit-agents-integrations[openai,deepgram,cartesia,silero]", +    "python-dotenv", +] + +[project.optional-dependencies] +dev = [ +    "pytest>=8.0.0", +    "pytest-asyncio", +    "ruff", +] +``` + +### Installation Commands +- `uv sync` - Install production dependencies +- `uv sync --dev` - Install with development tools +- `uv lock` - Update dependency lock file +- `uv add package-name` - Add new dependency + +## Docker Deployment + +### Dockerfile Best Practices +Based on included [Dockerfile](mdc:Dockerfile): +- Use Python slim base image for smaller size +- Install uv for fast dependency management +- Copy requirements and install before copying source code +- Use non-root user for security +- Set proper environment variables +- Handle model downloads in initialization + +### Docker Commands +- `docker build -t agent .` - Build image +- `docker run --env-file .env agent` - Run with environment +- `docker run -d --restart unless-stopped --env-file .env agent` - Run as a daemon with automatic restarts + +## LiveKit Cloud Configuration + +### livekit.toml Structure +```toml +[project] +name = "agent-starter-python" +agent_dir = "src" +watch_paths = ["src"] + +[env_vars] +OPENAI_API_KEY = "$OPENAI_API_KEY" +DEEPGRAM_API_KEY = "$DEEPGRAM_API_KEY" +CARTESIA_API_KEY = "$CARTESIA_API_KEY" +``` + +### Cloud Deployment Commands +- `lk deploy` - Deploy to LiveKit Cloud +- `lk logs` - View deployment logs +- `lk status` - Check deployment status + +## CI/CD Integration + +### GitHub Actions Workflows +Common workflows to include: +- **Linting**: `ruff check` and `ruff format --check` +- **Testing**: `pytest` with evaluation tests +- **Docker Build**: Build and push container images +- **Deployment**: Auto-deploy on main branch + +### Pre-commit Hooks +Recommended hooks for code quality (pre-commit requires a pinned `rev` for each repo): +```yaml +repos: +  - repo: https://github.com/astral-sh/ruff-pre-commit +    rev: v0.6.9  # pin to a released tag +    hooks: +      - id: ruff +      - id: ruff-format +``` + +## Production Considerations + +### Performance Optimization +- Pre-download models before deployment +- Use appropriate model sizes for latency requirements +- Configure turn detection thresholds +- Monitor resource usage and scaling + +### Security Best Practices +- Never commit API keys to version control +- Use environment variables for all secrets +- Run containers as non-root users +- Implement proper error handling and logging +- Use HTTPS/WSS for all connections + +### Monitoring & Logging +- Enable metrics collection in agent code +- Set up log aggregation for production +- Monitor usage and costs across providers +- Implement health checks and alerting + +### Scaling Considerations +- Design agents to be stateless where possible +- Use proper resource limits in containers +- Implement connection pooling for external services +- 
Consider load balancing for multiple instances + +## File Management + +### Files to Track in Version Control +- `uv.lock` - For reproducible builds +- `livekit.toml` - If using LiveKit Cloud +- `.env.example` - Template for environment variables +- `Dockerfile` and `.dockerignore` +- CI/CD workflow files + +### Files to Ignore +- `.env` - Contains secrets +- `*.log` - Log files +- Model download caches +- Virtual environment directories +- IDE-specific files diff --git a/.cursor/rules/livekit-core.mdc b/.cursor/rules/livekit-core.mdc new file mode 100644 index 0000000..81f1a30 --- /dev/null +++ b/.cursor/rules/livekit-core.mdc @@ -0,0 +1,74 @@ +--- +alwaysApply: true +description: "Core LiveKit Agent development guidance and patterns" +--- + +# LiveKit Agent Development Guide + +This is a LiveKit Agent starter template. Use these patterns and commands for development. + +## Essential Commands + +### Setup & Environment +- `uv sync` - Install dependencies to virtual environment +- `uv sync --dev` - Install dev tools (pytest, ruff) +- Copy `.env.example` to `.env` and configure API keys +- `lk app env -w .env` - Auto-load LiveKit environment using CLI + +### Running the Agent +- `uv run python src/agent.py download-files` - Download models before first run +- `uv run python src/agent.py console` - Terminal interaction mode +- `uv run python src/agent.py dev` - Development mode for frontend/telephony +- `uv run python src/agent.py start` - Production mode + +### Code Quality +- `uv run ruff check .` - Run linter +- `uv run ruff format .` - Format code +- `uv run pytest` - Run tests + +## Architecture Patterns + +### Core Components +LiveKit agents follow this structure: +- **Agent Class** - Inherits from `Agent`, contains instructions and function tools +- **Entrypoint Function** - Sets up the voice AI pipeline (STT/LLM/TTS) +- **Function Tools** - Extend agent capabilities beyond conversation + +### Voice AI Pipeline +- **STT**: Deepgram Nova-3 (multilingual) +- **LLM**: OpenAI GPT-4o-mini (easily swappable) +- **TTS**: Cartesia for voice synthesis +- **Turn Detection**: LiveKit's multilingual turn detection +- **VAD**: Silero VAD for voice activity detection + +## Key Development Patterns + +### Extending Agent Capabilities +Function tools enable agents to perform actions beyond conversation: +```python +@function_tool +async def external_integration(self, context: RunContext, param: str): +    """Tools extend agent capabilities with external integrations. + +    Args: +        param: Clear parameter description for LLM understanding +    """ +    # Integration logic (APIs, databases, computations, etc.) +    return "result" +``` + +### Modular AI Pipeline +LiveKit's provider abstraction enables flexible AI component selection: +- **Language Models**: OpenAI, Anthropic, Google, Azure, local models +- **Speech Recognition**: Deepgram, AssemblyAI, Azure, Google, OpenAI +- **Voice Synthesis**: Cartesia, ElevenLabs, Azure, Polly, OpenAI + +### Environment Variables Required +- `LIVEKIT_URL`, `LIVEKIT_API_KEY`, `LIVEKIT_API_SECRET` +- Provider keys: `OPENAI_API_KEY`, `DEEPGRAM_API_KEY`, `CARTESIA_API_KEY` +- These keys cover only the default providers; swapping in a different provider typically requires its own API key. A minimal entrypoint wiring these defaults together is sketched below. 
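+ +### Default Pipeline Sketch +A minimal entrypoint wiring the default pipeline above might look like the following. This is a sketch, not the template's exact source: import paths follow this document's `livekit.agents.integrations` convention, and constructor arguments vary by plugin version. +```python +from dotenv import load_dotenv +from livekit.agents import Agent, AgentSession, JobContext +from livekit.agents.integrations import cartesia, deepgram, openai, silero + +load_dotenv()  # pull LIVEKIT_* and provider keys from .env + +class Assistant(Agent): +    def __init__(self): +        super().__init__(instructions="You are a helpful voice assistant.") + +async def entrypoint(ctx: JobContext): +    await ctx.connect() +    session = AgentSession( +        stt=deepgram.STT(model="nova-3", language="multi"),  # multilingual STT +        llm=openai.LLM(model="gpt-4o-mini"),                 # easily swappable +        tts=cartesia.TTS(),                                  # voice synthesis +        vad=silero.VAD(),                                    # voice activity detection +    ) +    await session.start(agent=Assistant(), room=ctx.room) +```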
+ +## Resources +- **Documentation**: https://docs.livekit.io/agents/ (append `.md` for markdown format, or fetch `/llms.txt` at the root for a full index) +- **Extensive collection of practical examples**: https://github.com/livekit-examples/python-agents-examples +- **Frontend Starters**: [React](https://github.com/livekit-examples/agent-starter-react), [Swift](https://github.com/livekit-examples/agent-starter-swift), [Android](https://github.com/livekit-examples/agent-starter-android), [Flutter](https://github.com/livekit-examples/agent-starter-flutter), [React Native](https://github.com/livekit-examples/agent-starter-react-native), and [Web Embed](https://github.com/livekit-examples/agent-starter-embed) templates available diff --git a/.cursor/rules/testing-patterns.mdc b/.cursor/rules/testing-patterns.mdc new file mode 100644 index 0000000..f7edf45 --- /dev/null +++ b/.cursor/rules/testing-patterns.mdc @@ -0,0 +1,154 @@ +--- +globs: "**/test_*.py,tests/*.py,**/tests/**/*.py" +description: "LiveKit Agent testing patterns and evaluation framework" +--- + +# LiveKit Agent Testing Patterns + +Use LiveKit's evaluation-based testing framework from [tests/test_agent.py](mdc:tests/test_agent.py). + +## Core Testing Pattern + +### Basic Agent Test Structure +```python +@pytest.mark.asyncio +async def test_conversational_behavior(): +    """Test agent behavior with LLM-based evaluation.""" +    async with AgentSession(llm=test_llm) as session: +        await session.start(YourAgent()) + +        result = await session.run(user_input="Test scenario input") + +        await result.expect.next_event().is_message(role="assistant").judge( +            llm=judge_llm, +            intent="Expected behavioral outcome description" +        ) +``` + +### Key Testing Guidelines + +#### Test Categories to Implement +- **Expected Behavior**: Core functionality works correctly +- **Tool Usage**: Function calls with proper arguments +- **Error Handling**: Graceful failure responses +- **Factual Grounding**: Accurate information, admits unknowns +- **Misuse Resistance**: Refuses inappropriate requests + +#### Evaluation with `.judge()` +- Use descriptive `intent` parameters for LLM evaluation +- Test both successful and error conditions +- Verify tool calls happen when expected +- Check response quality and appropriateness + +## Testing Patterns + +### Tool Testing with Mocks +Test error conditions and edge cases: +```python +@pytest.mark.asyncio +async def test_tool_error_handling(): +    """Test graceful handling of tool errors.""" +    def mock_failing_tool(): +        raise Exception("Simulated error") + +    with mock_tools(YourAgent, {"external_service": mock_failing_tool}): +        async with AgentSession(llm=test_llm) as session: +            await session.start(YourAgent()) + +            result = await session.run(user_input="Test tool failure scenario") + +            await result.expect.next_event().is_message(role="assistant").judge( +                llm=judge_llm, intent="Gracefully handles service unavailability" +            ) +``` + +### Conversation Flow Testing +Test multi-turn interactions: +```python +@pytest.mark.asyncio +async def test_conversation_flow(): +    """Test multi-turn conversation handling.""" +    async with AgentSession(llm=test_llm) as session: +        await session.start(YourAgent()) + +        # First interaction +        result1 = await session.run(user_input="Initial greeting") +        await result1.expect.next_event().is_message(role="assistant").judge( +            llm=judge_llm, intent="Responds appropriately to greeting" +        ) + +        # Follow-up interaction +        result2 = await session.run(user_input="Follow-up request") +        await 
result2.expect.next_event().is_message(role="assistant").judge( + llm=judge_llm, intent="Maintains context from previous interaction" + ) +``` + +### Tool Call Verification +Test that tools are called with correct parameters: +```python +@pytest.mark.asyncio +async def test_tool_parameters(): + """Test tool is called with correct parameters.""" + tool_calls = [] + + def mock_api_tool(query: str): + tool_calls.append(query) + return f"API response for: {query}" + + with mock_tools(YourAgent, {"external_api": mock_api_tool}): + async with AgentSession(llm=test_llm) as session: + await session.start(YourAgent()) + + result = await session.run(user_input="Test query requiring API call") + + # Verify tool was called + assert len(tool_calls) > 0 + assert tool_calls[0] # Verify parameter was passed + + await result.expect.next_event().is_message(role="assistant").judge( + llm=judge_llm, intent="Uses API response appropriately" + ) +``` + +## Test Execution Commands + +### Running Tests +- `uv run pytest` - Run full test suite including evaluations +- `uv run pytest tests/test_agent.py` - Run specific test file +- `uv run pytest tests/test_agent.py::test_specific` - Run specific test +- `uv run pytest -v` - Verbose output with test names +- `uv run pytest -s` - Show print statements and logs + +### Test Environment Setup +Ensure proper test environment: +```python +# Required imports for testing +import pytest +from livekit.agents import Agent, AgentSession, RunContext +from livekit.agents.integrations import openai # or your chosen provider +from livekit.agents.testing import mock_tools + +# Your agent import +from your_module import YourAgent +``` + +### Debugging Tests +- Use `logger.info()` in agent code for debugging +- Add `print()` statements in test functions (run with `-s`) +- Check test output for LLM evaluation details +- Verify mock tools are called as expected + +## Best Practices + +### Test Design +- Write specific, focused tests for individual features +- Use descriptive test names and docstrings +- Test both success and failure scenarios +- Mock external dependencies for reliable tests + +### LLM Evaluation +- Write clear, specific intent descriptions +- Test edge cases and boundary conditions +- Verify appropriate refusals for inappropriate requests +- Check factual accuracy where applicable diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..b772a1f --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,307 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with LiveKit Agent projects in Python. + +## Project Overview + +This covers multimodal AI agent development with LiveKit Agents, a realtime framework for production-grade voice, text, and vision AI agents. While this guide focuses on Python development, LiveKit also supports Node.js (beta). The concepts and patterns described here apply to building, extending, and improving LiveKit-based conversational AI agents across multiple platforms and use cases. 
+ +## Development Commands + +### Environment Setup +- `uv sync` - Install dependencies to virtual environment +- `uv sync --dev` - Install dependencies including dev tools (pytest, ruff) +- Copy `.env.example` to `.env` and configure API keys +- `lk app env -w .env` - Auto-load LiveKit environment using CLI + +### Running Agents +- `uv run python src/agent.py download-files` - Download required models (Silero VAD, LiveKit turn detector) before first run +- `uv run python src/agent.py console` - Run agent in terminal for direct interaction +- `uv run python src/agent.py dev` - Run agent for frontend/telephony integration +- `uv run python src/agent.py start` - Production mode + +### Code Quality +- `uv run ruff check .` - Run linter +- `uv run ruff format .` - Format code +- `uv run ruff check --output-format=github .` - Lint with GitHub Actions format +- `uv run ruff format --check --diff .` - Check formatting without applying changes + +### Testing +- `uv run pytest` - Run full test suite including evaluations +- `uv run pytest tests/test_agent.py::test_specific` - Run a specific test + +## Architecture Concepts + +### Core Components +- **Agent Implementation** - Main agent class inheriting from `Agent` base class +- **Agent Instructions** - System prompts and behavior definitions for the conversational AI +- **Function Tools** - Methods decorated with `@function_tool` that extend agent capabilities +- **Entrypoint Function** - Sets up the voice AI pipeline with STT/LLM/TTS components + +### Multimodal AI Pipeline Architecture +LiveKit agents use a modular pipeline approach with swappable components: +- **STT (Speech-to-Text)**: Converts audio input to text transcripts +- **LLM (Large Language Model)**: Processes conversations, text, and vision inputs to generate responses +- **TTS (Text-to-Speech)**: Converts text responses back to synthesized speech +- **Vision Processing**: Handles image and video understanding for multimodal interactions +- **Turn Detection**: Determines when users finish speaking for natural conversation flow +- **VAD (Voice Activity Detection)**: Detects when users are speaking vs silent +- **Background Audio Handling**: Manages background audio and interruption scenarios +- **Interrupt Management**: Handles conversation interruptions and context switching +- **Noise Cancellation**: Optional audio enhancement (LiveKit Cloud BVC or self-hosted alternatives) +- **Real-time Audio/Video Processing**: Low-latency multimedia stream handling + +### Testing Framework Concepts +LiveKit Agents provide evaluation-based testing: +- **AgentSession**: Test harness that simulates real conversations with LLM interactions +- **LLM-based Evaluation**: `.judge()` method evaluates agent responses against intent descriptions +- **Mock Tools**: Enable testing of error conditions and external integrations +- **End-to-End Testing**: Full conversation flow validation with real AI providers + +### Configuration Patterns +- **Environment Variables**: Store API keys and configuration separately from code +- **Provider Abstraction**: Swap AI providers without changing core agent logic +- **Modular Setup**: Configure pipeline components independently in entrypoint functions + +### Function Tools Pattern +Functions decorated with `@function_tool` extend agent capabilities: +- **Async Methods**: All tools are async methods on the Agent class +- **Structured Documentation**: Docstrings provide tool descriptions and argument specifications for LLM understanding +- **External Integration**: Connect agents to APIs, databases, computations, and other services +- **Natural Language 
Interface**: LLM decides when and how to use tools based on conversation context + +### Metrics and Observability +- **Automatic Metrics Collection**: Built-in tracking of STT/LLM/TTS performance and usage +- **Event-Driven Logging**: `MetricsCollectedEvent` handlers for custom analytics +- **Usage Summaries**: Session-level statistics and resource consumption tracking +- **Contextual Logging**: Room and session context automatically included in log entries + +## Key Development Patterns + +### Agent Customization Approach +To modify agent behavior: +1. **Update Instructions**: Modify system prompts and behavioral guidelines +2. **Add Function Tools**: Implement `@function_tool` methods for custom capabilities +3. **Swap AI Providers**: Configure different STT/LLM/TTS providers in session setup +4. **Configure Pipeline**: Adjust turn detection, VAD, and audio processing settings + +### Testing Strategy +1. **Unit Testing**: Test individual agent functions and tool behavior +2. **LLM Evaluation**: Use `.judge()` evaluations for response quality assessment +3. **Mock External Dependencies**: Test error conditions with `mock_tools()` +4. **Conversation Testing**: Validate full dialogue flows and user experience + +### Deployment Considerations +- **Production Readiness**: Container support with Dockerfile patterns +- **Dependency Management**: Use `uv` for reproducible Python environments +- **CI/CD Integration**: Automated linting, formatting, and testing workflows +- **Environment Configuration**: Secure API key management and environment-specific settings + +## LiveKit Documentation & Examples + +The LiveKit documentation is comprehensive and provides detailed guidance for all aspects of agent development. **All documentation URLs support `.md` suffix for markdown format** and the docs follow the **llms.txt standard** for AI-friendly consumption. 
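+ +For example, a script can fetch any docs page as plain markdown by appending `.md` (hypothetical snippet; the URL is the Quick Start page listed below): +```python +import urllib.request + +# Append .md to a docs URL to retrieve the markdown source +url = "https://docs.livekit.io/agents/start/voice-ai.md" +with urllib.request.urlopen(url) as resp: +    print(resp.read().decode("utf-8")[:500]) +```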
+ +**Core Documentation**: https://docs.livekit.io/agents/ +- **Quick Start**: https://docs.livekit.io/agents/start/voice-ai/ +- **Building Agents**: https://docs.livekit.io/agents/build/ +- **Integrations**: https://docs.livekit.io/agents/integrations/ +- **Operations & Deployment**: https://docs.livekit.io/agents/ops/ + +**Practical Examples Repository**: https://github.com/livekit-examples/python-agents-examples +- Contains dozens of real-world agent implementations +- Advanced patterns and use cases beyond starter templates +- Integration examples with various AI providers and tools +- Production-ready code samples + +## AI Provider Integration Patterns + +### LLM Provider Abstraction ([docs](https://docs.livekit.io/agents/integrations/llm/)) +All LLM providers follow consistent interfaces for easy swapping: +- **OpenAI**: `openai.LLM(model="gpt-4o-mini")` ([docs](https://docs.livekit.io/agents/integrations/llm/openai/)) +- **Anthropic**: `anthropic.LLM(model="claude-3-haiku")` ([docs](https://docs.livekit.io/agents/integrations/llm/anthropic/)) +- **Google Gemini**: `google.LLM(model="gemini-1.5-flash")` ([docs](https://docs.livekit.io/agents/integrations/llm/google/)) +- **Azure OpenAI**: `azure_openai.LLM(model="gpt-4o")` ([docs](https://docs.livekit.io/agents/integrations/llm/azure-openai/)) +- **Groq**: `groq.LLM()` ([docs](https://docs.livekit.io/agents/integrations/llm/groq/)) +- **Fireworks**: `fireworks.LLM()` ([docs](https://docs.livekit.io/agents/integrations/llm/fireworks/)) +- **DeepSeek**: `deepseek.LLM()` ([docs](https://docs.livekit.io/agents/integrations/llm/deepseek/)) +- **Cerebras**: `cerebras.LLM()` ([docs](https://docs.livekit.io/agents/integrations/llm/cerebras/)) +- **Amazon Bedrock**: `bedrock.LLM()` ([docs](https://docs.livekit.io/agents/integrations/llm/bedrock/)) +- **And others**: Additional providers regularly added + +### STT Provider Options ([docs](https://docs.livekit.io/agents/integrations/stt/)) +All support low-latency multilingual transcription: +- **Deepgram**: `deepgram.STT(model="nova-3", language="multi")` ([docs](https://docs.livekit.io/agents/integrations/stt/deepgram/)) +- **AssemblyAI**: `assemblyai.STT()` ([docs](https://docs.livekit.io/agents/integrations/stt/assemblyai/)) +- **Azure AI Speech**: `azure_ai_speech.STT()` ([docs](https://docs.livekit.io/agents/integrations/stt/azure-ai-speech/)) +- **Google Cloud**: `google.STT()` ([docs](https://docs.livekit.io/agents/integrations/stt/google/)) +- **OpenAI**: `openai.STT()` ([docs](https://docs.livekit.io/agents/integrations/stt/openai/)) + +### TTS Provider Selection ([docs](https://docs.livekit.io/agents/integrations/tts/)) +High-quality, low-latency voice synthesis options: +- **Cartesia**: `cartesia.TTS(model="sonic-english")` ([docs](https://docs.livekit.io/agents/integrations/tts/cartesia/)) +- **ElevenLabs**: `elevenlabs.TTS()` ([docs](https://docs.livekit.io/agents/integrations/tts/elevenlabs/)) +- **Azure AI Speech**: `azure_ai_speech.TTS()` ([docs](https://docs.livekit.io/agents/integrations/tts/azure-ai-speech/)) +- **Amazon Polly**: `polly.TTS()` ([docs](https://docs.livekit.io/agents/integrations/tts/polly/)) +- **Google Cloud**: `google.TTS()` ([docs](https://docs.livekit.io/agents/integrations/tts/google/)) + +## Alternative Pipeline Architectures + +### OpenAI Realtime API Integration ([docs](https://docs.livekit.io/agents/integrations/realtime/openai)) +Replace entire STT-LLM-TTS pipeline with single provider: +```python +session = AgentSession( + 
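+    # A single realtime model replaces the separate STT, LLM, and TTS components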
llm=openai.realtime.RealtimeModel( +        model="gpt-4o-realtime-preview", +        voice="alloy", +        temperature=0.8, +    ) +) +``` +- **Built-in VAD**: Server or semantic turn detection modes +- **Lower Latency**: Single-provider processing reduces round-trip time +- **Unified Processing**: Supports both audio and text processing in one model + +### Advanced Turn Detection ([docs](https://docs.livekit.io/agents/build/turns/turn-detector/)) +**LiveKit Turn Detector Models**: +- **English Model**: `EnglishModel()` (66MB, ~15-45ms per turn) +- **Multilingual Model**: `MultilingualModel()` (281MB, ~50-160ms, 14 languages) +- **Enhanced Context**: Adds conversational understanding to VAD for better end-of-turn detection + +## Function Tools and Capability Extension + +### Tool Implementation Patterns +Functions decorated with `@function_tool` become available to the LLM: +```python +@function_tool +async def external_integration(self, context: RunContext, parameter: str): +    """Description of what this tool does for the LLM. + +    Args: +        parameter: Clear description for LLM understanding +    """ +    # Implementation logic (APIs, databases, computations, etc.) +    return "structured result or simple string" +``` + +### Best Practices for Tool Development +- **Async Implementation**: All tools should be async methods +- **Clear Documentation**: Docstrings guide LLM understanding and usage +- **Error Handling**: Graceful failure with informative error messages +- **Simple Returns**: Return strings or simple structured data +- **External Integration**: Connect to APIs, databases, or other services +- **Contextual Logging**: Use `logger.info()` for debugging and monitoring + +## Testing and Evaluation Strategies ([docs](https://docs.livekit.io/agents/build/testing/)) + +### LLM-Based Test Evaluation +Use LiveKit's evaluation framework for intelligent testing: +```python +@pytest.mark.asyncio +async def test_agent_capability(): +    async with AgentSession(llm=openai.LLM()) as session: +        await session.start(YourAgent()) +        result = await session.run(user_input="Test query") + +        await result.expect.next_event().is_message(role="assistant").judge( +            llm=openai.LLM(), intent="Description of expected behavior" +        ) +``` + +### Mock Tool Testing Patterns +Test error conditions and edge cases: +```python +with mock_tools(YourAgent, {"tool_name": lambda: "mocked_response"}): +    result = await session.run(user_input="test input") +``` + +### Comprehensive Test Categories +- **Core Functionality**: Primary agent capabilities work correctly +- **Tool Integration**: Function calls with proper arguments and responses +- **Error Scenarios**: Graceful handling of failures and edge cases +- **Information Accuracy**: Factual grounding and admission of limitations +- **Safety & Ethics**: Appropriate refusal of inappropriate requests + +## Metrics and Performance Monitoring ([docs](https://docs.livekit.io/agents/build/metrics/)) + +### Automatic Metrics Collection +Built-in tracking includes: +- **STT Performance**: Audio duration, transcript timing, streaming efficiency +- **LLM Metrics**: Response time, token usage, time-to-first-token (TTFT) +- **TTS Efficiency**: Audio generation time, character processing, output duration + +### Custom Metrics Implementation +```python +@session.on("metrics_collected") +def handle_metrics(ev: MetricsCollectedEvent): +    # Process built-in metrics +    metrics.log_metrics(ev.metrics) +    # Add custom analytics +    custom_tracker.record(ev.metrics) +``` + +### Usage Analytics Patterns +```python +usage_collector = 
metrics.UsageCollector() +# Collect metrics throughout session lifecycle +final_summary = usage_collector.get_summary() # Session statistics +``` + +## Frontend Integration Strategies ([docs](https://docs.livekit.io/agents/start/frontend/)) + +### Ready-to-Use Starter Templates +Complete application templates with full source code: +- **Web Applications**: React/Next.js implementations +- **Mobile Apps**: iOS/Swift, Android/Kotlin, Flutter, React Native +- **Embedded Solutions**: Web widget and iframe integrations + +### Custom Frontend Development Patterns +- **LiveKit SDK Integration**: Use platform-specific SDKs for real-time connectivity +- **Audio/Video Streaming**: Subscribe to agent tracks and transcription streams +- **WebRTC Implementation**: Handle real-time communication protocols with NAT traversal +- **Enhanced UX Features**: + - Audio visualizers and waveform displays + - Virtual avatars and character animations + - Custom UI controls and interaction patterns + - Real-time transcription overlays + - Visual feedback for agent processing states + - Interactive chat interfaces alongside voice +- **Cross-Platform Support**: Consistent experience across web, mobile, and desktop + +## Workflow Modeling and Advanced Features + +### Workflow Modeling ([docs](https://docs.livekit.io/agents/build/workflows/)) +LiveKit supports sophisticated workflow modeling for complex agent behaviors: +- **State Management**: Define agent states and transitions +- **Conditional Logic**: Implement branching conversation flows +- **Context Preservation**: Maintain conversation context across workflow steps +- **Error Recovery**: Handle failures and provide graceful fallbacks +- **Multi-Step Processes**: Guide users through complex tasks + +### Background Processing Capabilities +- **Background Audio Handling**: Process audio while maintaining conversation flow +- **Parallel Task Execution**: Handle multiple operations simultaneously +- **Context Switching**: Seamlessly transition between different conversation topics +- **Asynchronous Operations**: Non-blocking external API calls and computations + +## Advanced Integration Capabilities + +### Telephony Integration ([docs](https://docs.livekit.io/agents/start/telephony/)) +Add voice calling capabilities with SIP integration: +- **Inbound/Outbound Calling**: Handle phone-based interactions +- **SIP Protocol Support**: Industry-standard telephony integration +- **Call Management**: Handle call routing, transfers, and conferencing +- **Phone Number Provisioning**: Manage virtual phone numbers for agents + +### Production Deployment ([docs](https://docs.livekit.io/agents/ops/deployment/)) +- **LiveKit Cloud**: Managed hosting with enterprise features +- **Self-Hosting**: Container-based deployment with provided Docker configurations +- **Kubernetes Support**: Production-grade orchestration and scaling +- **Scaling Strategies**: Handle multiple concurrent sessions and load balancing +- **Security Configuration**: API key management and access control +- **High Availability**: Multi-region deployment and failover capabilities + +### Environment Configuration Standards +Required environment variables for different provider integrations: +- **Core LiveKit**: `LIVEKIT_URL`, `LIVEKIT_API_KEY`, `LIVEKIT_API_SECRET` +- **AI Providers**: Provider-specific API keys (e.g., `OPENAI_API_KEY`, `DEEPGRAM_API_KEY`) +- **Configuration Management**: Use `.env` files and secure secret management \ No newline at end of file