Prompt caching for Claude #234
Open: tpaulshippy wants to merge 13 commits into crmne:main from tpaulshippy:prompt-caching
+4,968 −132

13 commits by tpaulshippy:

- 2e84006 13: Failing specs
- be61e48 13: Get caching specs passing for Bedrock
- edec138 13: Remove comments in specs
- 971f176 13: Add unused param on other providers
- 557a5ee 13: Rubocop -A
- 9673b13 13: Add cassettes for bedrock cache specs
- c47d270 13: Resolve Rubocop aside from Metrics/ParameterLists
- eaf0876 13: Use large enough prompt to hit cache meaningfully
- 160d9ab 13: Ensure cache tokens are being used
- d1698bf 13: Refactor completion parameters
- 344729f 16: Add guide for prompt caching
- 7b98277 Add real anthropic cassettes ($0.03)
- fd30f14 Merge branch 'main' into prompt-caching
New file, +398 lines:

---
layout: default
title: Prompt Caching
parent: Guides
nav_order: 11
permalink: /guides/prompt-caching
---

# Prompt Caching
{: .no_toc }

Prompt caching allows you to cache frequently used content like system instructions, large documents, or tool definitions to reduce costs and improve response times for subsequent requests.
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

After reading this guide, you will know:

* What prompt caching is and when to use it.
* Which models and providers support prompt caching.
* How to cache system instructions, user messages, and tool definitions.
* How to track caching costs and token usage.
* Best practices for maximizing cache efficiency.

## What is Prompt Caching?

Prompt caching allows AI providers to store and reuse parts of your prompts across multiple requests. When you mark content for caching, the provider stores it and can reuse it in subsequent requests without reprocessing it, leading to:

- **Cost savings**: Cached content is typically charged at 75-90% less than regular input tokens
- **Faster responses**: Cached content doesn't need to be reprocessed
- **Consistent context**: Large documents or instructions remain available across conversations

{: .note }
Prompt caching is currently supported in RubyLLM only for the **Anthropic** and **Bedrock** (Anthropic models) providers. The cache is ephemeral: by default, a cache entry expires after 5 minutes without use.

Each model has its own minimum prompt size before caching takes effect, typically around 1024 tokens of content.

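A quick way to sanity-check whether content is likely to clear that threshold is a character-count heuristic. The sketch below is illustrative only: the `roughly_cacheable?` helper and the four-characters-per-token ratio are assumptions, not part of RubyLLM.

```ruby
# Rough heuristic: English prose averages about 4 characters per token.
# For exact counts, compare against the input_tokens reported on responses.
def roughly_cacheable?(content, minimum_tokens: 1024)
  content.length / 4 >= minimum_tokens
end

instructions = File.read('coding_guidelines.md')
puts "Likely to cache? #{roughly_cacheable?(instructions)}"
```
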
## Basic Usage

Enable prompt caching using the `cache_prompts` method on your chat instance:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')

# Enable caching for different types of content
chat.cache_prompts(
  system: true, # Cache system instructions
  user: true,   # Cache user messages
  tools: true   # Cache tool definitions
)
```

## Caching System Instructions

System instructions are ideal for caching when you have lengthy guidelines, documentation, or context that remains consistent across multiple conversations.

```ruby
# Large system prompt that would benefit from caching
CODING_GUIDELINES = <<~INSTRUCTIONS
  You are a senior Ruby developer and code reviewer. Follow these detailed guidelines:

  ## Code Style Guidelines
  - Use 2 spaces for indentation, never tabs
  - Keep lines under 120 characters
  - Use descriptive variable and method names
  - Prefer explicit returns in methods
  - Use single quotes for strings unless interpolation is needed

  ## Architecture Principles
  - Follow SOLID principles
  - Prefer composition over inheritance
  - Keep controllers thin, move logic to models or services
  - Use dependency injection for better testability

  ## Testing Requirements
  - Write RSpec tests for all new functionality
  - Aim for 90%+ test coverage
  - Use factories instead of fixtures
  - Mock external dependencies

  ## Security Considerations
  - Always validate and sanitize user input
  - Use strong parameters in controllers
  - Implement proper authentication and authorization
  - Never commit secrets or API keys

  ## Performance Guidelines
  - Avoid N+1 queries, use includes/joins
  - Index database columns used in WHERE clauses
  - Use background jobs for long-running tasks
  - Cache expensive computations

  [... additional detailed guidelines ...]
INSTRUCTIONS

chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions(CODING_GUIDELINES)
chat.cache_prompts(system: true)

# First request creates the cache
response = chat.ask("Review this Ruby method for potential improvements")
puts "Cache creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use the cached instructions
response = chat.ask("What are the testing requirements for this project?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```

## Caching Large Documents

When working with large documents, user message caching can significantly reduce costs:

```ruby
# Load a large document (e.g., API documentation, legal contract, research paper)
large_document = File.read('path/to/large_api_documentation.md')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.cache_prompts(user: true)

# First request with the large document
prompt = <<~PROMPT
  #{large_document}

  Based on the API documentation above, how do I authenticate with this service?
PROMPT

response = chat.ask(prompt)
puts "Document cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Follow-up questions can reference the cached document
response = chat.ask("What are the rate limits for this API?")
puts "Using cached document. Read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Show me an example of making a POST request to create a user")
puts "Still using cache. Read tokens: #{response.cache_read_input_tokens}"
```

## Caching Tool Definitions

When using multiple complex tools, caching their definitions can reduce overhead:

```ruby
# Define complex tools with detailed descriptions
class DatabaseQueryTool < RubyLLM::Tool
  description <<~DESC
    Execute SQL queries against the application database. This tool provides access to:

    - User management tables (users, profiles, permissions)
    - Content tables (posts, comments, media)
    - Analytics tables (events, metrics, reports)
    - Configuration tables (settings, features, experiments)

    Security notes:
    - Only SELECT queries are allowed
    - Results are limited to 1000 rows
    - Sensitive columns are automatically filtered
    - All queries are logged for audit purposes

    Usage examples:
    - Find active users: "SELECT * FROM users WHERE status = 'active'"
    - Get recent posts: "SELECT * FROM posts WHERE created_at > NOW() - INTERVAL 7 DAY"
    - Analyze user engagement: "SELECT COUNT(*) FROM events WHERE event_type = 'login'"
  DESC

  param :query, type: 'string', desc: 'SQL query to execute'
  param :limit, type: 'integer', desc: 'Maximum number of rows to return (default: 100)'

  def execute(query:, limit: 100)
    # Implementation here
    { results: [], count: 0 }
  end
end

class FileSystemTool < RubyLLM::Tool
  description <<~DESC
    Access and manipulate files in the application directory. Capabilities include:

    - Reading file contents (text files only)
    - Listing directory contents
    - Searching for files by name or pattern
    - Getting file metadata (size, modified date, permissions)

    Restrictions:
    - Cannot access files outside the application directory
    - Cannot modify, create, or delete files
    - Binary files are not supported
    - Maximum file size: 10MB

    Supported file types:
    - Source code (.rb, .js, .py, .java, etc.)
    - Configuration files (.yml, .json, .xml, etc.)
    - Documentation (.md, .txt, .rst, etc.)
    - Log files (.log, .out, .err)
  DESC

  param :action, type: 'string', desc: 'Action to perform: read, list, search, or info'
  param :path, type: 'string', desc: 'File or directory path'
  param :pattern, type: 'string', desc: 'Search pattern (for search action)'

  def execute(action:, path:, pattern: nil)
    # Implementation here
    { action: action, path: path, result: 'success' }
  end
end

# Set up chat with tool caching
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_tools(DatabaseQueryTool, FileSystemTool)
chat.cache_prompts(tools: true)

# First request creates the tool cache
response = chat.ask("What tables are available in the database?")
puts "Tools cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use cached tool definitions
response = chat.ask("Show me the structure of the users table")
puts "Using cached tools. Read tokens: #{response.cache_read_input_tokens}"
```

## Combining Multiple Cache Types

You can cache different types of content simultaneously for maximum efficiency:

```ruby
# Large system context
ANALYSIS_CONTEXT = <<~CONTEXT
  You are an expert data analyst working with e-commerce data. Your analysis should consider:

  ## Business Metrics
  - Revenue and profit margins
  - Customer acquisition cost (CAC)
  - Customer lifetime value (CLV)
  - Conversion rates and funnel analysis

  ## Data Quality Standards
  - Check for missing or inconsistent data
  - Validate data ranges and formats
  - Identify outliers and anomalies
  - Ensure temporal consistency

  ## Reporting Guidelines
  - Use clear, business-friendly language
  - Include confidence intervals where appropriate
  - Highlight actionable insights
  - Provide recommendations with supporting evidence

  [... extensive analysis guidelines ...]
CONTEXT

# Load large dataset
sales_data = File.read('path/to/large_sales_dataset.csv')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.with_instructions(ANALYSIS_CONTEXT)
chat.with_tools(DatabaseQueryTool, FileSystemTool)

# Enable caching for all content types
chat.cache_prompts(system: true, user: true, tools: true)

# First request caches everything
prompt = <<~PROMPT
  #{sales_data}

  Analyze the sales data above and provide insights on revenue trends.
PROMPT

response = chat.ask(prompt)
puts "All content cached:"
puts "  System cache: #{response.cache_creation_input_tokens} tokens"
puts "  Tools cached: #{chat.messages.any? { |m| m.cache_creation_input_tokens&.positive? }}"

# Follow-up requests benefit from all cached content
response = chat.ask("What are the top-performing product categories?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Query the database to get customer segmentation data")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```

## Understanding Cache Metrics

RubyLLM provides detailed metrics about cache usage in the response:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.with_instructions("Large system prompt here...") | ||
chat.cache_prompts(system: true) | ||
|
||
response = chat.ask("Your question here") | ||
|
||
# Check if cache was created (first request) | ||
if response.cache_creation_input_tokens&.positive? | ||
puts "Cache created with #{response.cache_creation_input_tokens} tokens" | ||
puts "Regular input tokens: #{response.input_tokens - response.cache_creation_input_tokens}" | ||
end | ||
|
||
# Check if cache was used (subsequent requests) | ||
if response.cache_read_input_tokens&.positive? | ||
puts "Cache read: #{response.cache_read_input_tokens} tokens" | ||
puts "New input tokens: #{response.input_tokens - response.cache_read_input_tokens}" | ||
end | ||
|
||
# Total cost calculation (example with Claude pricing) | ||
cache_creation_cost = (response.cache_creation_input_tokens || 0) * 3.75 / 1_000_000 # $3.75 per 1M tokens | ||
cache_read_cost = (response.cache_read_input_tokens || 0) * 0.30 / 1_000_000 # $0.30 per 1M tokens | ||
regular_input_cost = (response.input_tokens - (response.cache_creation_input_tokens || 0) - (response.cache_read_input_tokens || 0)) * 3.00 / 1_000_000 | ||
output_cost = response.output_tokens * 15.00 / 1_000_000 | ||
|
||
total_cost = cache_creation_cost + cache_read_cost + regular_input_cost + output_cost | ||
puts "Total request cost: $#{total_cost.round(6)}" | ||
``` | ||
|
||
## Cost Optimization

Prompt caching can significantly reduce costs for applications with repeated content:

```ruby
# Example cost comparison for Claude 3.5 Sonnet
# Regular pricing: $3.00 per 1M input tokens
# Cache creation: $3.75 per 1M tokens (25% premium)
# Cache read: $0.30 per 1M tokens (90% discount)

def calculate_savings(content_tokens, num_requests)
  # Without caching
  regular_cost = content_tokens * num_requests * 3.00 / 1_000_000

  # With caching
  cache_creation_cost = content_tokens * 3.75 / 1_000_000
  cache_read_cost = content_tokens * (num_requests - 1) * 0.30 / 1_000_000
  cached_cost = cache_creation_cost + cache_read_cost

  savings = regular_cost - cached_cost
  savings_percentage = (savings / regular_cost * 100).round(1)

  puts "Content: #{content_tokens} tokens, #{num_requests} requests"
  puts "Regular cost: $#{regular_cost.round(4)}"
  puts "Cached cost: $#{cached_cost.round(4)}"
  puts "Savings: $#{savings.round(4)} (#{savings_percentage}%)"
end

# Examples
calculate_savings(5000, 10)  # 5K tokens, 10 requests
calculate_savings(20000, 5)  # 20K tokens, 5 requests
calculate_savings(50000, 3)  # 50K tokens, 3 requests
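
# For the first example (5,000 tokens, 10 requests), the math above works
# out to roughly $0.15 without caching versus about $0.032 with caching,
# a saving of about 78.5%.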
```

## Troubleshooting

### Cache Not Working
If caching doesn't seem to be working:

1. **Check model support**: Ensure you're using a supported model
2. **Verify provider**: Only Anthropic and Bedrock support caching
3. **Check content size**: Content below the model's minimum token threshold (typically around 1024 tokens) is not cached
4. **Monitor metrics**: Use `cache_creation_input_tokens` and `cache_read_input_tokens`

```ruby
response = chat.ask("Your question")

# These fields may be nil when the provider reports no cache usage,
# so normalize with to_i before comparing
if response.cache_creation_input_tokens.to_i.zero? && response.cache_read_input_tokens.to_i.zero?
  puts "No caching occurred. Check:"
  puts "  Model: #{chat.model.id}"
  puts "  Provider: #{chat.model.provider}"
  puts "  Cache settings: #{chat.instance_variable_get(:@cache_prompts)}"
end
```

### Unexpected Cache Behavior
Cache behavior can vary based on:

- **Content changes**: Any modification invalidates the cache
- **Cache expiration**: Caches are ephemeral and expire automatically (after 5 minutes without use by default)
- **Provider limits**: Each provider has different cache policies

```ruby
# The cache is invalidated by any content change
chat.with_instructions("Original instructions")
chat.cache_prompts(system: true)
response1 = chat.ask("Question 1") # Creates cache

chat.with_instructions("Modified instructions", replace: true)
response2 = chat.ask("Question 2") # Creates new cache (old one invalidated)
```

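Expiration behaves the same way from a billing perspective: once the TTL passes, the next request simply re-creates the cache at the cache-creation rate. A minimal sketch, reusing the chat from the example above (the `sleep` is purely illustrative):

```ruby
# After the default 5-minute TTL passes with no use, the entry expires
sleep(6 * 60) # illustrative only; simulates more than 5 minutes of inactivity

response3 = chat.ask("Question 3")
# Expect cache_creation_input_tokens to be positive again
puts "Cache re-created: #{response3.cache_creation_input_tokens} tokens"
```
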
## What's Next?

Now that you understand prompt caching, explore these related topics:

* [Working with Models]({% link guides/models.md %}) - Learn about model capabilities and selection
* [Using Tools]({% link guides/tools.md %}) - Understand tool definitions that can be cached
* [Error Handling]({% link guides/error-handling.md %}) - Handle caching-related errors gracefully
* [Rails Integration]({% link guides/rails.md %}) - Use caching in Rails applications
Review comment:
I don't feel like it needs its own guide. It's a small feature that can be added in the chat guide.