Prompt Caching
Supported Providers:
- OpenAI (openai/)
- Anthropic API (anthropic/)
- Google AI Studio (gemini/)
- Vertex AI (vertex_ai/, vertex_ai_beta/)
- Bedrock (bedrock/, bedrock/invoke/, bedrock/converse) (all models Bedrock supports prompt caching on)
- Deepseek API (deepseek/)
For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:
"usage": {
  "prompt_tokens": 2006,
  "completion_tokens": 300,
  "total_tokens": 2306,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0
  },
  # ANTHROPIC_ONLY #
  "cache_creation_input_tokens": 0
}
- prompt_tokens: All prompt tokens, including cache-miss and cache-hit input tokens.
- completion_tokens: The output tokens generated by the model.
- total_tokens: Sum of prompt_tokens + completion_tokens.
- prompt_tokens_details: Object containing cached_tokens.
  - cached_tokens: Tokens that were a cache hit for that call.
- completion_tokens_details: Object containing reasoning_tokens.
- ANTHROPIC_ONLY: cache_creation_input_tokens is the number of tokens that were written to cache. (Anthropic charges for this.)
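To make the relationship between these fields concrete, here is a minimal sketch that derives the cache-miss token count and the cache hit ratio from a usage object in the format above. The helper name `summarize_cache_usage` is hypothetical (it is not part of LiteLLM), and the sketch assumes the usage data is already available as a plain dict.

```python
def summarize_cache_usage(usage: dict) -> dict:
    """Derive cache statistics from an OpenAI-format usage dict.

    Hypothetical helper for illustration; not part of LiteLLM.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    # prompt_tokens_details may be missing or None when no cache info is reported.
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0

    return {
        "cached_tokens": cached_tokens,
        # prompt_tokens counts both cache-hit and cache-miss input tokens.
        "uncached_prompt_tokens": prompt_tokens - cached_tokens,
        "cache_hit_ratio": cached_tokens / prompt_tokens if prompt_tokens else 0.0,
        # Anthropic only: tokens written to the cache on this call.
        "cache_write_tokens": usage.get("cache_creation_input_tokens", 0),
    }


# Same numbers as the sample usage object above.
example_usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    "prompt_tokens_details": {"cached_tokens": 1920},
    "completion_tokens_details": {"reasoning_tokens": 0},
    "cache_creation_input_tokens": 0,
}

summary = summarize_cache_usage(example_usage)
print(summary["uncached_prompt_tokens"])  # 86
```

With the sample values, 1920 of the 2006 prompt tokens were served from cache, leaving 86 cache-miss tokens.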
Quick Start
Note: OpenAI caching is only available for prompts containing 1024 tokens or more.
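As a rough pre-check against the 1024-token minimum, the sketch below estimates prompt length from character count. The 4-characters-per-token ratio is an assumption (a common rule of thumb for English text, not an exact tokenizer), and the function names are hypothetical. For exact counts, use a real tokenizer for the target model.

```python
MIN_CACHEABLE_TOKENS = 1024  # threshold stated in the note above
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; use a real tokenizer for exact counts."""
    return len(text) // CHARS_PER_TOKEN


def likely_cacheable(text: str) -> bool:
    """Return True if the text probably meets the minimum prompt length."""
    return estimate_tokens(text) >= MIN_CACHEABLE_TOKENS


# The repeated system prompt used in the examples in this section.
system_prompt = "Here is the full text of a complex legal agreement" * 400

print(estimate_tokens(system_prompt))   # 5000
print(likely_cacheable(system_prompt))  # True
print(likely_cacheable("Hi"))           # False
```

This is why the examples repeat the system prompt 400 times: the repeated text is comfortably above the minimum length.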
SDK

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = ""  # set your OpenAI API key

# Send the same prompt twice so the second call can reuse the cached prefix.
for _ in range(2):
    response = completion(
        model="gpt-4o",
        messages=[
            # System Message (repeated so the prompt exceeds the 1024-token minimum)
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

# Inspect the usage of the last (second) response.
print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0

PROXY
- Set up config.yaml

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

- Start proxy

litellm --config /path/to/config.yaml

- Test it!
from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY",  # sk-1234
    base_url="LITELLM_PROXY_BASE",  # http://0.0.0.0:4000
)

# Send the same prompt twice so the second call can reuse the cached prefix.
for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # System Message (repeated so the prompt exceeds the 1024-token minimum)
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

# Inspect the usage of the last (second) response.
print("response=", response)
print("response.usage=", response.usage)

# The OpenAI client returns a pydantic model, so check the attribute
# directly instead of using `in`.
assert response.usage.prompt_tokens_details is not None
assert response.usage.prompt_tokens_details.cached_tokens > 0
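The final assertions assume that `prompt_tokens_details` is populated. When that is not guaranteed (for example on a request where nothing was cached), a defensive accessor avoids an `AttributeError`. The sketch below is illustrative only: `get_cached_tokens` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real response usage objects.

```python
from types import SimpleNamespace


def get_cached_tokens(usage) -> int:
    """Return cached_tokens from a usage object, or 0 if it is not reported.

    Hypothetical helper; relies only on attribute access.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    return cached or 0


# Stand-ins for the usage of a request without cache details and one with a cache hit.
first = SimpleNamespace(prompt_tokens=2006, prompt_tokens_details=None)
second = SimpleNamespace(
    prompt_tokens=2006,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1920),
)

print(get_cached_tokens(first))   # 0
print(get_cached_tokens(second))  # 1920
```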