Prompt Caching

Supported Providers:

OpenAI (openai/)
Anthropic API (anthropic/)
Google AI Studio (gemini/)
Vertex AI (vertex_ai/, vertex_ai_beta/)
Bedrock (bedrock/, bedrock/invoke/, bedrock/converse) (all models for which Bedrock supports prompt caching)
DeepSeek API (deepseek/)

For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:

"usage": {
  "prompt_tokens": 2006,
  "completion_tokens": 300,
  "total_tokens": 2306,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0
  },
  # ANTHROPIC_ONLY #
  "cache_creation_input_tokens": 0
}

prompt_tokens: All prompt tokens, including both cache-miss and cache-hit input tokens.
completion_tokens: The output tokens generated by the model.
total_tokens: Sum of prompt_tokens and completion_tokens.
prompt_tokens_details: Object containing cached_tokens.
- cached_tokens: Tokens that were a cache hit for that call.
completion_tokens_details: Object containing reasoning_tokens.
ANTHROPIC_ONLY: cache_creation_input_tokens is the number of tokens that were written to the cache. (Anthropic charges for this.)
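
As a rough illustration of how these fields relate, the sketch below derives the cache-hit ratio and the number of uncached prompt tokens from a usage object shaped like the one above. The helper name cache_hit_ratio is hypothetical and not part of LiteLLM, and the sketch assumes the usage object has already been converted to a plain dict.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Return the fraction of prompt tokens served from cache (0.0 if none)."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Values copied from the example usage object above.
usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

ratio = cache_hit_ratio(usage)
# prompt_tokens counts cache hits and misses together, so the difference
# is the number of prompt tokens that missed the cache.
uncached_tokens = usage["prompt_tokens"] - usage["prompt_tokens_details"]["cached_tokens"]

print(f"cache-hit ratio: {ratio:.1%}")
print("uncached prompt tokens:", uncached_tokens)
```

Because prompt_tokens already includes cached_tokens, the two should not be added together when totalling input tokens.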

Quick Start

Note: OpenAI caching is only available for prompts containing 1024 tokens or more.

SDK

from litellm import completion 
import os

os.environ["OPENAI_API_KEY"] = ""

for _ in range(2):
    response = completion(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0

PROXY

Setup config.yaml

model_list:
    - model_name: gpt-4o
      litellm_params:
        model: openai/gpt-4o
        api_key: os.environ/OPENAI_API_KEY

Start proxy

litellm --config /path/to/config.yaml

Test it!

from openai import OpenAI
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

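# The OpenAI client returns a pydantic model, so check the attribute directly
# instead of using `in`, which does not test field names on these objects.
assert response.usage.prompt_tokens_details is not None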
assert response.usage.prompt_tokens_details.cached_tokens > 0

OpenAI `prompt_cache_key` and `prompt_cache_retention`

OpenAI prompt caching is automatic: no cache_control message annotations are needed. Any request with 1024 or more prompt tokens is eligible for caching.

OpenAI also supports two optional parameters for more control over caching behavior:

- prompt_cache_key (string): A routing hint that improves cache hit rates for requests sharing long common prefixes. Requests with the same cache key are routed to the same backend, increasing the likelihood of a cache hit.
- prompt_cache_retention ("in_memory" or "24h"): Controls the cache TTL. The default is "in_memory" (5–10 min). Set it to "24h" for extended caching that offloads KV tensors to GPU-local storage.
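
One possible way to pick a prompt_cache_key is to derive it from the prefix that a group of requests shares, so that requests with the same long prefix get the same key. The sketch below is only an illustration of that idea: the helper make_prompt_cache_key and its namespace argument are hypothetical, and a fixed hand-written string such as "legal-doc-analysis" works just as well.

```python
import hashlib


def make_prompt_cache_key(shared_prefix: str, namespace: str = "app") -> str:
    """Build a short, deterministic key from the prefix that requests share."""
    digest = hashlib.sha256(shared_prefix.encode("utf-8")).hexdigest()[:16]
    return f"{namespace}-{digest}"


system_prompt = "You are an AI assistant tasked with analyzing legal documents. "
agreement_text = "Here is the full text of a complex legal agreement " * 400

# Two requests over the same document produce the same key.
key_a = make_prompt_cache_key(system_prompt + agreement_text, namespace="legal")
key_b = make_prompt_cache_key(system_prompt + agreement_text, namespace="legal")

# A request over a different document produces a different key.
key_c = make_prompt_cache_key(system_prompt + "A different agreement", namespace="legal")

print(key_a == key_b)
print(key_a == key_c)
```

The resulting string can then be passed as prompt_cache_key. Hashing only the shared prefix, and not the user question that follows it, keeps the key stable across follow-up questions about the same document.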

SDK

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = ""

response = completion(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant tasked with analyzing legal documents. "
            + "Here is the full text of a complex legal agreement " * 400,
        },
        {
            "role": "user",
            "content": "What are the key terms and conditions?",
        },
    ],
    prompt_cache_key="legal-doc-analysis",
    prompt_cache_retention="24h",
)
print(response.usage)

PROXY

from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY",
    base_url="LITELLM_PROXY_BASE",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant tasked with analyzing legal documents. "
            + "Here is the full text of a complex legal agreement " * 400,
        },
        {
            "role": "user",
            "content": "What are the key terms and conditions?",
        },
    ],
    extra_body={
        "prompt_cache_key": "legal-doc-analysis",
        "prompt_cache_retention": "24h",
    },
)
print(response.usage)

Anthropic Example

Anthropic charges for cache writes.

Specify the content to cache with "cache_control": {"type": "ephemeral"}.

This same format also works for Gemini / Vertex AI. For other providers, it will be ignored.
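
If the messages are built programmatically, the annotation can be added in one place before the request is sent. The following is a minimal sketch, assuming OpenAI-style message dicts; mark_system_for_caching is a hypothetical helper, not a LiteLLM function.

```python
import copy


def mark_system_for_caching(messages: list[dict]) -> list[dict]:
    """Return a copy with cache_control on the last content block of each system message."""
    marked = copy.deepcopy(messages)  # leave the caller's messages untouched
    for message in marked:
        if message.get("role") != "system":
            continue
        content = message["content"]
        # Normalize a plain string into the list-of-blocks form.
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
            message["content"] = content
        if content:
            content[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


messages = [
    {
        "role": "system",
        "content": "Here is the full text of a complex legal agreement" * 400,
    },
    {
        "role": "user",
        "content": "what are the key terms and conditions in this agreement?",
    },
]

marked = mark_system_for_caching(messages)
print(marked[0]["content"][-1]["cache_control"])
```

The returned list can be passed as messages to completion(). Only system messages are annotated here; user and assistant messages are returned unchanged.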

SDK

from litellm import completion 
import litellm 
import os 

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = "" 

response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

PROXY

Setup config.yaml

model_list:
    - model_name: claude-3-5-sonnet-20240620
      litellm_params:
        model: anthropic/claude-3-5-sonnet-20240620
        api_key: os.environ/ANTHROPIC_API_KEY

Start proxy

litellm --config /path/to/config.yaml

Test it!

from openai import OpenAI 
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

Google AI Studio / Vertex AI (Gemini) Example

Use the same Anthropic-style cache_control format. LiteLLM automatically translates it to Google's context caching API.

How it works under the hood:

- Messages with cache_control are separated and sent to Google's cachedContents API.
- The cached content ID is then passed as cachedContent in the Gemini request body.
- This works across all three providers: gemini/ (Google AI Studio), vertex_ai/, and vertex_ai_beta/.
- The cached content must contain at least 1024 tokens; below that, caching is silently skipped.
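
To make the first step concrete, here is a simplified sketch of partitioning a message list into the part that would be cached and the part that would be sent with the request. It is an illustration only and is not LiteLLM's actual implementation; the function names are hypothetical, and the real translation also has to create the cached content and attach its ID.

```python
def has_cache_control(message: dict) -> bool:
    """True if the message, or any of its content blocks, carries cache_control."""
    if "cache_control" in message:
        return True
    content = message.get("content")
    if isinstance(content, list):
        return any(
            isinstance(block, dict) and "cache_control" in block for block in content
        )
    return False


def split_for_context_cache(messages: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition messages into (to_cache, to_send), preserving their order."""
    to_cache = [m for m in messages if has_cache_control(m)]
    to_send = [m for m in messages if not has_cache_control(m)]
    return to_cache, to_send


messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are an AI assistant."},
            {
                "type": "text",
                "text": "Here is the full text of a complex legal agreement" * 400,
                "cache_control": {"type": "ephemeral"},
            },
        ],
    },
    {"role": "user", "content": "what are the key terms and conditions?"},
]

to_cache, to_send = split_for_context_cache(messages)
print(len(to_cache), "message(s) to cache;", len(to_send), "message(s) to send")
```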

SDK

from litellm import completion
import os

os.environ["GEMINI_API_KEY"] = ""

response = completion(
    model="gemini/gemini-2.5-flash",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)

PROXY

Setup config.yaml

model_list:
    - model_name: gemini-2.5-flash
      litellm_params:
        model: gemini/gemini-2.5-flash
        api_key: os.environ/GEMINI_API_KEY

Start proxy

litellm --config /path/to/config.yaml

Test it!

from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY",  # sk-1234
    base_url="LITELLM_PROXY_BASE",  # http://0.0.0.0:4000
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)

Vertex AI

For Vertex AI, use the vertex_ai/ prefix:

SDK

from litellm import completion

response = completion(
    model="vertex_ai/gemini-2.5-flash",
    vertex_project="my-gcp-project",
    vertex_location="us-central1",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)

PROXY

Setup config.yaml

model_list:
    - model_name: gemini-2.5-flash
      litellm_params:
        model: vertex_ai/gemini-2.5-flash
        vertex_project: my-gcp-project
        vertex_location: us-central1

Start proxy

litellm --config /path/to/config.yaml

Test it!

from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY",  # sk-1234
    base_url="LITELLM_PROXY_BASE",  # http://0.0.0.0:4000
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)

DeepSeek Example

DeepSeek caching works the same way as OpenAI's: it is automatic, and no cache_control annotations are needed.

import litellm
import os

os.environ["DEEPSEEK_API_KEY"] = "" 

litellm.set_verbose = True # 👈 SEE RAW REQUEST

model_name = "deepseek/deepseek-chat"
messages_1 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Qing Dynasty?",
    },
]

message_2 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "When did the Shang Dynasty fall?"},
]

response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=message_2)

# Add any assertions here to check the response
print(response_2.usage)
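
The second request can hit the cache because it starts with the same messages as the first. As a rough way to see how much two conversations share, the sketch below counts their identical leading messages. The helper shared_prefix_length is hypothetical, and comparing whole messages is only an approximation, since providers match cached prefixes at the token level.

```python
def shared_prefix_length(first: list[dict], second: list[dict]) -> int:
    """Count the leading messages that are identical in both conversations."""
    count = 0
    for a, b in zip(first, second):
        if a != b:
            break
        count += 1
    return count


# Shortened versions of the two conversations above: they differ only
# in the final user question.
conversation_1 = [
    {"role": "system", "content": "You are a history expert."},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founding emperor of the Qing Dynasty?"},
]
conversation_2 = [
    {"role": "system", "content": "You are a history expert."},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "When did the Shang Dynasty fall?"},
]

shared = shared_prefix_length(conversation_1, conversation_2)
print("shared leading messages:", shared)
```

Keeping the stable content (system prompt, earlier turns) at the front and the changing content at the end maximizes the shared prefix.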

Calculate Cost

The cost of cache-hit prompt tokens can differ from the cost of cache-miss prompt tokens.

Use the completion_cost() function to calculate cost; it also handles prompt caching cost calculation. See more helper functions.

cost = completion_cost(completion_response=response, model=model)
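
To show why cached and uncached prompt tokens have to be priced separately, here is a worked arithmetic sketch. The per-token prices are hypothetical values chosen only to make the numbers easy to follow, and estimate_cost is an illustrative helper, not LiteLLM's completion_cost(). It also ignores cache-write charges such as Anthropic's cache_creation_input_tokens.

```python
def estimate_cost(
    usage: dict,
    input_price: float,
    cached_input_price: float,
    output_price: float,
) -> float:
    """Estimate the request cost in dollars; all prices are dollars per token."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    # prompt_tokens includes cache hits, so subtract them to get the misses.
    uncached = usage["prompt_tokens"] - cached
    return (
        uncached * input_price
        + cached * cached_input_price
        + usage["completion_tokens"] * output_price
    )


# Hypothetical prices (dollars per token), for illustration only.
INPUT_PRICE = 2.50 / 1_000_000
CACHED_INPUT_PRICE = 1.25 / 1_000_000
OUTPUT_PRICE = 10.00 / 1_000_000

usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

with_cache = estimate_cost(usage, INPUT_PRICE, CACHED_INPUT_PRICE, OUTPUT_PRICE)
# Pricing cached tokens at the full input rate shows the cost without caching.
without_cache = estimate_cost(usage, INPUT_PRICE, INPUT_PRICE, OUTPUT_PRICE)

print(f"with cache:    ${with_cache:.6f}")
print(f"without cache: ${without_cache:.6f}")
```

In practice, prefer completion_cost(), which uses LiteLLM's maintained per-model prices.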

Usage

SDK

from litellm import completion, completion_cost
import litellm 
import os 

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = "" 
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
    model=model,
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

cost = completion_cost(completion_response=response, model=model) 

formatted_string = f"${float(cost):.10f}"
print(formatted_string)

PROXY

LiteLLM returns the calculated cost in the x-litellm-response-cost response header.

from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234..
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.with_raw_response.create(
    messages=[{
        "role": "user",
        "content": "Say this is a test",
    }],
    model="gpt-3.5-turbo",
)
print(response.headers.get('x-litellm-response-cost'))

completion = response.parse()  # get the object that `chat.completions.create()` would have returned
print(completion)

Check Model Support

Check whether a model supports prompt caching with supports_prompt_caching().

SDK

from litellm.utils import supports_prompt_caching

supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")

assert supports_pc

PROXY

Use the /model/info endpoint to check whether a model on the proxy supports prompt caching.

Setup config.yaml

model_list:
    - model_name: claude-3-5-sonnet-20240620
      litellm_params:
        model: anthropic/claude-3-5-sonnet-20240620
        api_key: os.environ/ANTHROPIC_API_KEY

Start proxy

litellm --config /path/to/config.yaml

Test it!

curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
-H 'Authorization: Bearer sk-1234'

Expected Response

{
    "data": [
        {
            "model_name": "claude-3-5-sonnet-20240620",
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-20240620"
            },
            "model_info": {
                "key": "claude-3-5-sonnet-20240620",
                ...
                "supports_prompt_caching": true # 👈 LOOK FOR THIS!
            }
        }
    ]
}

This checks LiteLLM's maintained model info/cost map.
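
Once the response has been parsed into a dict, the flag can be read programmatically. The sketch below assumes a response shaped like the expected response above; the helper name and the second model entry are made up for illustration.

```python
def models_supporting_prompt_caching(model_info_response: dict) -> list[str]:
    """Return the model_name of every entry whose model_info enables prompt caching."""
    names = []
    for entry in model_info_response.get("data", []):
        info = entry.get("model_info") or {}
        if info.get("supports_prompt_caching") is True:
            names.append(entry["model_name"])
    return names


# Sample payload in the shape of the expected response above.
# "my-other-model" is a made-up entry used to show a model without support.
sample_response = {
    "data": [
        {
            "model_name": "claude-3-5-sonnet-20240620",
            "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20240620"},
            "model_info": {
                "key": "claude-3-5-sonnet-20240620",
                "supports_prompt_caching": True,
            },
        },
        {
            "model_name": "my-other-model",
            "litellm_params": {"model": "my-other-model"},
            "model_info": {"key": "my-other-model", "supports_prompt_caching": False},
        },
    ]
}

supported = models_supporting_prompt_caching(sample_response)
print(supported)
```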

Auto-Inject Prompt Caching

Want LiteLLM to automatically add cache_control directives without modifying your code?

See the Auto-Inject Prompt Caching Tutorial to learn how to use cache_control_injection_points to automatically cache system messages, specific messages by index, or custom injection patterns.
