OpenAI prompt caching

Automatic prefix caching. Here is how to capture the 75% cache-read discount on real agent traffic - and the mistakes that quietly erase it.

75%
Cache-read discount
none
Write premium
1,024
Min cacheable tokens

How it works

Caching is automatic for prompts at or above 1,024 tokens. The longest matching prefix is reused, billed at the cache-read rate, and reported back as cached tokens in the usage object. Keeping stable content at the front of the request is what makes the prefix match.

What Zumik observes

Across our corpus, OpenAI returns provider-reported cached-token counts on most eligible requests, which gives Zumik the strongest evidence level (provider_reported) for capture without any runtime instrumentation.

python - keep stable content first
# OpenAI caches automatically >= 1024 tokens.
# Just order stable -> volatile.
messages = [
    {"role": "system", "content": STABLE_POLICY},     # cached
    {"role": "system", "content": TOOL_DEFINITIONS},  # cached
    {"role": "user", "content": latest_message},      # changes
]
# NEVER put a timestamp or request id in the system block.
Pitfall

Injecting a timestamp, request id, or per-call system note near the top of the prompt resets the prefix and silently drops the hit rate to near zero.

Capturing OpenAI caching.

  1. Order stable content first. Put system policy, tools, and durable context at the front of the prompt so the cacheable prefix is as long as possible.
  2. Avoid volatile content near the top. Keep timestamps, request ids, and per-call notes out of the prefix; they reset the match and drop the hit rate.
  3. Confirm the hit. Read the usage object for cached tokens to verify the prefix is being reused at the read rate.
The full prompt-ordering playbook

OpenAI caching, answered.

How does OpenAI prompt caching work?

Caching is automatic for prompts at or above 1,024 tokens. The longest matching prefix is reused, billed at the cache-read rate, and reported back as cached tokens in the usage object. Keeping stable content at the front of the request is what makes the prefix match.

What does OpenAI caching save?

Cache reads are about 75% cheaper than list input.

What is the most common mistake?

Injecting a timestamp, request id, or per-call system note near the top of the prompt resets the prefix and silently drops the hit rate to near zero.

How long does OpenAI keep a cache warm?

Minutes idle, up to ~24h on extended retention

Capture OpenAI caching automatically.

Zumik places stable content first, captures the discount, and reports how much you actually reused.