Google Gemini prompt caching

Implicit context caching. Here is how to capture the 75% cache-read discount on real agent traffic - and the mistakes that quietly erase it.

Cache-read discount: 75%
Write premium: none
Min cacheable tokens: 2,048

How it works

Implicit caching applies discounts automatically when a request shares a long prefix with a recent one, with no breakpoints to manage. Explicit cached-content handles are also available for content you know will recur, trading setup for predictability.
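
As a rough mental model of the prefix rule, the sketch below counts how many leading tokens two requests share and treats the request as cache-eligible only when that count reaches the minimum. The list-of-strings "tokens", the `shared_prefix_len` and `cacheable_tokens` helpers, and the 2,048 default are illustrative assumptions; this is a toy model, not Gemini's actual tokenizer or matching logic.

```python
from itertools import takewhile

MIN_CACHEABLE_TOKENS = 2_048  # minimum quoted on this page; may vary by model


def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count leading tokens that are identical in both requests."""
    pairs = zip(previous, current)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


def cacheable_tokens(previous, current, minimum=MIN_CACHEABLE_TOKENS) -> int:
    """Return the shared prefix length, or 0 when it is below the minimum."""
    shared = shared_prefix_len(previous, current)
    return shared if shared >= minimum else 0


# Toy requests: a 3,000-token stable prefix followed by a short per-call suffix.
stable = ["policy"] * 3_000
first = stable + ["question", "one"]
second = stable + ["question", "two"]
# Same content, but with a volatile token placed at the very front.
shifted = ["ts=1712"] + stable + ["question", "two"]

print(cacheable_tokens(first, second))   # 3001: the prefix plus the shared "question" token
print(cacheable_tokens(first, shifted))  # 0: the very first token already differs
```

In this model a single volatile token at the front wipes out the whole match, even though almost all of the content is unchanged.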

What Zumik observes

Implicit caching is convenient, but its capture rate is the least predictable among the proprietary providers in our data: savings appear, then vanish when traffic interleaves. Zumik labels the figure anywhere from trace_estimated to provider_reported, depending on how much detail the response includes.

python - implicit + explicit
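# Implicit caching needs no breakpoints - share a long prefix
# with a recent request and the discount applies automatically.
# For content you KNOW recurs, create explicit cached content.
# Assumes the google-genai SDK and an API key in the environment.
from google import genai
from google.genai import types

client = genai.Client()
cache = client.caches.create(
    model="gemini-3-1-pro",  # model id as given here; check the current model list
    config=types.CreateCachedContentConfig(
        contents=[STABLE_CONTEXT],  # reused at the read rate
    ),
)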
Pitfall

Assuming the 2M-token window means everything is cheap. Implicit hits depend on recency and prefix overlap, not just on fitting inside the window.

Capturing Google Gemini caching.

  1. Order stable content first. Put system policy, tools, and durable context at the front of the prompt so the cacheable prefix is as long as possible.
  2. Avoid volatile content near the top. Keep timestamps, request ids, and per-call notes out of the prefix; they reset the match and drop the hit rate.
  3. Confirm the hit. Read the usage object for cached tokens to verify the prefix is being reused at the read rate.
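
The three steps above can be sketched as follows. `build_prompt` and `cache_hit_ratio` are hypothetical helpers, and `Usage` is a stand-in dataclass. Its two field names mirror the `prompt_token_count` and `cached_content_token_count` fields that Gemini responses expose under `usage_metadata`, but verify them against the SDK version you use.

```python
from dataclasses import dataclass
from typing import Optional


def build_prompt(system_policy: str, tools: str, durable_context: str, per_call: str) -> str:
    """Steps 1 and 2: stable parts first, volatile per-call text last."""
    return "\n\n".join([system_policy, tools, durable_context, per_call])


@dataclass
class Usage:
    """Stand-in for a response's usage metadata (assumed field names)."""
    prompt_token_count: int
    cached_content_token_count: Optional[int] = None  # may be absent on a miss


def cache_hit_ratio(usage: Usage) -> float:
    """Step 3: fraction of prompt tokens billed at the cache-read rate."""
    if usage.prompt_token_count <= 0:
        return 0.0
    return (usage.cached_content_token_count or 0) / usage.prompt_token_count


prompt_a = build_prompt("POLICY", "TOOLS", "CONTEXT", "request 1 at 09:00")
prompt_b = build_prompt("POLICY", "TOOLS", "CONTEXT", "request 2 at 09:05")
# Both prompts are identical up to the per-call suffix.
common = len("POLICY\n\nTOOLS\n\nCONTEXT\n\nrequest ")
assert prompt_a[:common] == prompt_b[:common]

print(cache_hit_ratio(Usage(prompt_token_count=4_000, cached_content_token_count=3_000)))  # 0.75
print(cache_hit_ratio(Usage(prompt_token_count=4_000)))  # 0.0
```

Tracking this ratio per request, rather than assuming a hit, is what reveals when a volatile field has crept into the prefix.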

Google Gemini caching, answered.

How does Google Gemini prompt caching work?

Implicit caching applies discounts automatically when a request shares a long prefix with a recent one, with no breakpoints to manage. Explicit cached-content handles are also available for content you know will recur, trading setup for predictability.

What does Google Gemini caching save?

Cache reads are about 75% cheaper than list input.
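
A worked example shows what that discount means for a single request. The sketch below assumes a placeholder list price of $1.00 per million input tokens (not a real Gemini price) and applies the 75% discount only to the cached share of the prompt; `input_cost` is a hypothetical helper.

```python
CACHE_READ_DISCOUNT = 0.75  # the cache-read discount quoted on this page


def input_cost(prompt_tokens: int, cached_tokens: int, list_price_per_million: float) -> float:
    """Input cost in dollars when cached tokens are billed at the discounted rate."""
    uncached = prompt_tokens - cached_tokens
    read_rate = list_price_per_million * (1 - CACHE_READ_DISCOUNT)
    return (uncached * list_price_per_million + cached_tokens * read_rate) / 1_000_000


PRICE = 1.00  # hypothetical dollars per million input tokens

full = input_cost(100_000, 0, PRICE)      # no cache hit
hit = input_cost(100_000, 80_000, PRICE)  # 80% of the prompt reused
print(f"{full:.4f} {hit:.4f} {1 - hit / full:.0%}")  # 0.1000 0.0400 60%
```

With 80% of the prompt served from cache, the blended saving on input is 60%, not 75%; the headline discount is only reached when the entire prompt is a cache read.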

What is the most common mistake?

Assuming the 2M-token window means everything is cheap. Implicit hits depend on recency and prefix overlap, not just on fitting inside the window.

How long does Google Gemini keep a cache warm?

Implicit caches stay warm for minutes. Explicit cached content has a configurable lifetime.

Capture Google Gemini caching automatically.

Zumik places stable content first, captures the discount, and reports how much you actually reused.