Guide · caching

Prompt caching, captured - not just configured.

Every provider caches prompts, but the discount only lands if you place stable content correctly and respect each scheme's quirks. These guides explain how to capture the savings, and the small mistakes that quietly erase them.

Core idea

Caching reuses the KV state of a repeated prefix so it is billed at a reduced read rate instead of recomputed. The universal rule across every scheme: keep stable content at the front, push volatile content to the end. See the KV cache and prompt caching definitions.

OpenAI

−75% read

Automatic prefix caching.

Injecting a timestamp, request id, or per-call system note near the top of the prompt resets the prefix and silently drops the hit rate to near zero.

Read the guide

Anthropic

−90% read

Explicit cache_control breakpoints.

Putting a breakpoint after volatile content, or breaking on every turn, means you pay write premiums repeatedly and never recover them.

Read the guide

Google Gemini

−75% read

Implicit context caching.

Assuming the 2M-token window means everything is cheap. Implicit hits depend on recency and prefix overlap, not just on fitting inside the window.

Read the guide

xAI

−75% read

Context caching.

Treating xAI like OpenAI for background jobs - there is no 50% batch discount to fall back on, so non-interactive work should usually route elsewhere.

Read the guide

Fireworks AI

−74% read

Automatic prompt caching (serverless and dedicated).

Running a hot, reuse-heavy lane on serverless and blaming the model for cold-cache latency, when a dedicated deployment would hold the prefix warm.

Read the guide

Scheme types

Four ways providers cache.

Automatic

Caches any prefix over a threshold with no markup. Forgiving, capped discount. (OpenAI, Fireworks)

Explicit

You place cache_control breakpoints. Highest discount, write premium. (Anthropic)

Implicit

Discounts apply on recent prefix overlap, no breakpoints. Convenient, variable. (Gemini)

Context

Reuses a cached context across consecutive requests. (xAI)

Measure your real capture rate.

See how much of each scheme's discount your workload actually lands - with provider-reported evidence.

Capture benchmark Run a diagnostic