A reusable handle is not a discount

Opaque state handles make reuse possible and convenient. They do not, by themselves, save a single token of prefill. Here is where the saving actually comes from.

handlesstatecaching
Published 2026-06-11

When we ship opaque handles for sessions, artifacts, and bundles, the most common reaction is the most expensive one: "great, so now the model reuses this for free." It does not. A handle is a reference, not a cache hit. Passing the same handle twice guarantees nothing about what the model recomputed.

The handle solves a different problem. It lets you avoid re-serializing large state on every request and gives the platform a stable identity to reason about. Whether that state turns into saved compute depends on what happens after the request leaves your code.

What the handle actually buys you

Three concrete things. First, you stop shipping the same megabyte of context over the wire every turn, which cuts payload and serialization cost on your side. Second, the platform can detect that two requests share a prefix family without diffing raw bytes, because the handle already tells it so. Third, it gives us a place to attach reuse metadata, so the diagnostic can report opportunity against a stable identity instead of guessing.

None of those three is a prefill discount. They are the preconditions for one.

Where the saving comes from

The actual saving lives in the provider or runtime cache. For a handle to translate into cheaper inference, the bytes it expands to have to land at the front of the prompt as a stable prefix, recur often enough to stay warm, and hit while still inside the retention window. Break any of those and the handle is just a tidy way to recompute the same tokens.

This is why we report opportunity and capture as separate numbers. A handle raises opportunity by making reusable state easy to identify and reference. Capture is still earned downstream, and it is still where most teams leak money.

The failure mode to watch

The trap is to treat the handle as an accounting fiction: "we passed the handle, so we did not pay for those tokens." We have seen bills that assumed exactly that and were wrong by a wide margin, because a timestamp in the expanded prefix was resetting the match every call. The handle was perfect; the cache never fired.

So the rule we hold to: a handle is a promise that reuse is possible, not a receipt that it happened. The receipt is provider-reported cached tokens or runtime-confirmed KV reuse, and that is what shows up on the invoice.

How to use handles well

Reference durable state by handle to cut payload redundancy, keep the expanded prefix byte-stable so caching can match it, and then read the realized-reuse number rather than assuming. If capture is low while you are diligently passing handles, the problem is almost always ordering or retention, not the handle. Fix that, and the convenience of the handle and the discount of the cache finally line up.

Turn the idea into a measurement.

Run a diagnostic on your own traffic and see the reuse waterfall this post describes.