When fetching a remote KV cache loses to just recomputing it

Pulling a cached KV block over the network is only a win when the transfer is cheaper than the prefill it replaces. Often it is not.

kv-cache, byoc, latency
Published 2026-05-31

Tiered KV storage sounds unambiguously good: keep hot blocks in GPU memory, warm ones in host RAM, cold ones on NVMe or a remote store, and pull them back when needed instead of recomputing. The catch is the pull. Moving a large KV block across a network is not free, and there is a crossover point where recomputing the prefix from scratch is simply faster and cheaper than fetching the cached version.

Treating remote KV as always-a-win is how teams build a cache hierarchy that makes tail latency worse.

The transfer has a cost curve

A KV cache for a long prefix is large - it scales with tokens, layers, and model width. Fetching it over even a fast interconnect takes time and bandwidth, and that time competes directly with the prefill it is meant to save. For a short prefix, recompute is trivial and the fetch is pure overhead. For a very long prefix on a fast link, recompute cost dominates and the fetch wins. On a slow link, the transfer can lose even when the prefix is long. The decision lives on a curve, not a flag.
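A back-of-the-envelope model makes the curve concrete. The sketch below is illustrative only: `ModelShape`, the function names, and every parameter are hypothetical, and it assumes prefill time is linear in tokens (real prefill grows faster on long prefixes because of attention) and that the link is described by a fixed latency plus a constant bandwidth.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions; substitute your own."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 KV entries


def kv_bytes(shape: ModelShape, tokens: int) -> int:
    # Factor of 2: one K and one V tensor per layer.
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_elem
    return per_token * tokens


def fetch_seconds(shape, tokens, link_bytes_per_s, link_latency_s):
    # Fixed round-trip latency plus serialization time at constant bandwidth.
    return link_latency_s + kv_bytes(shape, tokens) / link_bytes_per_s


def recompute_seconds(tokens, prefill_tokens_per_s):
    # Simplifying assumption: prefill cost is linear in prefix length.
    return tokens / prefill_tokens_per_s


def crossover_tokens(shape, link_bytes_per_s, link_latency_s, prefill_tokens_per_s):
    """Prefix length at or above which fetch is no slower than recompute.

    Returns None when, under this linear model, the link is too slow
    for the fetch to ever catch up.
    """
    per_token_fetch = kv_bytes(shape, 1) / link_bytes_per_s
    per_token_recompute = 1.0 / prefill_tokens_per_s
    gain = per_token_recompute - per_token_fetch
    if gain <= 0:
        return None
    return math.ceil(link_latency_s / gain)
```

Under this model, a link whose per-token transfer time exceeds the per-token prefill time never reaches a crossover, which is the "fetch is pure overhead" case. The numbers you plug in should come from measurement, not from this formula.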

The honest version of this feature measures both sides per request class and routes accordingly, rather than assuming the cache is always the answer.
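One way to route per request class is to keep observed latencies for both paths and pick whichever has the better tail. This is a minimal sketch, not a production router: the class name, the `"fetch"`/`"recompute"` labels, the p95 criterion, and the `min_samples` default are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


class FetchOrRecomputeRouter:
    """Chooses a path per request class from measured latencies."""

    PATHS = ("fetch", "recompute")

    def __init__(self, min_samples=20):
        self.min_samples = min_samples
        # request_class -> {"fetch": [seconds...], "recompute": [seconds...]}
        self._latencies = defaultdict(lambda: {"fetch": [], "recompute": []})

    def record(self, request_class, path, seconds):
        if path not in self.PATHS:
            raise ValueError(f"unknown path: {path!r}")
        self._latencies[request_class][path].append(seconds)

    def choose(self, request_class):
        observed = self._latencies[request_class]
        enough = min(len(observed["fetch"]), len(observed["recompute"])) >= self.min_samples
        if not enough:
            # Without evidence for both paths, do not assume the cache wins.
            return "recompute"
        if p95(observed["fetch"]) < p95(observed["recompute"]):
            return "fetch"
        return "recompute"
```

A request class here could be a prefix-length bucket combined with a link tier. Because the router defaults to recompute until it has samples for both paths, the fetch samples have to come from somewhere else, such as a replay run or shadow traffic.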

Where remote KV actually pays

It pays when prefill is genuinely expensive - very long stable prefixes, reasoning models with heavy prompts - and the transfer path is fast and local, like the same rack or the same host. It pays when the alternative is a cold recompute on a TTFT-sensitive path where the latency budget is tight. And it pays when GPU time is the scarce resource and you would rather spend network than compute.

Outside those conditions, the cache tier is bookkeeping that slows you down.

Measure it, do not assume it

This is exactly the kind of claim that belongs in a replay run, not a design doc. Take the captured workload, compare the recompute baseline against the remote-fetch candidate on real prefix sizes and real link latency, and read the numbers. "Remote KV cut TTFT by a measured margin on prefixes over N tokens" is a decision. "Caching is faster" is not, because for some share of your traffic it may be the opposite.
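The replay comparison can be as simple as bucketing requests by prefix length and comparing median TTFT on each path. The sketch below assumes each replayed request yields a `(prefix_tokens, recompute_ttft_s, fetch_ttft_s)` tuple; that record shape, the function name, and the bucket edges are hypothetical.

```python
from bisect import bisect_right
from statistics import median


def summarize_replay(records, bucket_edges):
    """Group replay results by prefix-length bucket and name the faster path.

    records: iterable of (prefix_tokens, recompute_ttft_s, fetch_ttft_s)
    bucket_edges: token counts that split the buckets, e.g. [1000, 4000]
    """
    edges = sorted(bucket_edges)
    buckets = {}
    for prefix_tokens, recompute_s, fetch_s in records:
        idx = bisect_right(edges, prefix_tokens)  # 0 means below the first edge
        buckets.setdefault(idx, []).append((recompute_s, fetch_s))

    rows = []
    for idx in sorted(buckets):
        pairs = buckets[idx]
        med_recompute = median(r for r, _ in pairs)
        med_fetch = median(f for _, f in pairs)
        rows.append({
            "min_tokens": edges[idx - 1] if idx > 0 else 0,
            "requests": len(pairs),
            "median_recompute_s": med_recompute,
            "median_fetch_s": med_fetch,
            "winner": "fetch" if med_fetch < med_recompute else "recompute",
        })
    return rows
```

The lowest `min_tokens` at which `winner` flips to `"fetch"` is a first estimate of N for that link. Medians hide the tail, so a version aimed at tail latency would compare a high percentile instead.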

The cache hierarchy is a tool with a domain of validity. Find the crossover, route around it, and stop paying network to avoid compute that was cheaper all along.
