When hierarchical KV storage improves TTFT, and when it just adds tiers

For long prompts, time-to-first-token (TTFT) is dominated by prefill: the model has to process the whole prompt before it emits anything. A KV hierarchy promises to shortcut that by keeping computed attention state around and reusing it. When it works, the first token comes back fast because most of the prefix was never recomputed. When it does not, you have built three storage tiers that move the latency problem around without solving it.

The difference comes down to whether the workload has the properties the hierarchy needs.

TTFT and reuse are the same lever here

A cached prefix only cuts TTFT if the cache hits, and it only hits if the reusable block recurs inside the retention window and lands on a node that holds it. That is retention locality and routing affinity again, viewed through a latency lens instead of a cost one. High recurrence with tight locality means the hot tier serves most requests and TTFT drops hard. Weak recurrence means you are constantly promoting cold blocks up the hierarchy, paying transfer cost on the critical path, and watching the first token arrive late anyway.

So the prerequisite for a TTFT win is the same prerequisite as a cost win: there has to be real, local, in-window reuse to capture.
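To make the dependence on hit rate concrete, here is a minimal sketch of expected TTFT under a two-outcome model: either the reusable prefix hits and only the remaining suffix is prefilled, or it misses and the whole prompt is recomputed. The function name, the parameters, and the assumption that prefill cost is linear in tokens are illustrative simplifications, not measurements from any particular serving stack.

```python
def expected_ttft_ms(prompt_tokens, reusable_tokens, hit_rate,
                     prefill_ms_per_token, hit_fetch_ms):
    """Expected TTFT when a request either hits the cached prefix or recomputes.

    Assumes (hypothetically) that prefill cost is linear in token count and
    that a hit costs a fixed fetch time plus prefill of the uncached suffix.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if not 0 <= reusable_tokens <= prompt_tokens:
        raise ValueError("reusable_tokens must be between 0 and prompt_tokens")

    # Miss: the whole prompt is prefilled from scratch.
    miss_ms = prompt_tokens * prefill_ms_per_token
    # Hit: pay the fetch, then prefill only the tokens after the cached prefix.
    hit_ms = hit_fetch_ms + (prompt_tokens - reusable_tokens) * prefill_ms_per_token

    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


if __name__ == "__main__":
    # Illustrative numbers only: 8000-token prompt, 7000 reusable tokens.
    for rate in (0.1, 0.5, 0.9):
        print(rate, expected_ttft_ms(8000, 7000, rate, 0.25, 5.0))
```

With these made-up inputs the expected value moves almost in proportion to the hit rate, which is the point: the storage layout matters far less than how often the reusable block is actually there when the request arrives.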

The tiers each have a latency price

GPU-memory hits are the only ones that are unambiguously fast. Host RAM is slower but usually still beats recompute for long prefixes. NVMe and remote stores are where the trade-off can flip: a fetch from a cold tier can take longer than just prefilling the prompt, especially for short or medium prefixes, where the fixed cost of the fetch tends to outweigh the compute it saves. A hierarchy that aggressively demotes blocks to cold storage can hurt TTFT on exactly the requests that fall through to the cold tier.

The good designs keep the hot path short and accept that cold blocks are sometimes cheaper to recompute than to retrieve.
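The fetch-or-recompute decision can be sketched as a comparison of two estimates. Everything below is an assumption for illustration: the `Tier` fields, the per-token KV size, the prefill throughput, and the simple model of fetch time as fixed latency plus size over bandwidth. Real numbers depend on the model, the hardware, and the interconnect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    latency_ms: float       # fixed per-fetch overhead (hypothetical)
    bandwidth_gb_s: float   # sustained transfer rate toward the GPU (hypothetical)


def fetch_ms(tier, prefix_tokens, kv_bytes_per_token):
    """Estimated time to pull a cached prefix from a tier."""
    size_gb = prefix_tokens * kv_bytes_per_token / 1e9
    return tier.latency_ms + size_gb / tier.bandwidth_gb_s * 1000.0


def recompute_ms(prefix_tokens, prefill_tokens_per_s):
    """Estimated time to prefill the same prefix from scratch."""
    return prefix_tokens / prefill_tokens_per_s * 1000.0


def should_fetch(tier, prefix_tokens, kv_bytes_per_token, prefill_tokens_per_s):
    """True when retrieving the block is estimated to beat recomputing it."""
    return (fetch_ms(tier, prefix_tokens, kv_bytes_per_token)
            < recompute_ms(prefix_tokens, prefill_tokens_per_s))


if __name__ == "__main__":
    # Made-up tier and workload parameters, chosen only to show the crossover.
    remote = Tier("remote", latency_ms=50.0, bandwidth_gb_s=2.0)
    for tokens in (500, 4000):
        print(tokens, should_fetch(remote, tokens, 100_000, 10_000))
```

Under these invented parameters the remote tier loses for the 500-token prefix and wins for the 4000-token one, because the fixed fetch latency is only amortized over a long enough prefix. A scheduler built on a check like this would recompute short cold prefixes rather than wait on the fetch.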

When it is worth the operational weight

A KV hierarchy earns its complexity when prefill is expensive, recurrence is strong, retention locality is good, and TTFT is on a real SLO that provider-managed caching cannot meet. That is a narrow intersection, and it overlaps closely with the conditions that justify BYOC (bring your own cloud) at all. If your workload does not sit in that intersection, provider-native caching plus good prompt ordering will get you most of the TTFT improvement with none of the tier-management burden.

Score the workload before you build the hierarchy. The reuse signals that predict cost savings predict latency savings too, and if they are weak, more storage tiers will not rescue your first-token time.
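One way to score a workload is to replay a request log and count how often a prefix recurs inside the retention window, and how often that recurrence lands on the node that served it last. The sketch below assumes a hypothetical log of `(timestamp_s, prefix_id, node)` tuples sorted by time, and it uses "same node as the previous sighting" as a rough stand-in for "lands on a node that holds it". The function and field names are illustrative.

```python
def reuse_signals(events, window_s):
    """Estimate recurrence and routing affinity from a request log.

    events: iterable of (timestamp_s, prefix_id, node), sorted by timestamp.
    window_s: assumed retention window in seconds.
    """
    last_seen = {}  # prefix_id -> (timestamp_s, node) of the previous request
    total = in_window = same_node = 0

    for ts, prefix, node in events:
        total += 1
        prev = last_seen.get(prefix)
        if prev is not None and ts - prev[0] <= window_s:
            in_window += 1          # prefix recurred inside the window
            if prev[1] == node:
                same_node += 1      # and was routed to the same node
        last_seen[prefix] = (ts, node)

    return {
        # Share of requests whose prefix was seen within the window.
        "recurrence": in_window / total if total else 0.0,
        # Share of those recurrences that landed on the same node.
        "affinity": same_node / in_window if in_window else 0.0,
        # Rough upper bound on the local hit rate this log could support.
        "local_hit_ceiling": same_node / total if total else 0.0,
    }


if __name__ == "__main__":
    log = [(0, "a", "n1"), (10, "a", "n1"), (20, "a", "n2"),
           (30, "b", "n1"), (500, "a", "n2")]
    print(reuse_signals(log, window_s=60))
```

If the ceiling this produces is low on real traffic, no arrangement of tiers will raise it. The log is describing a workload with little local, in-window reuse to capture.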
