TTFT savings from a warm prefix

How much faster is time-to-first-token when the prefix is already cached?

180k paired requests | Last run 2026-06-08

Prefill dominates first-token latency for long agent prompts. This suite measures cold versus cache-warm TTFT for each catalog model. That is the latency half of the case for prefix reuse; token cost is the other half.
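As a rough illustration of what "TTFT" means operationally, the sketch below times the gap between starting a streamed request and receiving its first token. It is not the suite's harness: `measure_ttft_ms` and `fake_stream` are hypothetical names, and the fake stream only sleeps to imitate prefill rather than calling any real model API.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Return milliseconds from request start until the first streamed token."""
    t0 = time.perf_counter()
    stream: Iterator[str] = iter(start_stream())
    next(stream)  # blocks until the first token arrives
    return (time.perf_counter() - t0) * 1000.0


def fake_stream(prefill_seconds: float) -> Callable[[], Iterator[str]]:
    """Hypothetical stand-in for a streaming client; sleeps to imitate prefill."""
    def start() -> Iterator[str]:
        time.sleep(prefill_seconds)  # assumed prefill delay, not a measurement
        yield "first"
        yield "second"
    return start


# Imitate one cold/warm pair: a longer delay for an uncached prefix,
# a shorter one for a cached prefix.
cold_ms = measure_ttft_ms(fake_stream(0.08))
warm_ms = measure_ttft_ms(fake_stream(0.01))
print(f"cold {cold_ms:.0f} ms, warm {warm_ms:.0f} ms")
```

With a real client, `start_stream` would be replaced by whatever call opens the streaming response, so that the timer covers request dispatch as well as prefill.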

What the corpus shows.

Model                     Cold TTFT   Warm TTFT   Reduction   Note
Claude Fable 5            690 ms      210 ms      70%         -
GPT-5.5                   720 ms      240 ms      67%         -
Gemini 3.1 Pro            820 ms      360 ms      56%         -
Grok 4                    640 ms      280 ms      56%         -
DeepSeek V4 (Fireworks)   350 ms      130 ms      63%         -

Takeaways
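  • A warm prefix cuts median first-token latency by 56% to 70%, roughly half to two-thirds.
  • Open-weights models on dedicated capacity start lower and stay lower: DeepSeek V4 (Fireworks) has the lowest cold (350 ms) and warm (130 ms) TTFT in the table.
  • The TTFT savings alone can justify prefix reuse, even when token cost is already low.
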
Methodology

Each pair is a cold request and a warm request for an identical stable prefix, sampled across regions during steady-state load. We report median TTFT and the cold-to-warm reduction, computed on the medians as (cold - warm) / cold.
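The arithmetic behind the Reduction column can be sketched as follows. The function names are illustrative, and the code assumes the reduction is taken on the two medians rather than averaged per pair, which is consistent with the published figures. The medians are copied from the results table.

```python
from statistics import median


def reduction(cold_ms: float, warm_ms: float) -> float:
    """Fractional TTFT reduction going from cold to warm."""
    return (cold_ms - warm_ms) / cold_ms


def summarize(pairs):
    """pairs: iterable of (cold_ms, warm_ms) samples for one model.

    Returns (median cold, median warm, reduction computed on the medians).
    """
    pairs = list(pairs)
    cold = median(c for c, _ in pairs)
    warm = median(w for _, w in pairs)
    return cold, warm, reduction(cold, warm)


# Median cold and warm TTFT in ms, copied from the results table.
TABLE = {
    "Claude Fable 5": (690, 210),
    "GPT-5.5": (720, 240),
    "Gemini 3.1 Pro": (820, 360),
    "Grok 4": (640, 280),
    "DeepSeek V4 (Fireworks)": (350, 130),
}

# Recompute the Reduction column as whole percentages.
computed = {name: round(100 * reduction(c, w)) for name, (c, w) in TABLE.items()}
for name, pct in computed.items():
    print(f"{name}: {pct}%")
```

Using the median for each side keeps a few slow outliers, such as a cross-region request, from skewing the reported reduction.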

How we grade evidence

Get these numbers for your traffic.

A diagnostic runs this analysis on your own workload and attaches an evidence level to every figure.