TTFT savings from a warm prefix
How much faster is time-to-first-token when the prefix is already cached?
Prefill dominates first-token latency for long agent prompts. This suite measures cold versus cache-warm TTFT for each catalog model, which is the latency half of the case for prefix reuse.
What the corpus shows
| Model | Cold TTFT | Warm TTFT | Reduction | Note |
|---|---|---|---|---|
| Claude Fable 5 | 690 ms | 210 ms | 70% | - |
| GPT-5.5 | 720 ms | 240 ms | 67% | - |
| Gemini 3.1 Pro | 820 ms | 360 ms | 56% | - |
| Grok 4 | 640 ms | 280 ms | 56% | - |
| DeepSeek V4 (Fireworks) | 350 ms | 130 ms | 63% | - |
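
The Reduction column can be recomputed from the two TTFT columns. The sketch below assumes the reduction is defined as (cold - warm) / cold and rounded to the nearest percent. The function name `reduction_pct` is illustrative and not part of the suite.

```python
# Median TTFT values in milliseconds, copied from the table above.
rows = {
    "Claude Fable 5": (690, 210),
    "GPT-5.5": (720, 240),
    "Gemini 3.1 Pro": (820, 360),
    "Grok 4": (640, 280),
    "DeepSeek V4 (Fireworks)": (350, 130),
}


def reduction_pct(cold_ms: float, warm_ms: float) -> float:
    """Share of cold TTFT removed by a warm prefix, in percent."""
    return 100.0 * (cold_ms - warm_ms) / cold_ms


# Assumed definition: reduction = (cold - warm) / cold.
reductions = {
    model: reduction_pct(cold, warm) for model, (cold, warm) in rows.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct:.0f}%")
```

Under that assumed definition, the rounded values match the percentages listed in the table.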
Takeaways
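- A warm prefix cuts median first-token latency by 56% to 70% across the models measured, roughly half to two-thirds.
- Open-weights models on dedicated capacity start lower and stay lower: DeepSeek V4 (Fireworks) has the lowest cold (350 ms) and warm (130 ms) TTFT in the table.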
- TTFT savings justify reuse even when token cost is already low.
Methodology
We send paired cold and warm requests for an identical stable prefix, sampled across regions during steady-state load. We report median TTFT and the reduction from cold to warm, computed as (cold - warm) / cold.
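
A minimal sketch of how one such measurement could be taken, assuming TTFT is the wall-clock time from issuing a request to receiving the first streamed chunk. The names `ttft_ms`, `summarize`, and `fake_stream` are hypothetical. `fake_stream` stands in for a real streaming model call and only sleeps to mimic prefill, so the numbers it produces are not benchmark results.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def ttft_ms(open_stream: Callable[[], Iterable[str]]) -> float:
    """Time from issuing the request until the first streamed chunk arrives."""
    start = time.perf_counter()
    stream = iter(open_stream())
    next(stream)  # blocks until the first chunk is produced
    return (time.perf_counter() - start) * 1000.0


def summarize(cold_ms: list[float], warm_ms: list[float]) -> dict[str, float]:
    """Median cold and warm TTFT plus the cold-to-warm reduction."""
    cold = statistics.median(cold_ms)
    warm = statistics.median(warm_ms)
    return {"cold_ms": cold, "warm_ms": warm, "reduction": (cold - warm) / cold}


def fake_stream(prefill_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model call."""
    time.sleep(prefill_s)  # mimics prefill work before the first token
    yield "first"
    yield "second"


# Five paired samples; a real run would send the same stable prefix twice,
# once against a cold cache and once against a warm one.
cold_samples = [ttft_ms(lambda: fake_stream(0.03)) for _ in range(5)]
warm_samples = [ttft_ms(lambda: fake_stream(0.01)) for _ in range(5)]
summary = summarize(cold_samples, warm_samples)
print(summary)
```

The median is used instead of the mean so that a few slow outliers, for example from a congested region, do not dominate the reported figure.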
How we grade evidence
Get these numbers for your traffic.
A diagnostic runs this analysis on your own workload and attaches an evidence level to every figure.