Benchmark · ms / %

TTFT savings from a warm prefix

How much faster is time-to-first-token (TTFT) when the prefix is already cached?

Prefill dominates first-token latency for long agent prompts. This suite measures cold versus cache-warm TTFT for each catalog model, which is the latency half of the reuse case.
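As a rough illustration of what a single TTFT sample is, the sketch below times the gap between issuing a request and receiving the first streamed chunk. It is not the suite's harness: `measure_ttft_ms` and `fake_stream` are hypothetical names, and the simulated stream only stands in for a provider's streaming response, with a sleep playing the role of prefill.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from issuing the request to the first streamed chunk.

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable of output chunks.
    """
    t0 = time.perf_counter()
    stream = iter(start_stream())
    first = next(stream, None)  # blocks until the first chunk arrives
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    if first is None:
        raise ValueError("stream ended before producing a first chunk")
    return elapsed_ms


def fake_stream(prefill_seconds: float) -> Iterator[str]:
    """Simulated response: sleeps to imitate prefill, then yields chunks."""
    time.sleep(prefill_seconds)
    yield "first"
    yield "second"


if __name__ == "__main__":
    # Assumed delays for illustration only; a cached prefix shortens prefill.
    cold_ms = measure_ttft_ms(lambda: fake_stream(0.05))
    warm_ms = measure_ttft_ms(lambda: fake_stream(0.01))
    print(f"cold {cold_ms:.0f} ms, warm {warm_ms:.0f} ms")
```

Timing starts before the request is created, so connection setup and queueing count toward TTFT alongside prefill, which is usually what a caller experiences.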

Results

What the corpus shows.

Model	Cold TTFT	Warm TTFT	Reduction	Note
Claude Fable 5	690 ms	210 ms	70%	-
GPT-5.5	720 ms	240 ms	67%	-
Gemini 3.1 Pro	820 ms	360 ms	56%	-
Grok 4	640 ms	280 ms	56%	-
DeepSeek V4 (Fireworks)	350 ms	130 ms	63%	-

Takeaways

A warm prefix cuts median first-token latency by 56% to 70% across the five models tested.
The open-weights entry on dedicated capacity, DeepSeek V4 (Fireworks), starts lowest (350 ms cold) and stays lowest (130 ms warm).
TTFT savings justify reuse even when token cost is already low.

Methodology

Each measurement pairs a cold request with a warm request for an identical stable prefix, sampled across regions during steady-state load. We report median TTFT and the warm-over-cold reduction, computed as (cold - warm) / cold.
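The reported figures can be reproduced from raw samples with a few lines of arithmetic. The sketch below is a minimal version under stated assumptions: the function name `summarize_ttft` is hypothetical, and the reduction is taken as (cold - warm) / cold, which matches the Reduction column in the results table.

```python
from statistics import median
from typing import Sequence, Tuple


def summarize_ttft(
    cold_ms: Sequence[float], warm_ms: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (median cold TTFT, median warm TTFT, reduction in percent)."""
    cold = median(cold_ms)
    warm = median(warm_ms)
    reduction_pct = (cold - warm) / cold * 100.0
    return cold, warm, reduction_pct


if __name__ == "__main__":
    # Published medians from the results table, passed as one-sample lists.
    table = {
        "Claude Fable 5": (690, 210),
        "GPT-5.5": (720, 240),
        "Gemini 3.1 Pro": (820, 360),
        "Grok 4": (640, 280),
        "DeepSeek V4 (Fireworks)": (350, 130),
    }
    for model, (cold, warm) in table.items():
        _, _, pct = summarize_ttft([cold], [warm])
        print(f"{model}: {round(pct)}%")
```

Medians are used rather than means so that a few slow outliers, such as a regional hiccup, do not distort the comparison.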

Prompt-cache capture by provider

Of the reuse a workload could capture, how much do providers actually deliver?

Reuse opportunity by workload type

Which agent workloads actually have reusable structure?

Get these numbers for your traffic.

A diagnostic runs this analysis on your own workload and attaches an evidence level to every figure.
