Glossary

TTFT (time to first token)

The latency from sending a request to receiving the first generated token, dominated by prefill on long prompts.

Time to first token is what users feel as responsiveness. For short prompts it is small; for long agent prompts it is dominated by prefill - the cost of reading the whole input before decoding starts.

A warm prefix cuts prefill, so TTFT drops sharply when caching hits. In our benchmark, a warm prefix reduces TTFT by roughly half to two-thirds.

Related terms

Keep reading.

Prefill

The phase where a model reads and encodes the input prompt before it begins generating output tokens.

Prompt caching

Reusing the computed state of a repeated prompt prefix so it is billed at a reduced cache-read rate instead of being recomputed.

KV cache

The stored key/value attention tensors a model computes during prefill, kept so the same prefix does not have to be recomputed.

See it in practice.

Definitions are useful; measurement is better. Run a diagnostic on your own workload.

Run a diagnostic Back to glossary