Glossary

Prefill

The phase where a model reads and encodes the input prompt before it begins generating output tokens.

Inference splits into prefill (reading the prompt) and decode (generating tokens). Prefill cost scales with input length; decode cost scales with output length.

Agent workloads are prefill-heavy: long, stable instructions and tool definitions paired with short outputs. That asymmetry is why reuse matters: most of the bill is the part that repeats.
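To make the asymmetry concrete, here is a minimal sketch that estimates how much of a request's cost falls on the prefill side. The function name, the token counts, and the relative per-token prices are all hypothetical, chosen only for illustration; real pricing and real token counts vary by provider and workload.

```python
def prefill_cost_share(
    input_tokens: int,
    output_tokens: int,
    input_price: float = 1.0,   # hypothetical relative price per input token
    output_price: float = 1.0,  # hypothetical relative price per output token
) -> float:
    """Return the fraction of a request's cost attributable to input (prefill) tokens."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    total = input_cost + output_cost
    return input_cost / total if total else 0.0


# Illustrative agent step (made-up numbers): a long prompt of instructions
# and tool definitions, followed by a short tool call as output.
equal_price_share = prefill_cost_share(12_000, 300)

# Same step, assuming output tokens cost five times as much as input tokens.
weighted_share = prefill_cost_share(12_000, 300, input_price=1.0, output_price=5.0)
```

Even when output tokens are assumed to be several times more expensive, the input side still accounts for most of the cost in this example, which is the pattern the entry describes.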

Related terms

Keep reading.

TTFT (time to first token)

The latency from sending a request to receiving the first generated token, dominated by prefill on long prompts.

KV cache

The stored key/value attention tensors a model computes during prefill, kept so the same prefix does not have to be recomputed.

Reuse opportunity

The maximum share of input tokens that could be served from cache, independent of whether they actually were.
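As a rough illustration of reuse opportunity, the sketch below computes an upper bound for a sequence of requests. It assumes prefix-based caching (a token is reusable only if it sits in a prefix shared with an earlier request), an unlimited cache that never expires, and no minimum cacheable length. These are simplifying assumptions, not a description of any particular provider's cache, and the function names are hypothetical.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_opportunity(requests: Sequence[Sequence[int]]) -> float:
    """Upper bound on the share of input tokens that could come from cache.

    For each request, the reusable tokens are the longest prefix it shares
    with any earlier request. Assumes an idealized, unbounded prefix cache.
    """
    total = 0
    reusable = 0
    for i, req in enumerate(requests):
        total += len(req)
        reusable += max(
            (common_prefix_len(req, prev) for prev in requests[:i]),
            default=0,  # the first request has nothing earlier to reuse
        )
    return reusable / total if total else 0.0


# Toy token IDs: the second request shares 3 leading tokens with the first,
# the third shares 2, so 5 of 12 input tokens are reusable in principle.
example = [[1, 2, 3, 4], [1, 2, 3, 5, 6], [1, 2, 9]]
share = reuse_opportunity(example)
```

The result is a ceiling rather than a measurement: it says what could have been served from cache under these assumptions, regardless of what actually was.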

See it in practice.

Definitions are useful; measurement is better. Run a diagnostic on your own workload.
