The vocabulary of inference reuse

No jargon for jargon's sake. Each term gets a one-sentence definition you can quote and a link into the product where the concept actually lives.

Alias release

An immutable, versioned snapshot of an alias’s resolution policy, so a routing decision can be reproduced exactly.

BYOC (bring your own cloud)

Running the inference data plane inside the customer’s cloud for dedicated SLOs, isolation, and explicit KV orchestration.

BYOK (bring your own key)

Using the customer’s own provider credentials so billing, quotas, and retention follow their provider agreements.

Cache capture rate

Realized reused tokens divided by candidate reusable tokens: the fraction of available reuse that a provider actually delivered.
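As a quick illustration, the ratio can be sketched in a few lines. The function name and the zero-candidate behavior are assumptions made for this example, not part of the product.

```python
def cache_capture_rate(realized_reused_tokens: int,
                       candidate_reusable_tokens: int) -> float:
    """Realized reused tokens divided by candidate reusable tokens."""
    if candidate_reusable_tokens <= 0:
        # Assumption: with no reusable candidates there is nothing to capture.
        return 0.0
    return realized_reused_tokens / candidate_reusable_tokens


# 8,000 tokens could have been served from cache; the provider reused 6,000.
rate = cache_capture_rate(6_000, 8_000)
print(f"capture rate: {rate:.2f}")  # capture rate: 0.75
```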

Evidence level

A label for how trustworthy a reuse measurement is, from provider_reported down to trace_estimated and unknown.

KV cache

The stored key/value attention tensors a model computes during prefill, kept so the same prefix does not have to be recomputed.

Model alias

A stable logical name like code.fast that resolves through an immutable release to a concrete provider model.
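A minimal sketch of the idea, assuming a release is a frozen record keyed by alias and version. The class, the registry, and the provider model names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a release cannot be mutated after creation
class AliasRelease:
    alias: str
    version: int
    provider_model: str  # hypothetical concrete target


# Hypothetical registry: each (alias, version) pair maps to exactly one release.
RELEASES = {
    ("code.fast", 1): AliasRelease("code.fast", 1, "example-provider/small-coder-v1"),
    ("code.fast", 2): AliasRelease("code.fast", 2, "example-provider/small-coder-v2"),
}


def resolve(alias: str, version: int) -> str:
    """Resolve a logical alias through a pinned release to a concrete model."""
    return RELEASES[(alias, version)].provider_model


print(resolve("code.fast", 1))  # example-provider/small-coder-v1
```

Pinning the version is what makes a routing decision reproducible in this sketch: the same (alias, version) pair always resolves to the same model, even after a newer release is published.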

Opaque handle

A random, tenant-scoped identifier (art_, bnd_, ses_…) for reusable state that never exposes a content hash.
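One way such a handle could be minted is sketched below, assuming a simple in-memory ownership map. The helper names and the registry are hypothetical; only the art_ prefix comes from the definition above.

```python
import secrets


def new_handle(prefix: str, tenant_id: str, registry: dict) -> str:
    """Mint a random handle and record which tenant owns it."""
    # The token is pure randomness; it is never derived from the content,
    # so the handle reveals nothing about what it points to.
    handle = f"{prefix}{secrets.token_urlsafe(16)}"
    registry[handle] = tenant_id
    return handle


def owned_by(handle: str, tenant_id: str, registry: dict) -> bool:
    """Tenant scoping: a handle is only valid for the tenant that created it."""
    return registry.get(handle) == tenant_id


registry = {}
handle = new_handle("art_", "tenant-a", registry)
print(owned_by(handle, "tenant-a", registry))  # True
print(owned_by(handle, "tenant-b", registry))  # False
```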

Prefill

The phase where a model reads and encodes the input prompt before it begins generating output tokens.

Prompt caching

Reusing the computed state of a repeated prompt prefix so it is billed at a reduced cache-read rate instead of being recomputed.
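The billing effect can be sketched with two per-token rates. The rates below are made-up numbers for illustration, and the sketch ignores any cache-write surcharge a provider might apply.

```python
def input_cost(total_input_tokens: int, cached_tokens: int,
               base_rate_per_mtok: float, cache_read_rate_per_mtok: float) -> float:
    """Input cost when part of the prompt is billed at the cache-read rate."""
    uncached_tokens = total_input_tokens - cached_tokens
    return (uncached_tokens * base_rate_per_mtok
            + cached_tokens * cache_read_rate_per_mtok) / 1_000_000


# Hypothetical rates: 3.00 per million input tokens, 0.30 per million cache reads.
cold = input_cost(10_000, 0, 3.00, 0.30)      # nothing served from cache
warm = input_cost(10_000, 8_000, 3.00, 0.30)  # 8,000-token prefix reused
print(f"cold={cold:.4f} warm={warm:.4f}")
```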

Purge receipt

Signed evidence describing what a purge job deleted, under which profile, and what retention remains.

Replay

A controlled experiment that re-runs a captured workload shape against candidate execution profiles to compare cost, latency, and capture.

Retention locality

The share of repeated requests that recur within the provider or runtime cache-retention window, so the cache is still warm.
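A rough sketch of how this share might be estimated from a trace of (timestamp, prefix id) pairs. It assumes the retention window is measured from the previous occurrence of the same prefix, which is a simplification; real providers may refresh or expire entries differently.

```python
def retention_locality(events: list, window_s: float) -> float:
    """Share of repeated requests that arrive while the cache is still warm."""
    last_seen = {}  # prefix id -> timestamp of its most recent request
    repeats = 0     # requests whose prefix was seen before
    warm = 0        # repeats that landed inside the retention window
    for timestamp, prefix_id in sorted(events):
        previous = last_seen.get(prefix_id)
        if previous is not None:
            repeats += 1
            if timestamp - previous <= window_s:
                warm += 1
        last_seen[prefix_id] = timestamp
    return warm / repeats if repeats else 0.0


# Timestamps in seconds; an assumed 300-second retention window.
trace = [(0, "a"), (100, "a"), (1000, "a"), (50, "b"), (60, "b")]
print(f"{retention_locality(trace, 300):.2f}")  # 0.67
```

In this trace, three requests are repeats and two of them recur within 300 seconds, so two thirds of the repeat traffic would find a warm cache.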

Reuse opportunity

The maximum share of input tokens that could be served from cache, independent of whether they actually were.
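As a sketch, the upper bound can be approximated by counting, for each request, the longest prefix it shares with any earlier request. This assumes prefix-based caching with unlimited retention; the function names are illustrative only.

```python
def common_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_opportunity(requests: list) -> float:
    """Share of input tokens that an ideal prefix cache could have served."""
    seen = []
    reusable = 0
    total = 0
    for tokens in requests:
        total += len(tokens)
        # Best case: reuse the longest prefix shared with any earlier request.
        reusable += max((common_prefix_len(tokens, p) for p in seen), default=0)
        seen.append(tokens)
    return reusable / total if total else 0.0


requests = [[1, 2, 3, 4], [1, 2, 3, 9], [1, 2, 7, 7]]
print(f"{reuse_opportunity(requests):.2f}")  # 0.42
```

Here 5 of the 12 input tokens repeat an earlier prefix, regardless of whether any cache actually served them.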

TTFT (time to first token)

The latency from sending a request to receiving the first generated token, dominated by prefill on long prompts.
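A minimal way to measure this against any token stream is sketched below. The fake_stream generator stands in for a real streaming client and is purely hypothetical; it sleeps to simulate prefill before yielding its first token.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_ttft(stream: Iterable[str]) -> Optional[float]:
    """Seconds from starting to consume the stream until the first token arrives."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    return None  # the stream ended without producing a token


def fake_stream(prefill_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: the delay simulates prefill."""
    time.sleep(prefill_s)
    yield "Hello"
    yield " world"


ttft = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

With a real client, the timer should start at the moment the request is sent; a lazy generator like the one above only begins its work when iteration starts, so the two coincide here.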

Workload Reuse Score (WRS)

A 0-100 score of how much a workload can benefit from reuse, built from opportunity, recurrence, locality, latency sensitivity, continuity, and payload redundancy.
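The actual weighting is not given here, so the following is only a sketch of how six signals could be blended into a 0 to 100 score. The equal weights, the 0 to 1 normalization of each signal, and the function name are all assumptions.

```python
COMPONENTS = (
    "opportunity",
    "recurrence",
    "locality",
    "latency_sensitivity",
    "continuity",
    "payload_redundancy",
)


def workload_reuse_score(signals: dict, weights: dict = None) -> float:
    """Blend six signals (each assumed to lie in 0..1) into a 0..100 score."""
    # Assumption: equal weights unless the caller supplies others.
    weights = weights or {name: 1.0 for name in COMPONENTS}
    total_weight = sum(weights[name] for name in COMPONENTS)
    blended = sum(
        weights[name] * min(max(signals[name], 0.0), 1.0)  # clamp to 0..1
        for name in COMPONENTS
    ) / total_weight
    return round(100 * blended, 1)


signals = {name: 0.5 for name in COMPONENTS}
print(workload_reuse_score(signals))  # 50.0
```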

Put these concepts to work

Run a diagnostic and watch reuse opportunity, capture rate, and TTFT savings come alive on your own traffic.
