Scoring a workload before you change infrastructure

For a while the industry used a lazy heuristic: long prompts mean you should self-host. It is wrong often enough to be dangerous. Plenty of long-prompt workloads have terrible retention locality, and plenty of medium-prompt workloads recur fast enough to cache beautifully.

So we replaced the heuristic with a score.

Six components, one number

The Workload Reuse Score blends six components: opportunity ratio, recurrence, retention locality, TTFT sensitivity, session continuity, and payload redundancy. Each component is normalized, then the six are combined into a single value from 0 to 100 that you can compare across workloads.

A coding agent with stable scaffolding and tight recurrence scores high. A consumer chat app with weak locality scores low, and that is the right answer even if its prompts happen to be long.
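As a rough illustration of the blend, here is a minimal sketch in Python. It assumes each component has already been normalized to the range 0.0 to 1.0 and that all six carry equal weight. The article does not state the real weighting or normalization, so the weights, the names ReuseComponents and workload_reuse_score, and the example values are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReuseComponents:
    # Assumption: every field is already normalized to 0.0-1.0.
    opportunity_ratio: float
    recurrence: float
    retention_locality: float
    ttft_sensitivity: float
    session_continuity: float
    payload_redundancy: float


# Hypothetical equal weights; the real weighting is not given in the article.
EQUAL_WEIGHTS = {f.name: 1.0 for f in fields(ReuseComponents)}


def workload_reuse_score(components, weights=EQUAL_WEIGHTS):
    """Weighted average of the normalized components, scaled to 0-100."""
    total_weight = sum(weights.values())
    blended = 0.0
    for name, weight in weights.items():
        value = getattr(components, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to 0.0-1.0, got {value}")
        blended += weight * value
    return 100.0 * blended / total_weight


# Illustrative inputs only, not measurements from any real workload.
coding_agent = ReuseComponents(0.9, 0.9, 0.8, 0.7, 0.8, 0.9)
consumer_chat = ReuseComponents(0.3, 0.2, 0.1, 0.4, 0.2, 0.3)

print(round(workload_reuse_score(coding_agent), 1))
print(round(workload_reuse_score(consumer_chat), 1))
```

Note that prompt length is not an input at all. In this sketch a long-prompt workload with weak locality still lands low, which matches the point above.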

Score and readiness are different questions

A high score means there is reuse to capture. It does not mean you should run your own GPUs. We track deployment readiness separately: provider capability, BYOK feasibility, security approvals, region constraints, traffic concentration, and the operational appetite to run infrastructure.

Keeping those two questions apart stops the most common bad decision in this space, which is buying a hot lane because the prompts were big.
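One way to keep the two questions apart in practice is to model them as separate values that are never merged into one number. The sketch below is an assumption-heavy illustration: the DeploymentReadiness fields mirror the readiness factors listed above, but treating each as a simple yes/no, requiring all of them to pass, and using 70 as the "high score" cutoff are choices made for this example only.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DeploymentReadiness:
    # Assumption: each readiness factor is reduced to a yes/no answer.
    provider_capability: bool
    byok_feasible: bool
    security_approved: bool
    region_constraints_met: bool
    traffic_concentrated: bool
    ops_appetite: bool

    def blockers(self):
        """Names of the readiness factors that are not yet satisfied."""
        return [name for name, ok in asdict(self).items() if not ok]


def assess(score, readiness, high_score=70):
    """Report reuse opportunity and readiness as two separate answers."""
    blockers = readiness.blockers()
    return {
        "reuse_to_capture": score >= high_score,  # hypothetical cutoff
        "ready_to_deploy": not blockers,
        "blockers": blockers,
    }


# Illustrative case: strong score, but two readiness factors still open.
readiness = DeploymentReadiness(True, True, False, True, True, False)
print(assess(85, readiness))
```

The design point is that a high score never flips ready_to_deploy to true, and a fully ready organization never inflates the score.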

What to do at each band

At 70 and above, prioritize an optimization pilot. From 45 to 69, run a diagnostic and tune providers. From 20 to 44, fix prompt construction first. Below 20, do not sell yourself a caching project; the opportunity is not there.
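The bands translate directly into a lookup. This sketch assumes a score of exactly 70 belongs to the top band and that fractional scores fall into the band whose lower bound they meet (so 69.5 is treated as part of the 45 to 69 band); the function name recommended_action is hypothetical.

```python
def recommended_action(score):
    """Map a 0-100 Workload Reuse Score to the next step for its band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 70:  # assumption: exactly 70 counts as the top band
        return "prioritize an optimization pilot"
    if score >= 45:
        return "run a diagnostic and tune providers"
    if score >= 20:
        return "fix prompt construction first"
    return "no caching project; the opportunity is not there"


for example_score in (83, 52, 31, 12):
    print(example_score, "->", recommended_action(example_score))
```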

