Engineering notes on reuse, routing, and cost
Working notes on reuse, caching, routing, and inference cost, most of them backed by numbers from real routed traffic.
A repeated prompt is not a cache hit
Why Zumik reports reuse opportunity and realized capture as two separate numbers, and what the gap between them tells you.
When a cache-routing key helps, and when it builds a hotspot
Pinning requests to a cache by key is the fastest way to raise hit rate and one of the easier ways to melt a single node. The tradeoff, measured.
Scoring a workload before you change infrastructure
How the Workload Reuse Score is built from six components, and why prompt length alone never justifies self-hosting.
How to make model aliases actually reproducible
A logical name like code.fast is only useful if you can explain, months later, exactly which model answered and why. That takes immutable releases, not a config flag.
How to benchmark agent workloads without storing raw prompts
You can measure reuse, capture, and TTFT honestly from metadata alone. Storing raw prompts is usually a liability you do not need to take on.
Anthropic vs OpenAI prompt caching, measured
Explicit breakpoints versus automatic prefix matching: which captures more, what each one punishes, and how to choose.
When fetching a remote KV cache loses to just recomputing it
Pulling a cached KV block over the network is only a win when the transfer is cheaper than the prefill it replaces. Often it is not.
When bringing your own cloud actually pays off
BYOC is an escalation, not a default. Here is the replay evidence we want to see before moving a lane off managed providers.
When hierarchical KV storage improves TTFT, and when it just adds tiers
A KV hierarchy helps time-to-first-token only when retention locality and prefill cost line up. Otherwise it is complexity without a latency payoff.
Prompt ordering is the cheapest optimization you are skipping
A practical ordering for agent prompts that maximizes cache hits without changing a single line of infrastructure.
Half your background tokens belong on a batch tier
Non-interactive traffic is the most overpaid line in most inference bills. Moving it to batch is a 50% discount waiting to be taken.
Measure your own reuse
The posts explain the method; a diagnostic applies it to your traffic.