Engineering notes on reuse, routing, and cost

Working notes on reuse, caching, routing, and inference cost, most of them backed by numbers from real routed traffic.

Diagnostics · 7 min read

A repeated prompt is not a cache hit

Why Zumik reports reuse opportunity and realized capture as two separate numbers, and what the gap between them tells you.

Architecture · 8 min read

When a cache-routing key helps, and when it builds a hotspot

Pinning requests to a cache by key is the fastest way to raise hit rate and one of the easier ways to melt a single node. The tradeoff, measured.

Diagnostics · 8 min read

Scoring a workload before you change infrastructure

How the Workload Reuse Score is built from six components, and why prompt length alone never justifies self-hosting.

Engineering · 7 min read

How to make model aliases actually reproducible

A logical name like code.fast is only useful if you can explain, months later, exactly which model answered and why. That takes immutable releases, not a config flag.

Diagnostics · 8 min read

How to benchmark agent workloads without storing raw prompts

You can measure reuse, capture, and TTFT honestly from metadata alone. Storing raw prompts is usually a liability you do not need to take on.

Providers · 9 min read

Anthropic vs OpenAI prompt caching, measured

Explicit breakpoints versus automatic prefix matching: which captures more, what each one punishes, and how to choose.

Architecture · 7 min read

When fetching a remote KV cache loses to just recomputing it

Pulling a cached KV block over the network is only a win when the transfer is cheaper than the prefill it replaces. Often it is not.

Architecture · 8 min read

When bringing your own cloud actually pays off

BYOC is an escalation, not a default. Here is the replay evidence we want to see before moving a lane off managed providers.

Architecture · 8 min read

When hierarchical KV storage improves TTFT, and when it just adds tiers

A KV hierarchy helps time-to-first-token only when retention locality and prefill cost line up. Otherwise it is complexity without a latency payoff.

Engineering · 6 min read

Prompt ordering is the cheapest optimization you are skipping

A practical ordering for agent prompts that maximizes cache hits without changing a single line of infrastructure.

Cost · 5 min read

Half your background tokens belong on a batch tier

Non-interactive traffic is the most overpaid line in most inference bills. Moving it to batch is a 50% discount waiting to be taken.

Measure your own reuse

The posts explain the method; a diagnostic applies it to your traffic.