From trace to reuse-aware routing

Zumik sits between your agents and the providers. It measures what repeats, preserves the stable parts as reusable state, and routes each request through the cheapest reliable path, recording enough evidence that any decision can be explained or replayed later.

Logical state is not physical KV state.

The most important architectural decision in Zumik is splitting identity into three layers. It keeps customer handles stable while preventing cache implementation details from leaking into the product.

Layer 1

Logical identity

Artifacts, bundles, sessions, branches, snapshots. Customer-visible, opaque, independent of provider, model, or tokenizer.

snapshot_id

Layer 2

Materialization identity

The exact model-visible byte representation: tokenizer, prompt-compiler version, ordered block manifest. Two requests can share logical state but materialize differently.

materialization_key

Layer 3

KV realization compatibility

Whether an existing physical KV cache can be reused safely: model revision, quantization, engine, GPU topology, isolation namespace. The implementation detail that never leaks into product semantics.

kv_compatibility_key

Why

Two requests can share the same logical artifact yet need different KV realizations: a different tokenizer, a different quantization, or managed versus BYOC deployment. Collapsing these layers is how gateways end up leaking cache details into their API. Zumik refuses to.
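The three layers can be sketched as separate keys derived from separate inputs. This is a minimal illustration, not Zumik's implementation: the class names, the `_digest` helper, the key prefixes, and the hashing scheme are all assumptions. Only the field lists and the three identifier names come from the descriptions above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def _digest(prefix: str, payload: dict) -> str:
    """Hypothetical helper: hash a dict deterministically into a short key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"{prefix}_{hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:16]}"


@dataclass(frozen=True)
class Materialization:
    """Layer 2: the exact model-visible representation."""
    tokenizer: str
    prompt_compiler_version: str
    block_manifest: tuple[str, ...]  # ordered, so reordering changes the key

    def materialization_key(self) -> str:
        return _digest("mat", asdict(self))


@dataclass(frozen=True)
class KVCompatibility:
    """Layer 3: whether a physical KV cache can be reused safely."""
    model_revision: str
    quantization: str
    engine: str
    gpu_topology: str
    isolation_namespace: str

    def kv_compatibility_key(self) -> str:
        return _digest("kvc", asdict(self))


@dataclass(frozen=True)
class RequestIdentity:
    snapshot_id: str  # Layer 1: opaque, customer-visible handle
    materialization: Materialization
    kv: KVCompatibility


# Two requests pin the same snapshot but use different tokenizers
# (all values below are made up for illustration).
kv = KVCompatibility("rev-1", "fp8", "engine-x", "1xGPU", "tenant-a")
req_a = RequestIdentity("snap_1", Materialization("tok-a", "pc-1", ("system", "tools")), kv)
req_b = RequestIdentity("snap_1", Materialization("tok-b", "pc-1", ("system", "tools")), kv)

same_logical = req_a.snapshot_id == req_b.snapshot_id
same_materialization = (
    req_a.materialization.materialization_key()
    == req_b.materialization.materialization_key()
)
print(same_logical, same_materialization)
```

Because each key is computed only from its own layer's fields, a tokenizer change never alters `snapshot_id`, and a quantization change never alters `materialization_key`.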

Opportunity vs. capture

A repeated prefix is not a cache hit.

Zumik reports what could be reused (opportunity) separately from what was reused (capture), and attaches an evidence level to every number so a prediction is never mistaken for a measurement.

Example waterfall (per request)
Total input tokens: 100%
Eligible reuse: 78%
Candidate reuse: 66%
Realized reuse: 41%
Missed gap: 25%
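The arithmetic behind the example waterfall can be sketched as follows. The class and field names are hypothetical, and the token counts are invented to match the percentages shown. Reading the missed gap as candidate reuse minus realized reuse is an inference from the example figures (66% - 41% = 25%), not a documented definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseWaterfall:
    """Hypothetical per-request token counts for each waterfall stage."""
    total_input_tokens: int
    eligible_tokens: int
    candidate_tokens: int
    realized_tokens: int

    def __post_init__(self) -> None:
        # Each stage must be a subset of the stage above it.
        ordered = (
            0
            <= self.realized_tokens
            <= self.candidate_tokens
            <= self.eligible_tokens
            <= self.total_input_tokens
        )
        if not ordered or self.total_input_tokens == 0:
            raise ValueError("stages must be nested and total must be positive")

    def _pct(self, tokens: int) -> float:
        return round(100.0 * tokens / self.total_input_tokens, 1)

    def report(self) -> dict[str, float]:
        # Assumption: the missed gap is the opportunity that was
        # identified as a candidate but not actually captured.
        missed = self.candidate_tokens - self.realized_tokens
        return {
            "eligible_reuse_pct": self._pct(self.eligible_tokens),
            "candidate_reuse_pct": self._pct(self.candidate_tokens),
            "realized_reuse_pct": self._pct(self.realized_tokens),
            "missed_gap_pct": self._pct(missed),
        }


waterfall = ReuseWaterfall(
    total_input_tokens=10_000,
    eligible_tokens=7_800,
    candidate_tokens=6_600,
    realized_tokens=4_100,
)
report = waterfall.report()
print(report)
```

Keeping opportunity (eligible, candidate) and capture (realized) as separate fields means a report can never show a predicted number in the place of a measured one.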

Every decision leaves a record you can replay.

A request pins one snapshot and one alias release. An alias release is immutable: changing a provider-model revision creates a new release rather than mutating the old one. Customer logs expose the release id, so any past routing decision can be explained, and a replay run can reproduce it on the same workload shape.

resolution record
{
  "requested_model": "code.fast",
  "alias_release": "alr_2026_06_09_003",
  "resolved_model": "anthropic/claude-haiku-4-5",
  "resolution_reason": "lowest_expected_latency_under_policy",
  "trace_id": "trc_9f12…"
}
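One way to picture immutable alias releases is a registry that only ever appends. This is a sketch under stated assumptions: `AliasRegistry`, `publish`, and `resolve` are hypothetical names, the second release id, its target model, and the trace id are invented, and the routing policy that produces `resolution_reason` is not modeled here. The record fields mirror the resolution record above.

```python
from types import MappingProxyType
from typing import Mapping


class AliasRegistry:
    """Hypothetical append-only store of alias releases."""

    def __init__(self) -> None:
        self._releases: dict[str, Mapping[str, str]] = {}

    def publish(self, release_id: str, targets: Mapping[str, str]) -> None:
        # A release id is never reused; changes require a new release.
        if release_id in self._releases:
            raise ValueError(f"release {release_id} already exists and is immutable")
        # Copy, then wrap read-only, so later edits by the caller cannot leak in.
        self._releases[release_id] = MappingProxyType(dict(targets))

    def resolve(
        self, requested_model: str, release_id: str, reason: str, trace_id: str
    ) -> dict[str, str]:
        targets = self._releases[release_id]
        return {
            "requested_model": requested_model,
            "alias_release": release_id,
            "resolved_model": targets[requested_model],
            "resolution_reason": reason,
            "trace_id": trace_id,
        }


registry = AliasRegistry()
registry.publish("alr_2026_06_09_003", {"code.fast": "anthropic/claude-haiku-4-5"})
# A later revision becomes a new release (id and target are made up).
registry.publish("alr_example_next", {"code.fast": "example/other-model"})

# Replaying an old request with its pinned release gives the old answer.
record = registry.resolve(
    "code.fast",
    "alr_2026_06_09_003",
    reason="lowest_expected_latency_under_policy",
    trace_id="trc_example",
)
print(record["resolved_model"])
```

Because a request pins the release id, publishing newer releases cannot change how an earlier decision resolves on replay.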

Measure first. Migrate from evidence.

Run a workload diagnostic on real traffic, then let the reuse waterfall decide what to optimize.