Architecture

From trace to reuse-aware routing

Zumik sits between your agents and the providers. It measures what repeats, preserves the stable parts as reusable state, and routes each request through the cheapest reliable path, recording enough evidence that any decision can be explained or replayed later.

Step 01

Ingest a trace

Send metadata-only traces, tokenized captures, or provider exports. No raw prompts are required to start measuring.

Step 02

Score the workload

Prefix-family analysis produces a Workload Reuse Score and a reuse waterfall: opportunity, candidate, realized, and the missed-opportunity gap.

Step 03

Model the state

Stable inputs become artifacts and bundles behind opaque handles. Sessions and branches give multi-turn flows a causal, conflict-safe history.

Step 04

Resolve an alias

Each request resolves a logical alias (code.fast, auto.best) through an immutable release, recording exactly which model answered and why.

Step 05

Execute and report

The broker picks a profile, captures provider-native caching, and returns a QoS outcome: admitted, degraded, missed, rejected, or expired.

Step 06

Prove deletion

Delete revokes handles; purge jobs remove the underlying state and emit profile-specific receipts that note any remaining retention window.

The core correction

Logical state is not physical KV state.

The most important architectural decision in Zumik is splitting identity into three layers. It keeps customer handles stable while preventing cache implementation details from leaking into the product.

Layer 1

Logical identity

Artifacts, bundles, sessions, branches, snapshots. Customer-visible, opaque, independent of provider, model, or tokenizer.

snapshot_id

Layer 2

Materialization identity

The exact model-visible byte representation: tokenizer, prompt-compiler version, ordered block manifest. Two requests can share logical state but materialize differently.

materialization_key

Layer 3

KV realization compatibility

Whether an existing physical KV cache can be reused safely: model revision, quantization, engine, GPU topology, isolation namespace. The implementation detail that never leaks into product semantics.

kv_compatibility_key

Why

Two requests can share the same logical artifact yet need different KV realizations: a different tokenizer, a different quantization, or managed versus BYOC deployment. Collapsing these layers is how gateways end up leaking cache details into their API. Zumik refuses to.
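The three-layer split can be sketched in Python as a pair of derived keys. This is a minimal illustration, assuming each key is a hash over the fields listed for its layer; the class names, field names, example values, and the hashing scheme are all hypothetical and not Zumik's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def _digest(prefix: str, fields: dict) -> str:
    """Hash a dict of fields into a short, stable, prefixed key (assumed scheme)."""
    canonical = json.dumps(fields, sort_keys=True).encode("utf-8")
    return f"{prefix}_{hashlib.sha256(canonical).hexdigest()[:16]}"


@dataclass(frozen=True)
class Materialization:
    """Layer 2: the exact model-visible representation of a snapshot."""
    snapshot_id: str               # Layer 1 handle: opaque, provider-independent
    tokenizer: str
    prompt_compiler_version: str
    block_manifest: tuple          # ordered block ids

    def key(self) -> str:
        return _digest("mat", asdict(self))


@dataclass(frozen=True)
class KvRealization:
    """Layer 3: everything that decides whether a physical KV cache is reusable."""
    materialization_key: str
    model_revision: str
    quantization: str
    engine: str
    gpu_topology: str
    isolation_namespace: str

    def key(self) -> str:
        return _digest("kvc", asdict(self))


# Same logical snapshot, two tokenizers: the handle matches, the materialization does not.
base = dict(snapshot_id="snap_demo", prompt_compiler_version="v1",
            block_manifest=("system", "docs"))
mat_a = Materialization(tokenizer="tok-a", **base)
mat_b = Materialization(tokenizer="tok-b", **base)

# Same materialization, two quantizations: the KV caches are not interchangeable.
shared = dict(materialization_key=mat_a.key(), model_revision="rev-1", engine="engine-x",
              gpu_topology="1xgpu", isolation_namespace="tenant-demo")
kv_fp16 = KvRealization(quantization="fp16", **shared)
kv_int8 = KvRealization(quantization="int8", **shared)
```

Only `snapshot_id` would appear in a customer-facing API under this sketch; the two derived keys stay internal, which is the point of keeping the layers separate.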

Opportunity vs. capture

A repeated prefix is not a cache hit.

Zumik reports what could be reused (opportunity) separately from what was reused (capture), and attaches an evidence level to every number so a prediction is never mistaken for a measurement.

Example waterfall (per request)

Total input tokens: 100%
Eligible reuse: 78%
Candidate reuse: 66%
Realized reuse: 41%
Missed gap: 25%
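The arithmetic behind the example waterfall can be sketched as follows. This is an illustrative sketch that assumes the missed gap is candidate reuse minus realized reuse, which matches the example figures (66% − 41% = 25%). The class name, field names, and token counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseWaterfall:
    total_input_tokens: int
    eligible_tokens: int    # opportunity: what could be reused
    candidate_tokens: int   # the narrower set of reuse candidates
    realized_tokens: int    # capture: what was actually reused

    def __post_init__(self):
        # Each stage narrows the previous one, so counts must never increase.
        stages = (self.total_input_tokens, self.eligible_tokens,
                  self.candidate_tokens, self.realized_tokens)
        if self.total_input_tokens <= 0 or self.realized_tokens < 0:
            raise ValueError("total must be positive and realized non-negative")
        if any(earlier < later for earlier, later in zip(stages, stages[1:])):
            raise ValueError("waterfall stages must be non-increasing")

    @property
    def missed_gap_tokens(self) -> int:
        # Assumed definition: reuse that was a candidate but was not captured.
        return self.candidate_tokens - self.realized_tokens

    def percentages(self) -> dict:
        """Express each stage as a whole-number share of total input tokens."""
        def pct(tokens: int) -> int:
            return round(100 * tokens / self.total_input_tokens)

        return {
            "eligible": pct(self.eligible_tokens),
            "candidate": pct(self.candidate_tokens),
            "realized": pct(self.realized_tokens),
            "missed_gap": pct(self.missed_gap_tokens),
        }


# Hypothetical request with 10,000 input tokens, chosen to match the example shares.
waterfall = ReuseWaterfall(total_input_tokens=10_000, eligible_tokens=7_800,
                           candidate_tokens=6_600, realized_tokens=4_100)
shares = waterfall.percentages()
```

Keeping eligible and realized counts as separate fields mirrors the opportunity-versus-capture distinction: a high eligible share with a low realized share shows up as a large missed gap, not as an inflated hit rate.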

Reproducibility

Every decision leaves a record you can replay.

A request pins one snapshot and one alias release. An alias release is immutable: changing a provider-model revision creates a new release rather than mutating the old one. Customer logs expose the release id, so any past routing decision can be explained, and a replay run can reproduce it on the same workload shape.

resolution record

{
  "requested_model": "code.fast",
  "alias_release": "alr_2026_06_09_003",
  "resolved_model": "anthropic/claude-haiku-4-5",
  "resolution_reason": "lowest_expected_latency_under_policy",
  "trace_id": "trc_9f12…"
}
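A minimal Python sketch of how immutable alias releases could yield a record like the one above. The `AliasRegistry` class, its methods, the second release id, the `example/other-model` route, and the `trc_demo` trace id are hypothetical, introduced only for illustration; only the record's field names come from the example.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class AliasRelease:
    release_id: str
    # alias -> (resolved_model, resolution_reason), exposed read-only
    routes: Mapping[str, Tuple[str, str]]


class AliasRegistry:
    """Stores releases append-only: publishing never mutates an existing release."""

    def __init__(self):
        self._releases = {}

    def publish(self, release_id: str, routes: dict) -> AliasRelease:
        if release_id in self._releases:
            raise ValueError(f"release {release_id} already exists and is immutable")
        release = AliasRelease(release_id, MappingProxyType(dict(routes)))
        self._releases[release_id] = release
        return release

    def resolve(self, requested_model: str, release_id: str, trace_id: str) -> dict:
        """Resolve an alias through the pinned release and return the audit record."""
        release = self._releases[release_id]
        resolved_model, reason = release.routes[requested_model]
        return {
            "requested_model": requested_model,
            "alias_release": release.release_id,
            "resolved_model": resolved_model,
            "resolution_reason": reason,
            "trace_id": trace_id,
        }


registry = AliasRegistry()
registry.publish("alr_2026_06_09_003", {
    "code.fast": ("anthropic/claude-haiku-4-5", "lowest_expected_latency_under_policy"),
})
# A routing change is published as a new release; the old one stays untouched.
registry.publish("alr_demo_next", {
    "code.fast": ("example/other-model", "lowest_expected_latency_under_policy"),
})

# Replaying against the originally pinned release reproduces the original decision.
record = registry.resolve("code.fast", "alr_2026_06_09_003", trace_id="trc_demo")
```

Because a request pins a release id instead of "latest", a replay under this sketch resolves to the same model even after newer releases exist.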

Measure first. Migrate from evidence.

Run a workload diagnostic on real traffic, then let the reuse waterfall decide what to optimize.
