Capability · escalation

Self-host only when replay proves it pays.

BYOC moves the inference data plane into your cloud with explicit KV orchestration. It is powerful and operationally heavy, so Zumik treats it as an evidence-gated escalation - earned with a replay run, not reached for by default.

The bar

Three reasons that survive scrutiny.

Dedicated latency SLO

A first-token target managed tiers cannot guarantee under your load.

Concentrated hot volume

Sustained traffic on one or two model paths, where warm caches pay off.

Isolation & purge evidence

Private networking, regional pinning, and runtime-confirmed deletion.

Gate

Before any lane moves, a replay run takes the captured workload shape, runs the managed baseline against the BYOC candidate, and emits a signed report. “Eighteen percent cheaper on the same traffic with equal capture” is a decision. A hunch is not.
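A minimal sketch of what such a gate could look like, assuming the replay has already produced per-lane totals. The `LaneResult` fields, the 10% saving threshold, the capture tolerance, and the HMAC-SHA256 signature are all illustrative assumptions, not Zumik's actual report format.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneResult:
    """Totals from replaying one captured workload on one lane (hypothetical shape)."""
    cost_usd: float      # total cost of the replayed traffic
    cached_tokens: int   # prompt tokens served from cache
    prompt_tokens: int   # all prompt tokens in the replay

    @property
    def capture(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0


def replay_report(baseline, candidate, key, min_saving=0.10, capture_tolerance=0.01):
    """Compare the managed baseline with the BYOC candidate and sign the verdict."""
    if baseline.cost_usd <= 0:
        raise ValueError("baseline cost must be positive")
    saving = 1.0 - candidate.cost_usd / baseline.cost_usd
    capture_ok = candidate.capture >= baseline.capture - capture_tolerance
    body = {
        "saving": round(saving, 4),
        "baseline_capture": round(baseline.capture, 4),
        "candidate_capture": round(candidate.capture, 4),
        # Assumed rule: move only on a material saving with no loss of capture.
        "move_lane": saving >= min_saving and capture_ok,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    return {**body, "signature": hmac.new(key, payload, hashlib.sha256).hexdigest()}


def verify_report(report, key):
    """Recompute the signature over everything except the signature field."""
    body = {k: v for k, v in report.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, report.get("signature", ""))
```

With a baseline of `LaneResult(100.0, 600, 1000)` and a candidate of `LaneResult(82.0, 600, 1000)`, the report records an 18% saving at equal capture and sets `move_lane` to true. Editing any field afterwards makes `verify_report` fail.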

Dynamo profile

Datacenter-scale KV orchestration.

For concentrated, high-volume lanes, Zumik composes best-in-class open infrastructure rather than reinventing the runtime.

NVIDIA Dynamo

Datacenter-scale orchestration with KV-aware routing and disaggregated prefill/decode.

SGLang + FlashInfer

Default runtime with RadixAttention and Cascade Attention for shared prefixes.

LMCache

Vendor-neutral KV management with pluggable storage backends.

Mooncake

RDMA-based zero-copy KV transfer across nodes.

AIBrix

Kubernetes-native control plane: distributed KV cache, autoscaler, LoRA, gateway.

Portable Kubernetes profile

When you want a vendor-neutral lane.

llm-d

Distributed serving with prefix-cache-aware routing - higher throughput and lower TTFT than round-robin.

KServe

Standardized, CNCF-hosted model inference on Kubernetes, with llm-d and vLLM backends.

Gateway API Inference Extension

Endpoint selection using prefix-cache status and queue depth.
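To make the idea concrete, here is a toy scorer that trades prefix-cache hits against queue depth. It is a sketch of the general technique only: the `Endpoint` shape, the `queue_penalty` weight, and the scoring formula are assumptions for illustration, not the Gateway API Inference Extension's actual algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    queue_depth: int = 0
    # Token sequences this replica is assumed to hold in its prefix cache.
    cached: list = field(default_factory=list)


def shared_prefix_len(a, b):
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_endpoint(prompt, endpoints, queue_penalty=0.05):
    """Prefer replicas with a warm prefix, discounted by how busy they are."""
    def score(ep):
        hit = max((shared_prefix_len(prompt, c) for c in ep.cached), default=0)
        return hit / max(len(prompt), 1) - queue_penalty * ep.queue_depth
    return max(endpoints, key=score)
```

A replica holding three of the prompt's four tokens wins over a cold, idle replica while its queue is short, and loses once the queue penalty outweighs the cache hit.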

The cache hierarchy runs GPU HBM → host RAM → local NVMe → an optional remote KV backend. The remote tier is a profile-specific optimization, not a universal dependency.
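The tier order can be sketched as a fastest-first lookup, with the remote backend attached only when a profile supplies one. The class below is a toy model under stated assumptions: plain dictionaries stand in for each tier, promotion to HBM on a hit is an assumed policy, and eviction is left out entirely.

```python
class TieredKVCache:
    """Toy model of the tier order; dicts stand in for real storage."""

    def __init__(self, remote=None):
        # Insertion order doubles as lookup order: fastest tier first.
        self.tiers = {"hbm": {}, "host_ram": {}, "nvme": {}}
        if remote is not None:
            self.tiers["remote"] = remote  # optional, profile-specific

    def put(self, key, value, tier="hbm"):
        self.tiers[tier][key] = value

    def get(self, key):
        """Return (tier_name, value) from the fastest tier holding key, else None."""
        for name, store in self.tiers.items():
            if key in store:
                value = store[key]
                # Assumed policy: promote hits so the next lookup is served from HBM.
                self.tiers["hbm"][key] = value
                return name, value
        return None
```

A block first written to NVMe is found there once, then served from HBM on the next lookup. A cache built without `remote` never consults a fourth tier.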

Honest default

Most teams should not do this yet.

If a diagnostic shows managed providers already capture your reuse, Zumik will tell you to stay put. That is the whole point of measuring first.

Frequently asked

BYOC, answered.

When is BYOC worth it?

For a dedicated latency SLO, concentrated hot-model volume, or strict isolation and purge requirements. Long prompts alone never justify it. The bar is a replay run showing a material gain.

What runtimes does Zumik use for BYOC?

A Dynamo profile (NVIDIA Dynamo with SGLang + FlashInfer, LMCache, and Mooncake, on AIBrix) for datacenter-scale KV orchestration, and a portable Kubernetes profile (llm-d + KServe + Gateway API Inference Extension).

Why not run BYOC by default?

Because managed-provider caching, batch tiers, and alias routing already capture most of the savings for typical workloads, and self-hosting carries real operational cost. BYOC is an escalation, not a starting point.

Can I run two routers at once?

No. One scheduler owns replica selection per profile. Zumik does not stack Dynamo Router and the Gateway API Inference Extension as simultaneous owners inside one path.
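The single-owner rule is easy to enforce as a config check. The sketch below assumes a hypothetical profile dictionary with a `replica_schedulers` list. Neither that key nor the scheduler names are part of any real Zumik, Dynamo, or Gateway API schema.

```python
def validate_replica_owners(profiles: dict) -> None:
    """Raise ValueError unless each profile names exactly one replica scheduler."""
    for name, cfg in profiles.items():
        owners = cfg.get("replica_schedulers", [])
        if len(owners) != 1:
            raise ValueError(
                f"profile {name!r} must have exactly one replica scheduler, got {owners}"
            )


# Hypothetical profiles: one owner each, so validation passes silently.
profiles = {
    "dynamo": {"replica_schedulers": ["dynamo-router"]},
    "portable-k8s": {"replica_schedulers": ["gateway-api-inference-extension"]},
}
validate_replica_owners(profiles)
```

Listing both routers under one profile, or none at all, raises `ValueError` before any traffic is routed.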

Decide BYOC with replay evidence

Score the workload, prove the gain with replay, then move only what the evidence supports.
