Self-host only when replay proves it pays.
BYOC moves the inference data plane into your cloud with explicit KV orchestration. It is powerful and operationally heavy, so Zumik treats it as an evidence-gated escalation: earned with a replay run, not reached for by default.
Three reasons that survive scrutiny.
Dedicated latency SLO
A first-token target managed tiers cannot guarantee under your load.
Concentrated hot volume
Sustained traffic on one or two model paths, where warm caches pay off.
Isolation & purge evidence
Private networking, regional pinning, and runtime-confirmed deletion.
Before any lane moves, a replay run takes the captured workload shape, runs the managed baseline against the BYOC candidate, and emits a signed report. “Eighteen percent cheaper on the same traffic with equal capture” is a decision. A hunch is not.
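As a rough illustration of that gate, the sketch below compares a managed baseline with a BYOC candidate on the same replayed traffic and signs the result. Everything in it is an assumption made for illustration: the `LaneResult` fields, the 10% savings threshold, the capture tolerance, and the HMAC-SHA256 signature are stand-ins, not Zumik's actual report format.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LaneResult:
    """Aggregate outcome of replaying one captured workload on one lane (hypothetical shape)."""
    cost_usd: float         # total cost for the replayed traffic
    cache_hit_ratio: float  # share of prompt tokens served from cache ("capture")


def replay_report(managed, byoc, signing_key, min_saving=0.10, capture_tolerance=0.01):
    """Compare both lanes on identical traffic and return a signed verdict."""
    saving = (managed.cost_usd - byoc.cost_usd) / managed.cost_usd
    # "Equal capture": the candidate must not lose cache reuse versus the baseline.
    capture_ok = byoc.cache_hit_ratio >= managed.cache_hit_ratio - capture_tolerance
    body = {
        "managed": asdict(managed),
        "byoc": asdict(byoc),
        "saving": round(saving, 4),
        "capture_ok": capture_ok,
        "move_lane": saving >= min_saving and capture_ok,
    }
    # Sign the canonical JSON so the report can be checked for tampering later.
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return body


managed = LaneResult(cost_usd=1000.0, cache_hit_ratio=0.62)
byoc = LaneResult(cost_usd=820.0, cache_hit_ratio=0.62)
report = replay_report(managed, byoc, signing_key=b"demo-key")
print(report["saving"], report["move_lane"])
```

With these illustrative numbers the candidate is 18% cheaper at equal capture, so the lane qualifies. A candidate that saves only 2% would fail the same gate.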
Datacenter-scale KV orchestration.
For concentrated, high-volume lanes, Zumik composes best-in-class open infrastructure rather than reinventing the runtime.
NVIDIA Dynamo
Datacenter-scale orchestration with KV-aware routing and disaggregated prefill/decode.
SGLang + FlashInfer
Default runtime with RadixAttention and Cascade Attention for shared prefixes.
LMCache
Vendor-neutral KV management with pluggable storage backends.
Mooncake
RDMA-based zero-copy KV transfer across nodes.
AIBrix
Kubernetes-native control plane: distributed KV cache, autoscaler, LoRA, gateway.
When you want a vendor-neutral lane.
llm-d
Distributed serving with prefix-cache-aware routing, for higher throughput and lower TTFT than round-robin.
KServe
Standardized CNCF inference on Kubernetes with llm-d and vLLM backends.
Gateway API Inference Extension
Endpoint selection using prefix-cache status and queue depth.
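To make endpoint selection concrete, here is a minimal sketch of scoring replicas by prefix-cache overlap and queue depth. It is not the Gateway API Inference Extension's scheduler: the `Endpoint` fields, the scoring formula, and the `queue_penalty` weight are hypothetical and chosen only to show the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    cached_prefix_tokens: int  # tokens of this prompt's prefix already in the replica's KV cache
    queue_depth: int           # requests currently waiting on the replica


def pick_endpoint(endpoints, prompt_tokens, queue_penalty=0.1):
    """Prefer replicas that already hold the prefix, unless their queue is too deep."""
    def score(ep):
        hit = min(ep.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return hit - queue_penalty * ep.queue_depth

    return max(endpoints, key=score)


replicas = [
    Endpoint("warm-busy", cached_prefix_tokens=900, queue_depth=8),
    Endpoint("warm-idle", cached_prefix_tokens=900, queue_depth=3),
    Endpoint("cold-idle", cached_prefix_tokens=0, queue_depth=0),
]
chosen = pick_endpoint(replicas, prompt_tokens=1000)
print(chosen.name)
```

Round-robin would ignore both signals. Here a warm replica wins only while its queue stays short enough that reusing the cached prefix still beats waiting.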
The cache hierarchy runs GPU HBM → host RAM → local NVMe → an optional remote KV backend. The remote tier is a profile-specific optimization rather than a universal dependency.
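The tier order can be sketched as a lookup that walks from the fastest tier to the slowest, with the remote backend switched on per profile. This is a toy model using in-memory dictionaries: the class name, the tier labels, and the promote-on-hit behavior are assumptions, not how LMCache or any other named runtime implements it.

```python
TIER_ORDER = ["gpu_hbm", "host_ram", "local_nvme", "remote_kv"]


class TieredKVCache:
    """Toy tiered store: each tier is a dict keyed by a prefix hash."""

    def __init__(self, use_remote=False):
        # The remote backend exists only for profiles that opt in.
        names = TIER_ORDER if use_remote else TIER_ORDER[:-1]
        self.tiers = {name: {} for name in names}  # dicts keep insertion order

    def put(self, key, block, tier="gpu_hbm"):
        self.tiers[tier][key] = block

    def get(self, key):
        """Return (block, tier_found), searching fastest tier first."""
        for name, store in self.tiers.items():
            if key in store:
                block = store[key]
                if name != "gpu_hbm":
                    # Assumed policy: promote a hit to the fastest tier.
                    del store[key]
                    self.tiers["gpu_hbm"][key] = block
                return block, name
        return None, None


cache = TieredKVCache(use_remote=False)
cache.put("prefix-abc", b"kv-block", tier="local_nvme")
first = cache.get("prefix-abc")   # found on NVMe, then promoted
second = cache.get("prefix-abc")  # now served from HBM
print(first[1], second[1])
```

A real deployment would also need eviction and capacity limits per tier, which this sketch leaves out.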
Most teams should not do this yet.
If a diagnostic shows managed providers already capture your reuse, Zumik will tell you to stay put. That is the whole point of measuring first.
BYOC, answered.
When is BYOC worth it?
For a dedicated latency SLO, concentrated hot-model volume, or strict isolation and purge requirements. Long prompts alone never justify it. The bar is a replay run showing a material gain.
What runtimes does Zumik use for BYOC?
A Dynamo profile (SGLang + FlashInfer + LMCache + Mooncake, on AIBrix) for datacenter KV orchestration, and a portable Kubernetes profile (llm-d + KServe + Gateway API Inference Extension).
Why not run BYOC by default?
Because managed-provider caching, batch tiers, and alias routing already capture most of the savings for typical workloads, and self-hosting carries real operational cost. BYOC is an escalation, not a starting point.
Can I run two routers at once?
No. One scheduler owns replica selection per profile. Zumik does not stack Dynamo Router and the Gateway API Inference Extension as simultaneous owners inside one path.
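A single-owner rule like this is easy to enforce at configuration time. The check below is a sketch under assumed names: the `routers` key and the router identifiers are hypothetical, not a real Zumik, Dynamo, or Gateway API schema.

```python
KNOWN_ROUTERS = {"dynamo_router", "gateway_api_inference_extension"}


def replica_selection_owner(profile):
    """Return the one router that owns replica selection, or raise if the rule is broken."""
    owners = [r for r in profile.get("routers", []) if r in KNOWN_ROUTERS]
    if len(owners) != 1:
        raise ValueError(
            f"profile {profile.get('name')!r} needs exactly one replica-selection "
            f"owner, found {len(owners)}: {owners}"
        )
    return owners[0]


dynamo_profile = {"name": "dynamo", "routers": ["dynamo_router"]}
stacked_profile = {
    "name": "stacked",
    "routers": ["dynamo_router", "gateway_api_inference_extension"],
}

print(replica_selection_owner(dynamo_profile))
try:
    replica_selection_owner(stacked_profile)
except ValueError as err:
    print("rejected:", err)
```

Rejecting the stacked profile up front avoids two schedulers making conflicting placement decisions on the same request path.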
Decide BYOC with replay evidence
Score the workload, prove the gain with replay, then move only what the evidence supports.