A cache-routing key tells the router to send requests that share a reusable prefix to the same backend, so the warm KV cache is actually on the node that gets the next request. Done well it is the single biggest lever on capture for self-hosted or dedicated lanes. Done carelessly it concentrates traffic until one node is on fire while the rest idle.
Both outcomes come from the same mechanism. The art is choosing the key.
Why the key helps
Without affinity, a load balancer spreads requests evenly, which is exactly wrong for cache reuse. The prefix you warmed on node A is useless when the next request lands on node B and recomputes everything. A routing key keyed on the prefix family fixes that: same family, same node, warm cache, real capture.
For workloads with strong recurrence and good retention locality, this is where a lot of the missed-opportunity gap closes. It is also the part managed providers handle for you, which is one reason their implicit caching often beats a naive self-hosted setup.
How it builds a hotspot
The danger is key skew. If one prefix family carries a disproportionate share of traffic, every one of those requests routes to a single node by design. That node saturates on compute, memory, or KV storage while peers sit idle, and tail latency on the hot path gets worse, not better. You bought a higher hit rate and a worse SLO at the same time.
It is the classic distributed-systems hot-key problem wearing an inference costume. A celebrity tenant, a default system prompt shared by most requests, or a single huge tool bundle can all become the hot key.
The mitigations that actually work
Shard the hot family across a small replica set and accept a slightly lower hit rate in exchange for spreading the load; a family that recurs constantly will re-warm fast on each replica. Cap the share of traffic any single key may pin before it spills to a second node. And measure concentration before you turn affinity on, because traffic concentration is one of the deployment-readiness signals for a reason.
The order matters. Turn on affinity for families with high recurrence and moderate concentration first. Leave the whale families on a replica set from day one.
When not to bother
If your recurrence is weak or your retention windows are short, routing keys add operational complexity for little gain, because the cache you so carefully pinned has already expired by the time the next compatible request arrives. Score the workload first. A routing key is an optimization for workloads that already have reuse to capture, not a way to manufacture it.