Fireworks AI prompt caching

Fireworks AI caches prompts automatically on both serverless and dedicated deployments. Here is how to capture the 74% cache-read discount on real agent traffic, and the mistakes that quietly erase it.

Cache-read discount: 74%
Write premium: none
Min cacheable tokens: 512

How it works

Fireworks serves open-weights models (DeepSeek, Llama, Kimi, Qwen, GLM, GPT-OSS) with prompt caching and speculative decoding. Dedicated deployments hold caches longer and give predictable latency, which makes them the bridge toward a full bring-your-own-cloud (BYOC) hot lane.

What Zumik observes

Fireworks is where managed serving meets BYOC. When a workload concentrates on one or two open-weights paths with strong locality, our replay runs frequently justify moving that lane to dedicated capacity or BYOC.

python - serverless vs dedicated
from openai import OpenAI

# Serverless caches with an idle window; a hot, reuse-heavy
# lane belongs on a dedicated deployment that holds the prefix warm.
client = OpenAI(base_url="https://api.zumik.ai/v1", api_key="zk_live_...")
prompt = "Summarize the open tickets."  # placeholder input for illustration
r = client.responses.create(model="deepseek-v4", input=prompt)

Pitfall

Running a hot, reuse-heavy lane on serverless and blaming the model for cold-cache latency, when a dedicated deployment would hold the prefix warm.

Capturing Fireworks AI caching.

  1. Order stable content first. Put system policy, tools, and durable context at the front of the prompt so the cacheable prefix is as long as possible.
  2. Avoid volatile content near the top. Keep timestamps, request ids, and per-call notes out of the prefix; they reset the match and drop the hit rate.
  3. Confirm the hit. Read the usage object for cached tokens to verify the prefix is being reused at the read rate.
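
The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Fireworks or Zumik implementation: `build_prompt`, `shared_prefix_chars`, and `cached_fraction` are hypothetical helper names, and the usage field names (`input_tokens`, `input_tokens_details.cached_tokens`) follow the OpenAI Responses-style usage object, which the endpoint you call may or may not mirror.

```python
import os


def build_prompt(stable_parts, volatile_parts):
    """Join stable content first and volatile content last."""
    return "\n\n".join(list(stable_parts) + list(volatile_parts))


def shared_prefix_chars(a, b):
    """Length in characters of the common leading run of two prompts."""
    return len(os.path.commonprefix([a, b]))


def cached_fraction(usage):
    """Share of input tokens served from cache, read from a usage dict.

    Assumption: OpenAI Responses-style field names. Adjust the keys if
    the endpoint reports cached tokens under a different name.
    """
    total = usage.get("input_tokens", 0)
    details = usage.get("input_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / total if total else 0.0


stable = ["SYSTEM POLICY: answer concisely.", "TOOLS: search, calculator"]

# Steps 1 and 2: stable content leads, per-call values trail.
good_a = build_prompt(stable, ["request_id=a1", "What is 2 + 2?"])
good_b = build_prompt(stable, ["request_id=b2", "What is 3 + 3?"])

# Anti-pattern: a per-call id at the very top breaks the match early.
bad_a = build_prompt(["request_id=a1"] + stable, ["What is 2 + 2?"])
bad_b = build_prompt(["request_id=b2"] + stable, ["What is 3 + 3?"])

print("shared prefix, stable first:", shared_prefix_chars(good_a, good_b))
print("shared prefix, id first:    ", shared_prefix_chars(bad_a, bad_b))

# Step 3: confirm the hit from a (made-up) usage payload.
example_usage = {"input_tokens": 2000, "input_tokens_details": {"cached_tokens": 1536}}
print("cached fraction:", cached_fraction(example_usage))
```

Character overlap is only a proxy for the token-level prefix the cache matches on, but it is enough to catch a volatile value that has crept to the top of a prompt.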

Fireworks AI caching, answered.

How does Fireworks AI prompt caching work?

Fireworks serves open-weights models (DeepSeek, Llama, Kimi, Qwen, GLM, GPT-OSS) with prompt caching and speculative decoding. Dedicated deployments hold caches longer and give predictable latency, which is the bridge toward a full BYOC hot lane.

What does Fireworks AI caching save?

Cache reads are about 74% cheaper than the list input price, and there is no write premium.
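
As a rough worked example of what that discount means per request, the sketch below blends cached and uncached input tokens. The 74% figure comes from this page; the $1.00 per million tokens list price and the token counts are made-up numbers for illustration, and `input_cost` is a hypothetical helper.

```python
CACHE_READ_DISCOUNT = 0.74  # cache-read discount quoted on this page


def input_cost(input_tokens, cached_tokens, price_per_mtok):
    """Input cost in dollars when some tokens are billed at the cache-read rate."""
    uncached_tokens = input_tokens - cached_tokens
    cached_price = price_per_mtok * (1 - CACHE_READ_DISCOUNT)
    total = uncached_tokens * price_per_mtok + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical request: 10,000 input tokens, 8,000 of them a reused prefix,
# at an assumed list price of $1.00 per million input tokens.
list_cost = input_cost(10_000, 0, 1.00)
cached_cost = input_cost(10_000, 8_000, 1.00)
savings = 1 - cached_cost / list_cost

print(f"list: ${list_cost:.5f}  with cache: ${cached_cost:.5f}  saved: {savings:.1%}")
```

With 80% of the input reused, the request saves about 59% of its input cost, not the full 74%: the discount applies only to the cached share, which is why a long stable prefix matters.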

What is the most common mistake?

Running a hot, reuse-heavy lane on serverless and blaming the model for cold-cache latency, when a dedicated deployment would hold the prefix warm.

How long does Fireworks AI keep a cache warm?

On serverless, the cache stays warm only for an idle window; dedicated deployments hold it longer.
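
To judge whether a lane's traffic fits inside an idle window, one option is to replay request timestamps and count how many arrive while the prefix would still be warm. This is a minimal sketch with loud assumptions: the page does not state the window length, so `idle_window_s=120` is a placeholder and not a Fireworks figure, and the model assumes every request restarts the window. `warm_fraction` is a hypothetical helper.

```python
def warm_fraction(arrival_times_s, idle_window_s):
    """Fraction of follow-up requests arriving within the idle window.

    Assumption: each request restarts the idle window, so a request is
    warm when the gap since the previous request is within the window.
    """
    times = sorted(arrival_times_s)
    if len(times) < 2:
        return 0.0
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    warm = sum(1 for gap in gaps if gap <= idle_window_s)
    return warm / len(gaps)


# Hypothetical arrival times in seconds for one lane.
arrivals = [0, 30, 70, 400, 420, 1200]
print(warm_fraction(arrivals, idle_window_s=120))
```

A lane with reusable prefixes but a low warm fraction is the case the pitfall above describes: the misses come from the idle gaps, not from the model.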

Capture Fireworks AI caching automatically.

Zumik places stable content first, captures the discount, and reports how much you actually reused.