Fireworks AI

Open-weights serving with speculative decoding and dedicated tiers.

Cache-read discount: 74%
Batch discount: 40%
Models on Zumik: 237
BYOK supported: Yes

How caching works here

Fireworks serves open-weights models (DeepSeek, Llama, Kimi, Qwen, GLM, GPT-OSS) with prompt caching and speculative decoding. Dedicated deployments hold caches longer and give predictable latency, which makes them a stepping stone toward a full BYOC hot lane.

What Zumik sees

Fireworks is where managed serving meets BYOC. When a workload concentrates on one or two open-weights paths with strong locality, our replay runs frequently justify moving that lane to dedicated capacity or BYOC.

Pitfall

Running a hot, reuse-heavy lane on serverless and blaming the model for cold-cache latency, when a dedicated deployment would hold the prefix warm.
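To make the pitfall concrete, here is a minimal sketch that estimates how often a lane would find its prefix still warm. It assumes, hypothetically, that a cached prefix is evicted after a fixed idle period and that every request refreshes it; the function name, the `idle_window_s` parameter, and the sample numbers are illustrative and are not Fireworks AI's documented retention policy.

```python
from typing import Sequence


def warm_hit_fraction(arrival_times_s: Sequence[float], idle_window_s: float) -> float:
    """Estimate the share of requests that arrive while the prefix is still warm.

    Assumption (hypothetical): the cache entry is dropped once no request has
    touched it for `idle_window_s` seconds, and each request refreshes it.
    """
    times = sorted(arrival_times_s)
    if len(times) < 2:
        return 0.0  # a lone request always pays the cold-cache cost
    # A request is warm when the gap to the previous request fits in the window.
    warm = sum(1 for prev, cur in zip(times, times[1:]) if cur - prev <= idle_window_s)
    return warm / len(times)


# Bursty lane: three quick requests, a long pause, then two more.
arrivals = [0, 20, 45, 900, 930]
print(warm_hit_fraction(arrivals, idle_window_s=60))    # short idle window
print(warm_hit_fraction(arrivals, idle_window_s=3600))  # longer-lived cache
```

Running this against real request timestamps with a few candidate windows gives a rough sense of whether cold-cache latency comes from the traffic pattern rather than from the model.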

Profile
Min cache size: 512 tok
Retention: Serverless idle window; dedicated holds longer
Service tiers: serverless, dedicated
BYOC: Available
open-weights, high-volume, BYOC migration candidates
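The 512-token minimum suggests a quick check on whether two consecutive requests share enough leading tokens to be cache candidates. The sketch below assumes caching is keyed on an exact shared token prefix, which this page does not spell out; the helper names are hypothetical and the integer lists stand in for real token IDs.

```python
from typing import Sequence

MIN_CACHE_TOKENS = 512  # minimum cache size listed in the profile above


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(prev: Sequence[int], cur: Sequence[int],
                    min_tokens: int = MIN_CACHE_TOKENS) -> int:
    """Return the shared prefix length, or 0 if it is below the minimum.

    Assumption: a prefix shorter than `min_tokens` is never served from cache.
    """
    n = shared_prefix_len(prev, cur)
    return n if n >= min_tokens else 0


# Stand-in token IDs: a 600-token stable system prompt, then per-request content.
system_prompt = list(range(600))
request_1 = system_prompt + [9001, 9002]
request_2 = system_prompt + [9003]
print(reusable_tokens(request_1, request_2))

# A 100-token shared prefix falls under the minimum, so nothing is reusable.
short_prompt = list(range(100))
print(reusable_tokens(short_prompt + [1], short_prompt + [2]))
```

The practical reading is to keep stable content (system prompt, tool definitions, long documents) at the front of the prompt and per-request content at the end.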

Fireworks AI models in the catalog.

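| Model | Context | Input | Output | Cache read | Reuse-adj |
|---|---|---|---|---|---|
| Qwen3 4B | 128K | $0.03 | $0.03 | $0.03 | $0.03 |
| Gemma 3 12B Instruct | 131K | $0.05 | $0.10 | $0.05 | $0.06 |
| OpenAI gpt-oss-20b | 131K | $0.07 | $0.30 | $0.04 (50% off) | $0.11 |
| OpenAI gpt-oss-safeguard-20b | 131K | $0.07 | $0.30 | $0.04 (51% off) | $0.12 |
| Qwen3-VL-8B-Instruct | 131K | $0.08 | $0.50 | $0.08 | $0.18 |
| MythoMax L2 13B | 4K | $0.09 | $0.09 | $0.09 | $0.09 |
| Qwen3 235B A22B Instruct 2507 | 131K | $0.09 | $0.58 | $0.09 | $0.21 |
| Qwen3 30B-A3B | 41K | $0.09 | $0.45 | $0.09 | $0.18 |
| Devstral-Small-2505 | 128K | $0.10 | $0.30 | $0.10 | $0.15 |
| Qwen3 30B A3B Instruct 2507 | 128K | $0.10 | $0.30 | $0.01 (90% off) | $0.11 |
| Gemma 3 27B Instruct | 98K | $0.12 | $0.20 | $0.12 | $0.14 |
| DeepSeek-V4-Flash | 1M | $0.14 | $0.28 | $0.03 (79% off) | $0.13 |
| DeepSeek R1 Distill Qwen 14B | 33K | $0.15 | $0.15 | $0.15 | $0.15 |
| OpenAI gpt-oss-120b | 131K | $0.15 | $0.60 | $0.01 (90% off) | $0.21 |
| Qwen3 8B | 131K | $0.18 | $0.70 | $0.18 | $0.31 |
| Qwen3 VL 30B A3B Instruct | 131K | $0.20 | $0.70 | $0.20 | $0.33 |
| Qwen3 VL 30B A3B Thinking | 131K | $0.20 | $1.00 | $0.20 | $0.40 |
| Qwen3 Omni 30B A3B Instruct | 66K | $0.25 | $0.97 | $0.25 | $0.43 |
| Deepseek V3 03-24 | 164K | $0.27 | $1.12 | $0.14 (50% off) | $0.43 |
| DeepSeek R1 Distill Qwen 32B | 64K | $0.30 | $0.30 | $0.30 | $0.30 |
| MiniMax M2.7 | 197K | $0.30 | $1.20 | $0.06 (80% off) | $0.43 |
| Minimax M3 | 512K | $0.30 | $1.20 | $0.06 (80% off) | $0.43 |
| MiniMax-M2 | 205K | $0.30 | $1.20 | $0.03 (90% off) | $0.41 |
| NVIDIA Nemotron 3 Super 120B A12B BF16 | 256K | $0.30 | $0.90 | $0.30 | $0.45 |
| NVIDIA Nemotron 3 Super 120B A12B FP8 | 256K | $0.30 | $0.90 | $0.30 | $0.45 |
| Qwen3 235B A22B Thinking 2507 | 131K | $0.30 | $3.00 | $0.30 | $0.97 |
| Qwen3 VL 235B A22B Instruct | 131K | $0.30 | $1.50 | $0.30 | $0.60 |
| Qwen3 14B | 131K | $0.35 | $1.40 | $0.35 | $0.61 |
| Qwen3 Coder 30B A3B Instruct | 262K | $0.45 | $2.25 | $0.45 | $0.90 |
| Deepseek R1 05/28 | 164K | $0.50 | $2.15 | $0.35 (30% off) | $0.85 |
| Kimi K2 Instruct | 131K | $0.57 | $2.30 | $0.57 | $1.00 |
| Kimi K2 Instruct 0905 | 131K | $0.57 | $2.30 | $0.57 | $1.00 |
| Kimi K2 Thinking | 262K | $0.60 | $2.50 | $0.15 (75% off) | $0.89 |
| Qwen3 235B A22B | 131K | $0.70 | $2.80 | $0.70 | $1.22 |
| Qwen3 32B | 131K | $0.70 | $2.80 | $0.70 | $1.22 |
| DeepSeek R1 Distill Llama 70B | 8K | $0.80 | $0.80 | $0.80 | $0.80 |
| Kimi K2.6 | 262K | $0.95 | $4.00 | $0.16 (83% off) | $1.39 |
| Kimi K2.7 Code | 262K | $0.95 | $4.00 | $0.19 (80% off) | $1.40 |
| Qwen3 VL 235B A22B Thinking | 131K | $0.98 | $3.95 | $0.98 | $1.72 |
| GLM-5 | 205K | $1.00 | $3.20 | $0.20 (80% off) | $1.22 |
| DeepSeek V3 | 131K | $1.25 | $1.25 | $1.25 | $1.25 |
| GLM 5.1 | 203K | $1.40 | $4.40 | $0.26 (81% off) | $1.68 |
| Qwen3 Coder 480B A35B Instruct | 262K | $1.50 | $7.50 | $1.50 | $3.00 |
| DeepSeek-V4-Pro | 1M | $1.74 | $3.48 | $0.14 (92% off) | $1.52 |
| DeepSeek R1 (Fast) | 164K | $3.00 | $7.00 | $3.00 | $4.00 |
| Chronos Hermes 13B v2 | 4K | | | | |
| Code Llama 13B | 16K | | | | |
| Code Llama 13B Instruct | 16K | | | | |
| Code Llama 13B Python | 16K | | | | |
| Code Llama 34B | 16K | | | | |
| Code Llama 34B Instruct | 16K | | | | |
| Code Llama 34B Python | 16K | | | | |
| Code Llama 70B | 4K | | | | |
| Code Llama 70B Instruct | 4K | | | | |
| Code Llama 70B Python | 4K | | | | |
| Code Llama 7B | 16K | | | | |
| Code Llama 7B Instruct | 16K | | | | |
| CodeGemma 2B | 8K | | | | |
| CodeGemma 7B | 8K | | | | |
| CodeQwen 1.5 7B | 66K | | | | |
| Cogito v1 Preview Llama 3B | 131K | | | | |
| Cogito v1 Preview Llama 70B | 131K | | | | |
| Cogito v1 Preview Llama 8B | 131K | | | | |
| Cogito v1 Preview Qwen 14B | 131K | | | | |
| Cogito v1 Preview Qwen 32B | 131K | | | | |
| DeepSeek Coder 1.3B Base | 16K | | | | |
| DeepSeek Coder 33B Instruct | 16K | | | | |
| DeepSeek Coder 7B Base | 4K | | | | |
| DeepSeek Coder 7B Base v1.5 | 4K | | | | |
| DeepSeek Coder 7B Instruct v1.5 | 4K | | | | |
| DeepSeek Coder V2 Lite Base | 164K | | | | |
| DeepSeek Coder V2 Lite Instruct | 164K | | | | |
| DeepSeek Prover V2 | 164K | | | | |
| DeepSeek R1 (Basic) | 164K | | | | |
| DeepSeek R1 0528 Distill Qwen3 8B | 131K | | | | |
| DeepSeek R1 Distill Llama 8B | 131K | | | | |
| DeepSeek R1 Distill Qwen 1.5B | 131K | | | | |
| DeepSeek R1 Distill Qwen 7B | 131K | | | | |
| DeepSeek V2 Lite Chat | 164K | | | | |
| DeepSeek V2.5 | 33K | | | | |
| DeepSeek V3.1 | 164K | | | | |
| DeepSeek V3.1 Terminus | 164K | | | | |
| Deepseek v3.2 | 164K | | | | |
| Dolphin 2.6 Mixtral 8x7b | 33K | | | | |
| Dolphin 2.9.2 Qwen2 72B | 131K | | | | |
| ERNIE-4.5-21B-A3B-PT | 131K | | | | |
| FARE-20B | 131K | | | | |
| Firesearch OCR V6 | 8K | | | | |
| Gemma 2 9B Instruct | 8K | | | | |
| Gemma 2B Instruct | 8K | | | | |
| Gemma 3 4B Instruct | 131K | | | | |
| Gemma 4 31B IT NVFP4 | 262K | | | | |
| Gemma 4 E4B | 131K | | | | |
| Gemma 7B | 8K | | | | |
| Gemma 7B Instruct | 8K | | | | |
| GLM-4.5 | 131K | | | | |
| GLM-4.5-Air | 131K | | | | |
| GLM-4.5V | 131K | | | | |
| GLM-4.6 | 203K | | | | |
| GLM-4.7 | 203K | | | | |
| GLM-4.7 Flash | 203K | | | | |
| Hermes 2 Pro Mistral 7B | 33K | | | | |
| InternVL3 38B | 16K | | | | |
| InternVL3 78B | 16K | | | | |
| InternVL3 8B | 16K | | | | |
| KAT Dev 32B | 131K | | | | |
| KAT Dev 72B Exp | 131K | | | | |
| Kimi K2.5 | 262K | | | | |
| Llama 2 13B | 4K | | | | |
| Llama 2 13B Chat | 4K | | | | |
| Llama 2 70B | 4K | | | | |
| Llama 2 7B | 4K | | | | |
| Llama 2 7B Chat | 4K | | | | |
| Llama 3 70B Instruct | 8K | | | | |
| Llama 3 70B Instruct (HF version) | 8K | | | | |
| Llama 3 8B | 8K | | | | |
| Llama 3 8B Instruct | 8K | | | | |
| Llama 3 8B Instruct (HF version) | 8K | | | | |
| Llama 3.1 405B Instruct | 131K | | | | |
| Llama 3.1 70B Instruct | 131K | | | | |
| Llama 3.1 8B Instruct | 131K | | | | |
| Llama 3.1 Nemotron 70B | 131K | | | | |
| Llama 3.2 11B Vision Instruct | 131K | | | | |
| Llama 3.2 1B | 131K | | | | |
| Llama 3.2 1B Instruct | 131K | | | | |
| Llama 3.2 3B | 131K | | | | |
| Llama 3.2 3B Instruct | 131K | | | | |
| Llama 3.2 90B Vision Instruct | 131K | | | | |
| Llama 3.3 70B Instruct | 131K | | | | |
| Llama 4 Maverick Instruct (Basic) | 1M | | | | |
| Llama 4 Scout Instruct (Basic) | 1M | | | | |
| Llama Guard 3 8B | 131K | | | | |
| Llama Guard 7B | 4K | | | | |
| Llama Guard v2 8B | 8K | | | | |
| Llama Guard v3 1B | 131K | | | | |
| MiniMax-M2.1 | 197K | | | | |
| MiniMax-M2.5 | 197K | | | | |
| Ministral 3 14B Instruct 2512 | 256K | | | | |
| Ministral 3 3B Instruct 2512 | 256K | | | | |
| Ministral 3 8B Instruct 2512 | 256K | | | | |
| MiroThinker-1.7 | 262K | | | | |
| Mistral 7B | 33K | | | | |
| Mistral 7B Instruct v0.2 | 33K | | | | |
| Mistral 7B Instruct v0.3 | 33K | | | | |
| Mistral 7B OpenOrca | 33K | | | | |
| Mistral 7B v0.2 | 33K | | | | |
| Mistral Large 3 675B Instruct 2512 | 256K | | | | |
| Mistral Nemo Base 2407 | 128K | | | | |
| Mistral Nemo Instruct 2407 | 128K | | | | |
| Mistral Small 24B Instruct 2501 | 33K | | | | |
| Mixtral 8x7B v0.1 | 33K | | | | |
| Mixtral Moe 8x22B | 66K | | | | |
| Mixtral MoE 8x22B Instruct | 66K | | | | |
| Mixtral MoE 8x7B Instruct | 33K | | | | |
| Mixtral MoE 8x7B Instruct (HF version) | 33K | | | | |
| Molmo2-4B | 37K | | | | |
| Molmo2-8B | 37K | | | | |
| Nous Capybara 7B V1.9 | 33K | | | | |
| Nous Hermes Llama2 13B | 4K | | | | |
| Nous Hermes Llama2 70B | 4K | | | | |
| Nous Hermes Llama2 7B | 4K | | | | |
| Nouse Hermes 2 Mixtral 8x7B DPO | 33K | | | | |
| NVIDIA Nemotron 3 Nano Omni 30B A3B | 262K | | | | |
| NVIDIA Nemotron 3 Super 120B A12B NVFP4 | 262K | | | | |
| NVIDIA Nemotron 3 Ultra BF16 | 262K | | | | |
| NVIDIA Nemotron 3 Ultra NVFP4 | 262K | | | | |
| NVIDIA Nemotron Nano 12B v2 | 128K | | | | |
| NVIDIA Nemotron Nano 2 VL | 131K | | | | |
| NVIDIA Nemotron Nano 9B v2 | 128K | | | | |
| OpenAI gpt-oss-safeguard-120b | 131K | | | | |
| OpenChat 3.5 0106 | 8K | | | | |
| OpenHermes 2 Mistral 7B | 33K | | | | |
| OpenHermes 2.5 Mistral 7B | 33K | | | | |
| Phi-3 Mini 128k Instruct | 131K | | | | |
| Phi-3.5 Vision Instruct | 32K | | | | |
| Phind CodeLlama 34B Python v1 | 16K | | | | |
| Phind CodeLlama 34B v1 | 16K | | | | |
| Phind CodeLlama 34B v2 | 16K | | | | |
| Pythia 12B | 2K | | | | |
| Qwen 3 4B Instruct 2507 | 262K | | | | |
| Qwen 3.5 122B A10B | 262K | | | | |
| Qwen 3.5 35B A3B | 262K | | | | |
| Qwen QWQ 32B Preview | 33K | | | | |
| Qwen1.5 72B Chat | 33K | | | | |
| Qwen2 72B Instruct | 33K | | | | |
| Qwen2 7B Instruct | 33K | | | | |
| Qwen2-VL 2B Instruct | 33K | | | | |
| Qwen2-VL 72B Instruct | 33K | | | | |
| Qwen2-VL 7B Instruct | 33K | | | | |
| Qwen2.5 0.5B Instruct | 33K | | | | |
| Qwen2.5 1.5B Instruct | 33K | | | | |
| Qwen2.5 14B | 131K | | | | |
| Qwen2.5 14B Instruct | 33K | | | | |
| Qwen2.5 14B Instruct | 33K | | | | |
| Qwen2.5 32B | 131K | | | | |
| Qwen2.5 32B Instruct | 33K | | | | |
| Qwen2.5 72B | 131K | | | | |
| Qwen2.5 72B Instruct | 33K | | | | |
| Qwen2.5 7B | 131K | | | | |
| Qwen2.5 7B | 131K | | | | |
| Qwen2.5 7B Instruct | 33K | | | | |
| Qwen2.5-Coder 0.5B | 33K | | | | |
| Qwen2.5-Coder 0.5B Instruct | 33K | | | | |
| Qwen2.5-Coder 1.5B | 33K | | | | |
| Qwen2.5-Coder 1.5B Instruct | 33K | | | | |
| Qwen2.5-Coder 14B | 33K | | | | |
| Qwen2.5-Coder 14B Instruct | 33K | | | | |
| Qwen2.5-Coder 32B | 33K | | | | |
| Qwen2.5-Coder 32B Instruct | 33K | | | | |
| Qwen2.5-Coder 32B Instruct 128K | 131K | | | | |
| Qwen2.5-Coder 32B Instruct 32K RoPE | 33K | | | | |
| Qwen2.5-Coder 32B Instruct 64k | 66K | | | | |
| Qwen2.5-Coder 3B | 33K | | | | |
| Qwen2.5-Coder 3B Instruct | 33K | | | | |
| Qwen2.5-Coder 7B | 33K | | | | |
| Qwen2.5-Coder 7B Instruct | 33K | | | | |
| Qwen2.5-Math 72B Instruct | 4K | | | | |
| Qwen2.5-VL 32B Instruct | 128K | | | | |
| Qwen2.5-VL 3B Instruct | 128K | | | | |
| Qwen2.5-VL 72B Instruct | 128K | | | | |
| Qwen2.5-VL 7B Instruct | 128K | | | | |
| Qwen3 0.6B | 41K | | | | |
| Qwen3 1.7B | 131K | | | | |
| Qwen3 30B A3B Thinking 2507 | 262K | | | | |
| Qwen3 Coder 480B Instruct BF16 | 262K | | | | |
| Qwen3.5 27B | 262K | | | | |
| Qwen3.5 397B A17B | 262K | | | | |
| Qwen3.5 9B | 262K | | | | |
| Qwen3.6 27B | 262K | | | | |
| Qwen3.6-35B-A3B | 262K | | | | |
| QWQ 32B | 131K | | | | |
| Rolm OCR | 128K | | | | |
| Seed OSS 36B Instruct | 524K | | | | |
| Snorkel Mistral PairRM DPO | 33K | | | | |
| Step-3.7-Flash-NVFP4 | 262K | | | | |
| Toppy M 7B | 33K | | | | |
| Zephyr 7B Beta | 33K | | | | |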

Fireworks AI, answered.

How does Fireworks AI prompt caching work?

Fireworks serves open-weights models (DeepSeek, Llama, Kimi, Qwen, GLM, GPT-OSS) with prompt caching and speculative decoding. Dedicated deployments hold caches longer and give predictable latency, which makes them a stepping stone toward a full BYOC hot lane.

What discount does Fireworks AI caching give?

Cache reads on Fireworks AI are about 74% cheaper than the list input price.
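What that discount is worth depends on how much of your input is actually served from cache. The sketch below blends the cached and uncached input prices by an assumed hit rate. The 60% hit rate and the $1.00 list price are illustrative inputs, not measured figures, and the function is not the formula behind any number shown on this page.

```python
def effective_input_price(list_price: float, cache_discount: float, hit_rate: float) -> float:
    """Blend cached and uncached input prices by the share of cached tokens.

    list_price     -- list input price, in whatever unit the price list uses
    cache_discount -- fractional discount on cache reads (0.74 for 74%)
    hit_rate       -- assumed fraction of input tokens read from cache (0..1)
    """
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_price = list_price * (1.0 - cache_discount)
    return hit_rate * cached_price + (1.0 - hit_rate) * list_price


# Illustrative inputs: $1.00 list price, 74% discount, 60% of input tokens cached.
print(round(effective_input_price(1.00, 0.74, 0.60), 3))
```

With no cache hits the function returns the list price unchanged, and with every token cached it returns the fully discounted price, so real savings fall between the two.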

Does Fireworks AI support BYOK on Zumik?

Yes. You can bring your own Fireworks AI key, and provider-native caching, batch, and service tiers stay active under your account.

What is the common Fireworks AI caching mistake?

Running a hot, reuse-heavy lane on serverless and blaming the model for cold-cache latency, when a dedicated deployment would hold the prefix warm.

Route Fireworks AI the smart way.

Capture Fireworks AI's 74% cache-read discount and batch tier automatically through Zumik.