Measure first. Escalate only when the evidence earns it.
There is one sensible path through inference optimization, and it does not start with buying GPUs. It starts with a free scan, follows the evidence, and reaches BYOC only if a replay proves the case.
One path, five stops, no shortcuts.
1. Free workload scan
No payment method
A metadata-only analysis with its own hard cap. It returns a Workload Reuse Score and the reuse waterfall so you can see whether there is anything to chase before spending a dollar.
2. Paid diagnostic
Fixed engagement
The full Agent Workload Efficiency Diagnostic: prefix families, retention windows, provider-fit matrix, prompt-layout recommendations, and the lowest-complexity next step, each with an evidence level.
3. Managed optimization pilot
Monthly plus usage
We apply provider-native caching, batch tiers, and alias routing on a live lane and report the realized capture, not the predicted one.
4. BYOK where it helps
Your keys, our routing
Bring your own provider keys when contracts or data terms call for it, with the same reproducible routing and receipts.
5. BYOC or hybrid pilot
Only when replay proves it
Self-hosted KV orchestration is an escalation. We run the managed baseline against the BYOC candidate on your captured traffic and only recommend it if the signed report says it wins.
The diagnostic is built to be useful even when it concludes BYOC is unnecessary. Most workloads capture most of their reuse on managed providers once the prompt layout is fixed and a batch lane is in place.
The waterfall is the whole point.
It separates what could be reused from what actually was. The gap between candidate and realized tokens is the missed opportunity, and it usually points straight at prompt ordering, not hardware.
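As a rough illustration of the arithmetic behind the waterfall, here is a minimal sketch. The class and field names are hypothetical, the token counts are made up, and the assumption that each stage is a subset of the one before it is ours; the only relationship taken from the text above is that missed opportunity is the gap between candidate and realized tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseWaterfall:
    """Token counts for one workload. Names are illustrative, not a real schema."""
    total_input: int
    eligible: int
    candidate: int
    realized: int

    def __post_init__(self) -> None:
        # Assumption: each stage is a subset of the previous one.
        if not (self.total_input >= self.eligible >= self.candidate >= self.realized >= 0):
            raise ValueError("waterfall stages must be non-increasing and non-negative")

    @property
    def missed(self) -> int:
        # Missed opportunity: tokens that could have been reused but were not.
        return self.candidate - self.realized

    @property
    def capture_rate(self) -> float:
        # Share of candidate tokens that were actually reused.
        return self.realized / self.candidate if self.candidate else 0.0


# Made-up numbers, purely for illustration.
w = ReuseWaterfall(total_input=1_000_000, eligible=800_000, candidate=600_000, realized=150_000)
print(w.missed, f"{w.capture_rate:.0%}")  # 450000 25%
```

A low capture rate next to a large candidate count is the pattern that suggests looking at prompt layout before anything else.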
A repeated prompt is not a cache hit
Feed it what you already have.
Metadata-only is the default and enough to compute opportunity. Send richer inputs and the evidence level on each measurement climbs.
Try the estimators first
Both run in your browser. They will not replace a diagnostic, but they sharpen the questions you bring to one.
What the paid diagnostic returns.
Executive report
Savings range, latency opportunity, and the recommended execution profile.
Engineering report
Prefix families, branching behavior, retention windows, and missed reuse.
Workload Reuse Score
Overall score with the six-component breakdown.
Reuse waterfall
Total input, eligible, candidate, realized, and missed-opportunity tokens.
Provider-fit matrix
Managed-provider features, BYOK implications, and any BYOC justification.
Prompt-layout recommendations
Stable-prefix opportunities and serialization fixes.
Alias reproducibility audit
Model-resolution drift and versioning gaps.
Purge-guarantee map
Profile-specific retention and evidence limits.
Replay package
Baseline manifest and candidate experiments.
Pilot plan
The smallest safe optimization sequence to run next.
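To make the "stable-prefix opportunities and serialization fixes" deliverable above more concrete, here is a small sketch of the kind of change such a recommendation might describe. It assumes prefix-based caching, where reuse depends on the leading part of the prompt being identical between calls. The function names, prompt layout, and `request_id` field are hypothetical, and shared characters stand in as a crude proxy for shared tokens.

```python
import json
import os


def build_prompt(system: str, tools: dict, user_msg: str, request_id: str, *, stable: bool) -> str:
    """Assemble a prompt string. The layout is illustrative only."""
    if stable:
        # Stable content first, deterministic key order, volatile fields last.
        tools_text = json.dumps(tools, sort_keys=True, separators=(",", ":"))
        return f"{system}\n{tools_text}\nrequest_id={request_id}\n{user_msg}"
    # Volatile field first and insertion-order serialization:
    # the prefix differs on every call even though the content repeats.
    return f"request_id={request_id}\n{system}\n{json.dumps(tools)}\n{user_msg}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring (characters, not tokens)."""
    return len(os.path.commonprefix([a, b]))


system = "You are a support agent."
# Same tool definitions, built in a different key order by two callers.
tools_a = {"search": {"limit": 5}, "fetch": {"timeout": 30}}
tools_b = {"fetch": {"timeout": 30}, "search": {"limit": 5}}

unstable_len = shared_prefix_len(
    build_prompt(system, tools_a, "Where is my order?", "r-001", stable=False),
    build_prompt(system, tools_b, "Cancel my plan.", "r-002", stable=False),
)
stable_len = shared_prefix_len(
    build_prompt(system, tools_a, "Where is my order?", "r-001", stable=True),
    build_prompt(system, tools_b, "Cancel my plan.", "r-002", stable=True),
)
print(unstable_len, stable_len)
```

Both layouts carry the same repeated content; only the second keeps it in a position and form where a prefix match is possible.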
The diagnostic, answered.
Does the free scan require a credit card?
No. The free workload scan requires no payment method and is metered separately with its own hard cap. It exists so the first conversation is about evidence, not commitment.
What do I have to send?
Whatever you already have: provider usage exports, metadata traces, SDK traces, prompt templates, bills, or latency dashboards. Metadata-only is enough to compute opportunity without storing prompt content.
What if the answer is that BYOC is unnecessary?
Then the diagnostic did its job. It is designed to be valuable even when it concludes you should stay on managed providers; most workloads capture most of their reuse there.
How is BYOC decided?
By replay. We run the managed baseline against the BYOC candidate on your captured traffic and read the signed report. A measured cost or latency win is a decision; a hunch is not.
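The decision rule in that answer can be sketched as a simple comparison of two replay runs over the same captured traffic. This is an illustration under stated assumptions, not the actual report logic: the `ReplayResult` shape, the use of median latency, and the 10% `min_gain` threshold are all hypothetical, and signing of the report is not modeled.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ReplayResult:
    """Measured outcome of replaying captured traffic. Fields are illustrative."""
    cost_usd: float
    latencies_ms: list[float]


def recommend_byoc(baseline: ReplayResult, candidate: ReplayResult, min_gain: float = 0.10) -> bool:
    """Recommend BYOC only if the replay shows a measured cost or latency win.

    min_gain is a hypothetical threshold: the relative improvement required
    before a difference counts as a win rather than noise.
    """
    base_latency = median(baseline.latencies_ms)
    if baseline.cost_usd <= 0 or base_latency <= 0:
        raise ValueError("baseline cost and latency must be positive")
    cost_gain = (baseline.cost_usd - candidate.cost_usd) / baseline.cost_usd
    latency_gain = (base_latency - median(candidate.latencies_ms)) / base_latency
    return cost_gain >= min_gain or latency_gain >= min_gain


# Made-up measurements for the same captured traffic.
managed = ReplayResult(cost_usd=120.0, latencies_ms=[900.0, 1100.0, 1000.0])
byoc_marginal = ReplayResult(cost_usd=115.0, latencies_ms=[880.0, 1050.0, 990.0])
byoc_clear = ReplayResult(cost_usd=90.0, latencies_ms=[880.0, 1050.0, 990.0])

print(recommend_byoc(managed, byoc_marginal))  # False
print(recommend_byoc(managed, byoc_clear))     # True
```

The marginal candidate improves cost by about 4% and median latency by 1%, which falls under the assumed threshold, so the sketch stays on the managed baseline.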
Start with a free scan.
No payment method, metadata only. Get a Workload Reuse Score and the reuse waterfall, then decide what is worth paying to fix.