Measure first. Escalate only when the evidence earns it.

There is one sensible path through inference optimization, and it does not start with buying GPUs. It starts with a free scan, follows the evidence, and reaches BYOC only if a replay proves the case.

One path, five stops, no shortcuts.

1. Free workload scan

No payment method

A metadata-only analysis with its own hard cap. It returns a Workload Reuse Score and the reuse waterfall so you can see whether there is anything to chase before spending a dollar.

2. Paid diagnostic

Fixed engagement

The full Agent Workload Efficiency Diagnostic: prefix families, retention windows, provider-fit matrix, prompt-layout recommendations, and the lowest-complexity next step, each with an evidence level.

3. Managed optimization pilot

Monthly plus usage

We apply provider-native caching, batch tiers, and alias routing on a live lane and report the realized capture, not the predicted one.

4. BYOK where it helps

Your keys, our routing

Bring your own provider keys when contracts or data terms call for it, with the same reproducible routing and receipts.

5. BYOC or hybrid pilot

Only when replay proves it

Self-hosted KV orchestration is an escalation. We run the managed baseline against the BYOC candidate on your captured traffic and only recommend it if the signed report says it wins.

Principle

The diagnostic is built to be useful even when it concludes BYOC is unnecessary. Most workloads capture most of their reuse on managed providers once the prompt layout is fixed and a batch lane is in place.

What it looks like

The waterfall is the whole point.

It separates what could be reused from what actually was. The gap between candidate and realized tokens is the missed opportunity, and it usually points straight at prompt ordering, not hardware.

A repeated prompt is not a cache hit

Reuse waterfall, per request

Total input tokens: 100%
Eligible reuse: 78%
Candidate reuse: 66%
Realized reuse: 41%
Missed opportunity: 25%
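
The arithmetic behind the waterfall is simple enough to sketch. The snippet below is an illustration only, not the diagnostic's actual implementation: the class name, field names, and the 10,000-token request are assumptions chosen to reproduce the percentages shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseWaterfall:
    """Per-request token counts for each waterfall stage (hypothetical names)."""

    total_input: int
    eligible: int
    candidate: int
    realized: int

    def __post_init__(self) -> None:
        # Each stage is a subset of the stage above it.
        if self.total_input <= 0:
            raise ValueError("total_input must be positive")
        if not (self.total_input >= self.eligible >= self.candidate >= self.realized >= 0):
            raise ValueError("stages must be non-negative and non-increasing")

    @property
    def missed(self) -> int:
        # Missed opportunity: candidate tokens that were never actually reused.
        return self.candidate - self.realized

    def percent_of_input(self) -> dict[str, int]:
        # Express every stage as a rounded share of total input tokens.
        stages = {
            "eligible": self.eligible,
            "candidate": self.candidate,
            "realized": self.realized,
            "missed": self.missed,
        }
        return {name: round(100 * tokens / self.total_input) for name, tokens in stages.items()}


# Assumed example: a 10,000-token request matching the figures above.
example = ReuseWaterfall(total_input=10_000, eligible=7_800, candidate=6_600, realized=4_100)
print(example.missed)
print(example.percent_of_input())
```

The missed-opportunity figure is derived, not measured: it is candidate reuse minus realized reuse, which is why 66% and 41% leave 25% on the table.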

Inputs

Feed it what you already have.

Metadata-only is the default and enough to compute opportunity. Send richer inputs and the evidence level on each measurement climbs.

Provider usage exports
Proxy-captured metadata traces
SDK traces
Tokenized trace bundles
Encrypted full-fidelity samples
Existing prompt templates
Provider bills
Latency dashboards and SLOs

Try the estimators first

Both run in your browser. They will not replace a diagnostic, but they sharpen the questions you bring to one.

What the paid diagnostic returns.

Executive report

Savings range, latency opportunity, and the recommended execution profile.

Engineering report

Prefix families, branching behavior, retention windows, and missed reuse.

Workload Reuse Score

Overall score with the six-component breakdown.

Reuse waterfall

Total input, eligible, candidate, realized, and missed-opportunity tokens.

Provider-fit matrix

Managed-provider features, BYOK implications, and any BYOC justification.

Prompt-layout recommendations

Stable-prefix opportunities and serialization fixes.

Alias reproducibility audit

Model-resolution drift and versioning gaps.

Purge-guarantee map

Profile-specific retention and evidence limits.

Replay package

Baseline manifest and candidate experiments.

Pilot plan

The smallest safe optimization sequence to run next.
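
To make the prompt-layout items above concrete, here is a minimal sketch of what a stable-prefix layout and a serialization fix can look like. It assumes the provider caches on an exact shared prefix, and it uses character-level prefix length as a rough stand-in for tokens. The function names, system prompt, and tool definitions are all hypothetical.

```python
import json

SYSTEM = "You are a support agent."


def render_stable_first(system: str, tools: list[dict], user_turn: str) -> str:
    # Serialization fix: sort tools and keys so identical tool sets
    # always serialize to byte-identical text.
    tools_json = json.dumps(
        sorted(tools, key=lambda t: t["name"]), sort_keys=True, separators=(",", ":")
    )
    # Layout fix: stable content first, volatile user turn last.
    return f"{system}\n{tools_json}\n{user_turn}"


def render_volatile_first(system: str, tools: list[dict], user_turn: str) -> str:
    # Anti-pattern for comparison: the part that changes every request leads.
    return f"{user_turn}\n{system}\n{json.dumps(tools)}"


def shared_prefix_len(a: str, b: str) -> int:
    # Count leading characters the two prompts have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Same tools, different list order and key order, as two callers might send them.
tools_a = [
    {"name": "search", "description": "Search docs"},
    {"name": "lookup", "description": "Look up an order"},
]
tools_b = [
    {"description": "Look up an order", "name": "lookup"},
    {"description": "Search docs", "name": "search"},
]

turn_a, turn_b = "What is 2+2?", "Summarize this."
stable_a = render_stable_first(SYSTEM, tools_a, turn_a)
stable_b = render_stable_first(SYSTEM, tools_b, turn_b)
volatile_a = render_volatile_first(SYSTEM, tools_a, turn_a)
volatile_b = render_volatile_first(SYSTEM, tools_b, turn_b)

print(shared_prefix_len(stable_a, stable_b), "shared characters, stable-first")
print(shared_prefix_len(volatile_a, volatile_b), "shared characters, volatile-first")
```

With the stable-first layout, everything up to the user turn is shared between the two requests. With the volatile-first layout, nothing is.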

The diagnostic, answered.

Does the free scan require a credit card?

No. The free workload scan requires no payment method and is metered separately with its own hard cap. It exists so the first conversation is about evidence, not commitment.

What do I have to send?

Whatever you already have: provider usage exports, metadata traces, SDK traces, prompt templates, bills, or latency dashboards. Metadata-only is enough to compute opportunity without storing prompt content.

What if the answer is that BYOC is unnecessary?

Then the diagnostic did its job. It is designed to be valuable even when it concludes you should stay on managed providers; most workloads capture most of their reuse there.

How is BYOC decided?

By replay. We run the managed baseline against the BYOC candidate on your captured traffic and read the signed report. A measured cost or latency win is a decision; a hunch is not.
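
As a rough sketch of that decision rule: the comparison below is illustrative, not the logic behind the signed report. The `ReplayResult` fields, the 10% minimum gain, and the no-regression condition are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayResult:
    """Measured outcome of replaying the same captured traffic (hypothetical fields)."""

    label: str
    cost_usd: float
    p95_latency_ms: float


def byoc_wins(baseline: ReplayResult, candidate: ReplayResult, min_gain: float = 0.10) -> bool:
    # Assumed rule: the candidate must beat the baseline by at least min_gain
    # on cost or on latency, and must not regress on the other metric.
    if baseline.cost_usd <= 0 or baseline.p95_latency_ms <= 0:
        raise ValueError("baseline measurements must be positive")
    cost_gain = (baseline.cost_usd - candidate.cost_usd) / baseline.cost_usd
    latency_gain = (baseline.p95_latency_ms - candidate.p95_latency_ms) / baseline.p95_latency_ms
    wins_on_cost = cost_gain >= min_gain and latency_gain >= 0
    wins_on_latency = latency_gain >= min_gain and cost_gain >= 0
    return wins_on_cost or wins_on_latency


# Made-up numbers for illustration only.
managed = ReplayResult("managed baseline", cost_usd=100.0, p95_latency_ms=800.0)
marginal = ReplayResult("BYOC candidate A", cost_usd=95.0, p95_latency_ms=780.0)
cheaper = ReplayResult("BYOC candidate B", cost_usd=80.0, p95_latency_ms=800.0)
slower = ReplayResult("BYOC candidate C", cost_usd=70.0, p95_latency_ms=900.0)

for candidate in (marginal, cheaper, slower):
    print(candidate.label, "wins" if byoc_wins(managed, candidate) else "does not win")
```

A marginal improvement does not clear the bar, and neither does a cost win paid for with worse latency. Only a measured gain on the same captured traffic counts.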

Start with a free scan.

No payment method, metadata only. Get a Workload Reuse Score and the reuse waterfall, then decide what is worth paying to fix.