How to benchmark agent workloads without storing raw prompts

The instinct when benchmarking an agent is to capture everything - full prompts, full completions, every tool call - and analyze later. It is also the instinct that turns a benchmark into a data-retention problem and a security review you did not budget for. The good news is that most of the numbers worth having do not need the raw content at all.

You can compute opportunity, capture, and latency from metadata, and reserve raw samples for the rare case that genuinely needs them.

Hash the prefix, do not keep it

To know whether a reusable block recurs, you do not need the block. You need a stable fingerprint of it. Hash each candidate-reusable segment and key a prefix family on the hash. Two requests with the same fingerprint belong to the same family; you can count recurrence, measure reuse windows, and build the opportunity ratio without ever storing a byte of the underlying text.

Salt and rotate the hashing scheme so fingerprints are not reversible across tenants, and you have a recurrence signal that survives a privacy review intact.

Capture comes from usage, not content

Realized reuse is even easier to measure privately, because the provider hands it to you. OpenAI reports cached tokens in the usage block; other providers expose similar counters or runtime signals. Capture rate is realized reused tokens over candidate reusable tokens, and both terms come from token counts and usage metadata, not from reading the prompt. Tag each measurement with an evidence level so a provider-reported number is never confused with a router-inferred estimate.

TTFT and prefill behavior are the same story: they live in latency traces and token counts. None of it requires the text.

When you do need raw samples

Some questions genuinely need content - debugging why a specific prefix is not matching, or validating that a serialization fix did what you think. For those, take a small, time-boxed, encrypted sample with an explicit retention limit and a deletion path, rather than capturing everything by default. The sample is the exception that proves the rule, not the standard operating mode.

Treat the full-fidelity capture like a controlled substance: minimal quantity, clear expiry, audited access.

The benchmark is more trustworthy this way

Metadata-only benchmarking is not a privacy compromise that costs you rigor. It often improves it. Fingerprints force you to define what "the same prefix" means precisely. Usage-based capture ties your numbers to what the provider actually billed. And because you are not sitting on a lake of raw prompts, the benchmark is something you can run on a customer's traffic without a month of legal review first. That is the difference between a method you can deploy and a slide you can only show.

How to benchmark agent workloads without storing raw prompts

Hash the prefix, do not keep it

Capture comes from usage, not content

When you do need raw samples

The benchmark is more trustworthy this way

Keep going.

A repeated prompt is not a cache hit

Scoring a workload before you change infrastructure

Anthropic vs OpenAI prompt caching, measured

Turn the idea into a measurement.