Field Guide · framework
Why Your AI Agent's 95% Accuracy Is 60% in Production
A per-step reliability number is a tell. A 95% reliable agent over a ten-step task is 60% reliable end-to-end. The math is structural, and the demo is where it hides.
On r/AI_Agents in late 2025, a piece of math began circulating among non-engineer SMB buyers: an agent that is 95% reliable per step is only 60% reliable over ten steps, because reliability compounds. The thread that surfaced it (October 2025, 'The AI agent you're building will fail in production') accumulated 115 upvotes on the top comment and reshaped how the category's most-engaged buyers talk about evaluation. The math is not new — it has been a standard observation in software reliability for decades. What is new is that the math is now in the SMB buyer's vocabulary.
This essay is the structural explanation of that math, what it means for AI-agent purchase decisions, and how the architecture of an agent — not the marketing of the agent — determines whether it survives the compounding.
Why it matters
A vendor's demo runs one polished step on hand-curated data; a production deployment runs ten messy steps on real customer data. The demo cannot show you the agent's production reliability. The structural answer to whether the agent survives production is not in the demo; it is in the architecture.
The cancellation wave the AI-agent buyer voice has been narrating in early 2026 — the 11x and Artisan reversals, the Zendesk AI Copilot money-pit pattern — is not a story about bad vendors. It is a story about buyers who signed on a demo that hid the compounding math, and discovered it in production. The math was always going to assert itself. The bad pattern is buying without checking for it.
The compounding math, in three sentences
If an agent succeeds at each step with probability p, and the task requires n sequential steps, the probability the agent succeeds end-to-end is p raised to the nth power.
p = 0.95, n = 10, end-to-end ≈ 0.599. About sixty percent of the time, the agent finishes the task. The other forty percent of the time, the agent fails somewhere along the chain.
p = 0.99, n = 10, end-to-end ≈ 0.904. Even at ninety-nine percent per step, one in ten tasks fails. At ninety-five percent per step, four in ten fail.
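If you want to run the numbers yourself, the arithmetic is one line. The sketch below is plain Python, not anything vendor-specific, and it carries the same assumption the formula does: step failures are independent.

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Probability an agent finishes an n-step task, assuming each step
    succeeds independently with probability per_step."""
    return per_step ** steps

print(end_to_end_reliability(0.95, 10))  # ~0.599: four in ten tasks fail
print(end_to_end_reliability(0.99, 10))  # ~0.904: one in ten tasks fail
print(0.95 ** (1 / 10))                  # ~0.9949: per-step rate needed for 95% end-to-end over ten steps
```

Read the third line the other way around: to deliver ninety-five percent end-to-end over ten steps, the agent has to be better than 99.4% reliable at every single step.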
Why demos hide this
A vendor demo runs one polished step. Step one is almost always the one the vendor has tuned hardest, on data the vendor has curated to look like the buyer's data. The success rate on step one is the success rate the vendor shows. The buyer reasonably extrapolates.
Production runs n steps. The data is dirty. The customer's CRM has three spellings of every name. The Slack channel the agent monitors has a year of legacy context the demo didn't include. The integrations behave slightly differently than the docs say they do. Each variance compounds.
Compounding is not a vendor's fault per se. It is structural to multi-step autonomous work. The vendor's responsibility is to publish the architecture that handles the compounding, not to deny it.
The architectural responses, ranked
There are four architectural moves an AI-agent platform can make against compounding reliability decay. They are not exclusive; the best architectures use all four.
One. Eval suites per agent. Run the agent against task-specific tests and edge-case scenarios at formation time and on every model update. Report the per-step success rate at per-test granularity. The buyer reads the eval numbers, not the marketing.
Two. Canary deployment. Don't deploy a new agent build to all customers at once. Run on a fraction first; measure end-to-end success on real production data; compare against the prior build. Roll forward only if the eval holds.
Three. Refuse-tier discipline. The agent is constitutionally forbidden from acting past where it can verify. Below the verification threshold, the agent escalates, surfaces uncertainty, or refuses, rather than guessing forward and compounding the next step's reliability against an already-shaky one. A sketch of this gate follows the list.
Four. Configuration-layer ownership of failure. When the agent fails in production, the fix is not the customer's job. The configuration agent on the vendor's side owns the retune, the data-quality fix, the prompt revision. The customer's role is to surface the failure; the vendor's role is to resolve it.
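None of the four moves above is published code from any particular vendor, but the refuse-tier idea in item three is concrete enough to sketch. The snippet below is a hypothetical illustration, with every name invented for the purpose: a gate that decides, after each step, whether that step's output is allowed to become the next step's input.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    ACT = auto()        # autonomous tier: carry the output into the next step
    ESCALATE = auto()   # hand the step to a human, with the uncertainty attached
    REFUSE = auto()     # constitutionally forbidden: do not guess forward

@dataclass
class StepResult:
    output: str
    verified: bool      # did an independent check (test, lookup, schema) pass?
    confidence: float   # agent's own estimate, 0.0 to 1.0

def gate(result: StepResult, threshold: float = 0.95) -> Action:
    """Decide whether a step's output may become the next step's input.
    The structural point: an unverified, low-confidence output never feeds
    step n+1, so its error cannot compound down the chain."""
    if result.verified:
        return Action.ACT
    if result.confidence >= threshold:
        return Action.ESCALATE  # plausible but unverified: surface it, don't ship it
    return Action.REFUSE
```

The threshold and the tier names here are stand-ins for whatever a given vendor's constitution actually encodes. The buyer-facing question is simply whether a gate like this exists at all, because it is the mechanism that stops one shaky step from compounding into the next.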
What to ask any vendor
Three questions, in order. They are the structural questions that separate vendors that have thought about compounding from vendors that have not.
Does each agent have a published per-task eval suite? If yes, can I read the eval results before I sign up? If no, the agent has not been measured against compounding, and the marketing reliability number is whatever the demo showed you.
What does the agent refuse to do? A specific list of refused work is the buyer-readable form of the refuse tier. A vendor that cannot name what their agent will not do is a vendor that has not designed the agent to stop at the verification boundary.
Who owns the fix when the agent misfires in production? If the answer is the buyer, you are also hiring a part-time AI-agent operator alongside the agent itself. If the answer is the vendor, the architecture matches the price.
The edge
The Sullivan & Cromwell hallucinated-citations filing in April 2026 is the canonical concrete case. The partner was not a junior attorney; the firm was not a junior firm; the model was not a primitive tool. The failure mode was the agent acting past where it could verify — fabricating citations rather than refusing to ship them — and the buyer was the one who paid the cost (sanctions, public letter to the judge, brand damage).
The structural answer is not 'AI is unreliable.' The structural answer is that the agent that drafted the brief had no constitutional discipline preventing the unverified-citation ship. A different agent, with a hard-coded refuse tier at the citation-verification boundary, could not have produced that filing.
We wrote the longer argument on that case at /guide/framework/ai-constitutions-prevent-sullivan-cromwell-failure. The point of citing it here is that the 95% math is not abstract. It produces real, dated, publicly-known failures, and the architecture that prevents them is published.
Honest take
We are not claiming Fidelic agents are 99.9% reliable per step, or that any AI agent currently is. The compounding math is structural to autonomous multi-step work on imperfect data. What we are claiming is that the architecture matters more than the headline reliability number, and that the four architectural moves above — eval suites, canary deployment, refuse-tier discipline, configuration-layer ownership — are the load-bearing ones.
On the eval suites: every Fidelic Roster agent has them, run at formation. The current published version covers task-specific tests and edge-case scenarios. The deeper version — buyer-readable eval results per agent — is on the roadmap and not yet shipped. When it ships, the buyer will be able to read the per-step success rate on the agent's actual production work before signing up.
On the refuse tier: every agent's constitution names refused work, and the four-tier authority model (autonomous / review-required / escalate / refuse) is published per agent on the Roster detail page. The buyer can read it before hiring.
The 95% number is a tell. A vendor that reports a per-step reliability number and lets the buyer assume it holds end-to-end is betting you will not do the math you should be doing. A vendor that reports the end-to-end number, publishes the eval suite, names the refuse tier, and takes ownership of production failure is doing the math you can hold them to.
Bring the per-step number to any AI-agent purchase decision. Raise it to the power of the number of steps the task requires. The result is what you will see in production. If the result is below your tolerance, the architecture has to make up the difference, or the agent will not survive.