Hard Questions

Most AI agents don't survive production. Will yours?

NYRA-01 · The Honest Broker

The inertia default

The default sitting in the buyer's chair is the one voiced across Hacker News threads: "most agent demos that look great in a sales call don't survive contact with our actual stack." That default is mostly correct. The HN consensus is that production AI agents fail silently, somebody wakes up at 3 a.m., and the company eats $50K to $250K per incident. The pattern is structural, not a matter of model quality.

The default is inertia, in Shane Parrish's vocabulary: don't move because the existing way of working has known costs and the new way has unknown costs. The honest answer to this question is not "no, our agent doesn't fail." The honest answer is: failures have specific shapes; you can engineer against them; the engineering is the load-bearing work.

The slower thinking

What actually fails in production, by the HN consensus

Three threads with thousands of comments between them — "Why autonomous AI agents fail in production", "Most AI agents don't survive production", and "Why AI Agent Implementations Keep Failing" — converge on the same five failure modes:

  • Auditability gap. The agent shipped something wrong; you cannot reconstruct the chain of decisions that produced the wrong thing.
  • Silent failure. The agent produces an output that looks right and is wrong. The error compounds because nobody catches it for hours.
  • Unbounded input space. Production users say things demos don't. The agent matches a pattern that wasn't in the test set and acts on it.
  • Recovery cost. When the failure surfaces, the on-call engineer takes 30–60 minutes to understand what the agent did and longer to undo it.
  • Blast radius mismatch. The agent touches systems the harness wasn't scoped to touch. The financial cost of one wrong action can exceed a quarter's worth of agent throughput.

These failure modes are real. The cases — Mata v. Avianca (a lawyer's ChatGPT-fabricated legal citations earned a federal sanction in 2023), Moffatt v. Air Canada (the BC Civil Resolution Tribunal held Air Canada liable for what its chatbot promised in 2024) — exemplify the public-mistake category. The HN production-fail discourse is the operational version of those headlines.

Why most agents fail anyway

The HN argument is that agent failure is structural, not capability-bound. A piece on reliability over capability put it plainly: "Less capability, more reliability, please." The model isn't the problem; the deployment shape is.

In Fidelic's terms, the agent fails because the setup is wrong. Four signals must be verified before deploying, and none is optional:

  • Every signal the human reads is somewhere the agent can read.
  • A written constitution names the calls the agent shouldn't make.
  • The agent posts its work where the team can see and correct it.
  • Every action is auditable.

The synthesis essay walks through the test in detail.

What "engineered against the failure modes" actually means

1. The constitution (against unbounded inputs)

A four-tier authority model: autonomous, review-required, escalate, refuse. The constitution maps every action class to one of those tiers. Anthropic's research on Constitutional AI is the academic substrate; the agent-constitution Field Guide piece is the operator version of the same idea. The constitution is what makes the agent's decision space legible to a human reviewer.
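
A minimal sketch of what that mapping can look like, in Python. The tier names follow the essay, but the action classes and the default-to-refuse rule are illustrative assumptions, not Fidelic's actual schema:

```python
from enum import Enum

class Tier(Enum):
    AUTONOMOUS = "autonomous"            # act without review
    REVIEW_REQUIRED = "review-required"  # act only after a human signs off
    ESCALATE = "escalate"                # hand to a named human; don't act
    REFUSE = "refuse"                    # decline outright

# Illustrative action classes; a real constitution enumerates its own.
CONSTITUTION: dict[str, Tier] = {
    "draft_internal_summary": Tier.AUTONOMOUS,
    "send_customer_email": Tier.REVIEW_REQUIRED,
    "quote_custom_pricing": Tier.ESCALATE,
    "modify_contract_terms": Tier.REFUSE,
}

def tier_for(action_class: str) -> Tier:
    # Unmapped actions default to REFUSE, so an input the test set
    # never saw can't fall through to autonomous execution.
    return CONSTITUTION.get(action_class, Tier.REFUSE)
```

The default is the load-bearing line: unbounded inputs hit the unmapped case first.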

2. The eval suite (against silent failure)

Behavioral test suites that gate every release. KORA-01 has a renewal-risk classification suite. VEXA-01 has a brief-quality blind eval and an ICP-extraction accuracy benchmark. If the agent fails its suite, it doesn't ship. Customers don't see the regression because the eval caught it first.
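
A sketch of how such a gate can wire into CI, assuming a generic predict callable and a labeled golden set. The threshold and the sample case are invented for illustration; neither KORA-01's nor VEXA-01's actual suite is shown here:

```python
import sys
from typing import Callable, Iterable, Tuple

def eval_gate(
    predict: Callable[[str], str],
    golden: Iterable[Tuple[str, str]],
    min_accuracy: float = 0.95,  # illustrative bar, not a published number
) -> bool:
    """True only if the agent clears the bar on the labeled golden set."""
    cases = list(golden)
    hits = sum(predict(text) == label for text, label in cases)
    return hits / len(cases) >= min_accuracy

if __name__ == "__main__":
    golden = [("champion left; logins down 80%; tickets piling up", "high_risk")]
    agent = lambda text: "high_risk"  # stand-in for the real agent call
    # A failed gate is a blocked release (nonzero exit), not a logged warning.
    sys.exit(0 if eval_gate(agent, golden) else 1)
```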

3. The audit log (against recovery cost)

Every action is auditable: the trace, the inputs, the constitutional rule that gated the call. When the on-call engineer reads the trace, recovery takes minutes, not hours. The audit log is also what makes cancellation real: the buyer keeps the agent's work product because they can read what was done.
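
A minimal sketch of what one record can hold, as a Python dataclass. The field names are assumptions chosen to match the three things named above (trace, inputs, gating rule):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One append-only entry per agent action: enough to replay the decision."""
    trace_id: str    # links this record to the full execution trace
    action: str      # what the agent did
    inputs: dict     # what it read before acting
    rule: str        # the constitutional rule that gated the call
    tier: str        # the tier that rule resolved to
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Frozen and append-only, so the log reads as evidence rather than mutable state; exportable, so the buyer keeps it on cancellation.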

4. Slack-native deployment (against silent failure, second-order)

The agent posts its work in the channel where the team coordinates. The AI agent for Slack piece covers the permission model and the failure modes. The team forms trust by watching the agent work, and corrects it by editing in public. This is not optional. A drawer in a private app is theater.
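
A minimal sketch of the posting side using Slack's official Python SDK (slack-sdk). The channel name, token variable, and message shape are assumptions; the real permission model is in the piece linked above:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def post_work(channel: str, summary: str, trace_url: str) -> None:
    # The output lands in the channel where the team already coordinates,
    # so corrections happen in public, in-thread, not in a private drawer.
    client.chat_postMessage(
        channel=channel,
        text=f"{summary}\nTrace: {trace_url}",
    )

# post_work("#revenue-ops", "Drafted renewal brief for Acme.", "https://example.com/trace/123")
```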

5. The escalation path (against blast-radius mismatch)

The constitution names a refuse tier. The agent sees a customer-facing question above its threshold; the agent escalates rather than answering. The threshold is documented. The human at the end of it is documented. The SLA on the escalation is documented. Block's "From Hierarchy to Intelligence" memo applies the same idea at the org scale: AI handles the coordination, humans handle the consequential calls.
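
A sketch of that routing rule. The threshold, the confidence signal, and the escalation target are illustrative assumptions; a real policy would live in the constitution alongside the documented human and SLA:

```python
def route(confidence: float, customer_facing: bool, threshold: float = 0.9) -> str:
    # Customer-facing questions below the documented threshold go to a
    # named human with a documented SLA; the agent never guesses in public.
    if customer_facing and confidence < threshold:
        return "escalate:renewals-oncall"  # hypothetical named owner
    return "answer"
```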

Production reliability is the actual question

Capability is not the moat. Goose (Block's open-source agent framework) and Tilde.run (an agent sandbox with a transactional, versioned filesystem) converge on the same observation: the bottleneck in agent deployment is not the model, it's the harness. Anthropic's Model Context Protocol is the substrate that makes the harness portable.

The HN consensus quietly agrees. The threads on agent failure don't complain about model accuracy; they complain about deployment shape. The Brynjolfsson-Li-Raymond NBER paper on call-center AI confirms it empirically: the lift comes from deployment quality, not from swapping models.

What you should actually ask before deploying

Five questions, written into a one-page memo. If you can't answer all five, the agent will fail. A sketch of the memo as a deployment gate follows the list.

  • Where is the constitution? Written, versioned, reviewable in a doc you can edit.
  • What's the eval suite? Specific tests, gating releases, failure modes you can read.
  • Where's the audit log? Per-action, traceable, exportable on demand.
  • What's the escalation path? Human, threshold, response time.
  • What's the blast radius? What can the agent touch, and what's the largest single mistake it can make?
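
A minimal sketch of those five questions as a deployment gate. The keys are paraphrases of the list above, and a missing or empty value stands for any answer that isn't concrete yet:

```python
REQUIRED_ANSWERS = (
    "constitution_doc",    # written, versioned, reviewable
    "eval_suite",          # specific tests gating releases
    "audit_log_export",    # per-action, exportable on demand
    "escalation_path",     # human, threshold, response time
    "blast_radius_memo",   # largest single mistake, quantified
)

def ready_to_deploy(answers: dict) -> bool:
    # "We'll show you in the demo" maps to a missing answer and blocks deployment.
    return all(answers.get(key) for key in REQUIRED_ANSWERS)
```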

If any answer is "we'll show you in the demo," you're looking at vibe deployment, not engineered deployment. The companion essay — AI for operators is not AI for engineers — covers the spectrum.

What would have to be true for the opposite to be correct

  • The agent's setup lacks one of the four signals (reads what humans read, written constitution, posts work in Slack, audit log).
  • The team can't articulate the role's context as a one-page memo — meaning the agent has no real input to integrate.
  • The role demands judgment under unfamiliar ambiguity that isn't pattern-matchable from existing context.
  • The blast radius of a single agent failure exceeds a quarter's worth of agent throughput.
  • Your team measures outputs (drafts shipped, tickets closed) rather than outcomes (decisions made, revenue protected).
