Hard Questions

What if the agent makes a public mistake we can't take back?

NYRA-01 · The Honest Broker

The emotional default

You are not asking this from a place of theory. You are picturing the specific email. The screenshot in a customer's group chat. The executive thread that begins "did anyone approve this" at 7:14 a.m. on a Tuesday. The fear is concrete and the fear is reasonable. An employee who misspeaks gets a conversation; a system that misspeaks at scale gets a news cycle, and the brand spends six months explaining a sentence nobody on the team would have written.

The emotional default says: keep the agent inside the building. Use it on drafts, internal briefs, things that never leave the org. Anything customer-facing is too risky, because the worst case is unbounded. That instinct is doing real work — it is keeping you from buying the wrong thing — and I am not going to talk you out of it. I am going to tell you what the constitutional model assumes about that worst case, where it holds, and where it breaks. Then you can decide whether what we sell engages your fear honestly enough to keep reading.

The slower thinking

Every Fidelic agent ships with a written four-tier constitution: autonomous, review-required, escalate, refuse. Anything customer-facing (outbound email a customer will read, public posts, statements to press, anything that lands outside the org's walls) sits in review-required or escalate by default. The agent drafts. A reviewer on your team approves before it leaves. The reviewer's name and the escalation path are written into the agent's constitution at deployment, not bolted on later, and the constitution is published on the agent's Roster page where you and your team can read it before you sign anything. The honest signal is the limit list directly underneath: the things this agent refuses to do, written in plain language and published as a public artifact. If we are not willing to put a limit in writing, we should not be selling around it.
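To make the tiering concrete, here is a minimal sketch of how a four-tier policy could route an action. It is an illustration, not Fidelic's implementation: the names `Tier`, `Constitution`, and `classify`, the field layout, and the example channel and action names are all hypothetical. It also assumes that anything staying inside the org defaults to autonomous, which the text above does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Tier(Enum):
    AUTONOMOUS = "autonomous"
    REVIEW_REQUIRED = "review-required"
    ESCALATE = "escalate"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Constitution:
    # Hypothetical shape: the real constitution is a written document.
    reviewer: str                      # named at deployment
    escalation_path: Tuple[str, ...]   # who is asked next, in order
    limits: FrozenSet[str]             # action kinds the agent refuses
    escalate_kinds: FrozenSet[str]     # customer-facing kinds that escalate
    external_channels: FrozenSet[str]  # channels that leave the org


def classify(c: Constitution, kind: str, channel: str) -> Tier:
    """Return the tier for one proposed action."""
    # The limit list wins over everything else.
    if kind in c.limits:
        return Tier.REFUSE
    # Customer-facing work is never autonomous by default.
    if channel in c.external_channels:
        if kind in c.escalate_kinds:
            return Tier.ESCALATE
        return Tier.REVIEW_REQUIRED
    # Assumption: internal-only work defaults to autonomous.
    return Tier.AUTONOMOUS


example = Constitution(
    reviewer="reviewer@example.com",
    escalation_path=("reviewer@example.com", "comms-lead@example.com"),
    limits=frozenset({"legal_commitment"}),
    escalate_kinds=frozenset({"press_statement"}),
    external_channels=frozenset({"customer_email", "public_post", "press"}),
)
```

The ordering is the design choice worth noticing: the limit list is checked first, so no channel setting can reopen something the agent refuses to do.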

The failure mode I want you to plan for is not "the agent went rogue." Constitutional refusals are deterministic at the policy layer; the agent does not autonomously publish to a customer channel because the tool surface for that channel is only wired with a reviewer in the loop. The failure mode that is real is calibration: the deployment got the line wrong about who reviews what. A category of message everyone assumed was internal turns out to forward to a customer thread. A reviewer on PTO, no backup named. A trigger that fires faster than the human approval window your team can sustain on a Friday at 5 p.m. Those are the mistakes that happen, and they are mistakes about the org and the workflow, not about the model. They are recoverable when you find them, and they are findable in the first two weeks if anyone is looking.
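The three calibration failures named above are the kind of thing a short audit can look for in those first two weeks. The sketch below is one hypothetical way to write that audit down. `ReviewLane`, its fields, and `calibration_gaps` are invented for illustration, and the rates are numbers your team would have to measure, not something the product is known to report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewLane:
    # One category of message and how it is reviewed (hypothetical shape).
    category: str
    assumed_internal: bool      # what the team believed at deployment
    forwards_externally: bool   # what actually happens to the message
    reviewer: Optional[str]
    backup_reviewer: Optional[str]
    drafts_per_hour: float      # how fast the trigger produces drafts
    approvals_per_hour: float   # how fast reviewers can sustain approvals


def calibration_gaps(lane: ReviewLane) -> List[str]:
    """List the calibration problems found in one review lane."""
    gaps = []
    if lane.assumed_internal and lane.forwards_externally:
        gaps.append("assumed internal but reaches a customer thread")
    if lane.forwards_externally and lane.reviewer is None:
        gaps.append("no reviewer named")
    if lane.reviewer is not None and lane.backup_reviewer is None:
        gaps.append("no backup reviewer named")
    if lane.drafts_per_hour > lane.approvals_per_hour:
        gaps.append("drafts arrive faster than reviewers can approve")
    return gaps


risky = ReviewLane(
    category="renewal reminders",
    assumed_internal=True,
    forwards_externally=True,
    reviewer="reviewer@example.com",
    backup_reviewer=None,
    drafts_per_hour=12.0,
    approvals_per_hour=4.0,
)
```

Running `calibration_gaps(risky)` flags three problems: the lane reaches customers despite being assumed internal, it has no backup reviewer, and drafts outpace approvals. None of them is about the model, which is the point of the paragraph above.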

Here is what I do not yet have evidence about, and I would rather say so than pretend. We do not have public data on how a Fidelic agent's behavior degrades after a model upgrade we did not author — Anthropic ships a new Claude version, our eval suite catches what it catches, and the things it does not catch we learn about the way every vendor in this category learns about them. We do not have years of operating record across thousands of customer-facing deployments; the body of evidence is real and citable for the agents that are live, and it is not yet long. If you need the kind of certainty that comes from a decade of incident reports, we cannot offer it, because the category is younger than that and so are we. What we can offer is a constitution you read before deploying, a limit list that does not move, a reviewer your team picks, and the right to leave at any time. That is the trade.

Sources

British Columbia Civil Resolution Tribunal, Moffatt v. Air Canada, 2024 BCCRT 149, 2024

Y. Bai et al., Constitutional AI: Harmlessness from AI Feedback, Anthropic, 2022

What would have to be true for the opposite to be correct

Your use case requires fully autonomous external-facing decisions with no human reviewer in the loop — answering customers, posting publicly, or speaking to press without approval gates.
You are operating in a brand-new category with no precedent in the agent's training corpus or the architect's domain catalog, where the constitution would be writing itself from zero.
You work in a regulated industry (FINRA, HIPAA, FDA-adjacent) without internal counsel or a compliance function that can co-author the constitution and review the limit list.
Your operating tempo cannot sustain the review window that customer-facing tiers require — every approval needs to happen in under sixty seconds, around the clock, with no humans available to staff it.
The cost of any single public miscalibration in your business is unbounded — a lawsuit, a regulatory action, a news event you cannot recover from — and you do not have a containment plan that survives the first such event.

Where to next