The pilot worked. The agent read the case file, drafted the report, flagged the exception, and did it in seconds instead of the half-day a person needs. Everyone in the room agreed it was good. Then someone asked the question that kills agent projects in regulated firms: if we put this in production and it gets one wrong, who answers for it? The room went quiet, and the project went back into the sandbox.
This is the pattern. Across pharmacovigilance, anti-money-laundering, quality assurance, and the bank middle office, agent pilots are not failing on accuracy. They are failing on accountability. The agent can do the work. Nobody can prove what it was allowed to do, who signed off, or that the record of it is intact. In an industry where a person has to put their name to decisions, an actor that cannot be held to account does not graduate from pilot. It cannot.
The real reason pilots stall
It is tempting to blame model quality, integration complexity, or budget. Those are real, but they are not what stops the green-lit pilot from going live. The stop is governance, and it shows up as a set of questions a compliance lead is personally on the hook to answer.
- Who authorized this agent to take this action, on this date?
- Could the same agent that prepared the work also approve it?
- When a human signed off, can you prove it was a human, and which one?
- If a regulator subpoenas the trail, can you show it was not edited after the fact?
In a manual process, these answers are baked into how the work is organized. A maker prepares; a checker reviews; a quality unit signs; the records sit in a validated system. The control is the org chart. When you drop an agent into that process, the org chart no longer describes who did what — and the answers evaporate. The pilot stalls not because the agent is bad, but because the firm has lost the thing that made the manual process defensible.
Why "we'll watch it closely" is not a plan
The usual bridge from pilot to production is a human in the loop. A person checks the agent's output before it counts. This feels safe, and for a single low-volume workflow it can be. But it does not scale, and it does not produce evidence.
A reviewer skimming a queue of agent outputs leaves no structured record of what they actually verified. There is no proof the reviewer was distinct from the process that generated the work, no version history of what the agent was permitted to do that week, and no tamper-evident trail when an examiner arrives eighteen months later. "We watched it closely" is a statement of intent. It is not the kind of answer that survives discovery.
The deeper problem: human-in-the-loop, done informally, recreates the manual bottleneck you bought the agent to remove. You get the cost of human review without the evidentiary strength of a real control. That is the worst of both, and it is why "let's just review everything for now" pilots quietly never ship.
The false trade-off: speed or control
The unstated assumption behind a stalled pilot is that you face a choice. Move fast and accept that the agent is unaccountable, or stay accountable and lose the speed that justified the project. Most regulated firms, sensibly, choose accountability — and so the agent never leaves the lab.
That trade-off is false. It exists only because the limits on the agent live inside the agent: in its prompt, its tool list, its training. Those are not controls. A prompt instruction is a request the agent can ignore, drift from, or be talked out of, and it produces no record an auditor can examine. As long as the boundary lives where the agent can override it, you genuinely cannot have both speed and proof.
Move the boundary somewhere the agent cannot reach, and the trade-off disappears. The agent runs at full speed inside an enforced perimeter that records everything. Speed comes from the agent. Accountability comes from the layer around it. They stop competing.
What actually unblocks production
That layer is a control plane: a thin governance layer that sits between the agent and the real world, separate from the agent itself. The agent proposes an action; the control plane decides whether that actor is authorized, gates the action when a human must sign, and writes a tamper-evident record. It is the same idea as the org chart that made the manual process defensible — but enforced in software, on actors that are not people.
Four things turn a pilot into a production system an examiner can accept.
Named identity and deny-by-default authority. Every agent is a named principal holding one role, and that role can do only what it was explicitly granted — nothing else. The grants are versioned, so you can reconstruct exactly what the agent was permitted to do on any past date, and who approved each change. This is the answer to who authorized this action.
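A minimal sketch of what versioned, deny-by-default grants could look like. The `GrantRegistry` and `GrantVersion` names, the role, the action strings, and the dates are all hypothetical, chosen only for illustration; this is not a description of any particular product's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class GrantVersion:
    """One approved version of a role's permissions (hypothetical schema)."""
    effective_from: date
    actions: FrozenSet[str]
    approved_by: str  # who signed off on this change to the grants


class GrantRegistry:
    """Keeps every grant version so past permissions can be reconstructed."""

    def __init__(self) -> None:
        self._history: Dict[str, List[GrantVersion]] = {}

    def publish(self, role: str, version: GrantVersion) -> None:
        versions = self._history.setdefault(role, [])
        versions.append(version)
        versions.sort(key=lambda v: v.effective_from)

    def version_on(self, role: str, when: date) -> Optional[GrantVersion]:
        """Return the latest version in force on `when`, or None."""
        in_force = None
        for version in self._history.get(role, []):
            if version.effective_from <= when:
                in_force = version
        return in_force

    def is_allowed(self, role: str, action: str, when: date) -> bool:
        # Deny by default: an unknown role, no version in force, or an
        # action that was never granted all come back False.
        version = self.version_on(role, when)
        return version is not None and action in version.actions


# Illustrative data only.
registry = GrantRegistry()
registry.publish("case-drafter", GrantVersion(
    date(2025, 1, 1),
    frozenset({"read_case_file", "draft_report"}),
    "qa.lead"))
registry.publish("case-drafter", GrantVersion(
    date(2025, 6, 1),
    frozenset({"read_case_file", "draft_report", "flag_exception"}),
    "qa.lead"))

print(registry.is_allowed("case-drafter", "flag_exception", date(2025, 3, 1)))
print(registry.is_allowed("case-drafter", "flag_exception", date(2025, 7, 1)))
print(registry.is_allowed("case-drafter", "approve_report", date(2025, 7, 1)))
```

The three checks print `False`, `True`, and `False`: the exception-flagging grant did not exist in March, it did in July, and approval was never granted at all. Because old versions are kept rather than overwritten, `version_on` also returns who approved the grants in force on any given date.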
Structural segregation of duties. The agent that prepared a piece of work cannot be the one that approves it, and that is enforced inside the run rather than left to policy. This is the maker-checker principle, the oldest control in finance and quality, applied to machines. It maps directly onto the quality unit's separation of duties under 21 CFR 211.22 in pharma and onto the Wolfsberg Group's four-eyes standard in anti-money-laundering.
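As a rough illustration of "enforced inside the run", the check can live in the approval path itself, so a maker approving its own work is an error rather than a policy violation found later. The `WorkItem` type, the principal names, and the exception class below are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


class SegregationOfDutiesError(Exception):
    """Raised when the maker of a work item tries to act as its checker."""


@dataclass
class WorkItem:
    item_id: str
    prepared_by: Optional[str] = None
    approved_by: Optional[str] = None


def prepare(item: WorkItem, principal: str) -> None:
    if item.prepared_by is not None:
        raise ValueError(f"{item.item_id} was already prepared")
    item.prepared_by = principal


def approve(item: WorkItem, principal: str) -> None:
    if item.prepared_by is None:
        raise ValueError(f"{item.item_id} has not been prepared yet")
    # The maker-checker rule: the checker must be a different principal.
    if principal == item.prepared_by:
        raise SegregationOfDutiesError(
            f"{principal} prepared {item.item_id} and cannot approve it")
    item.approved_by = principal


# Illustrative run with hypothetical principals.
report = WorkItem("case-0001")
prepare(report, "agent:drafter")

blocked = None
try:
    approve(report, "agent:drafter")   # same principal: refused
except SegregationOfDutiesError as exc:
    blocked = str(exc)

approve(report, "agent:reviewer")      # different principal: accepted
```

The sketch compares principal names only. It assumes that named identities are already authenticated elsewhere, and that one operator cannot simply run two agents under different names.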
Human approval gates that mean something. Some actions are one-way doors: releasing a batch, filing a suspicious-activity report, pushing a change to live medical devices. For those, the run parks and demands a named human signature before it proceeds. A real approval gate can require a quorum, it bars the requester from approving their own request, and it captures the signer's reason verbatim, so the signature carries its meaning. That is what 21 CFR 11.50 has long required of electronic signatures, which must show the signer's name, the date and time, and the meaning of the signature. It is also the expectation behind a Qualified Person's batch certification under EU GMP Annex 16.
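The three properties of a gate (quorum, no self-approval, reason captured verbatim) fit in a few lines. This is a sketch under stated assumptions: the `ApprovalGate` class, the signer names, and the reasons are invented for illustration, and proving that the signer is a specific human is left to an authentication layer not shown here.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Signature:
    signer: str
    reason: str  # stored exactly as the signer wrote it


@dataclass
class ApprovalGate:
    action: str
    requested_by: str
    eligible_signers: FrozenSet[str]
    quorum: int
    signatures: List[Signature] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The quorum must be reachable without the requester's own signature.
        available = self.eligible_signers - {self.requested_by}
        if not 1 <= self.quorum <= len(available):
            raise ValueError("quorum cannot be met by eligible signers")

    def sign(self, signer: str, reason: str) -> None:
        if signer == self.requested_by:
            raise PermissionError("requester cannot approve their own request")
        if signer not in self.eligible_signers:
            raise PermissionError(f"{signer} is not an eligible signer")
        if any(s.signer == signer for s in self.signatures):
            raise ValueError(f"{signer} has already signed")
        if not reason.strip():
            raise ValueError("a reason is required")
        self.signatures.append(Signature(signer, reason))

    @property
    def released(self) -> bool:
        """True once enough distinct eligible signers have signed."""
        return len(self.signatures) >= self.quorum


# Illustrative 2-of-3 gate; the requester is eligible but barred here.
gate = ApprovalGate(
    action="release_batch",
    requested_by="qp.okoro",
    eligible_signers=frozenset({"qp.alvarez", "qp.chen", "qp.okoro"}),
    quorum=2,
)
gate.sign("qp.alvarez", "Reviewed batch record; all deviations closed.")
parked_after_one = not gate.released
gate.sign("qp.chen", "Independent check of release testing results.")
```

After the first signature the run is still parked. It proceeds only once the second distinct signer has signed, and each stored `Signature` keeps the reason as written.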
A tamper-evident, offline-verifiable trail. Every action, model call, and approval lands in an append-only, hash-chained, cryptographically signed ledger. Change one record and the chain visibly breaks. The export is an evidence bundle a third party can verify offline, with no access to your systems — the modern form of the tamper-evident audit trail 21 CFR 11.10(e) has demanded for decades.
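To make "change one record and the chain visibly breaks" concrete, here is a minimal hash-chain sketch using only the Python standard library. The entry layout and field names are assumptions, and the cryptographic signing step is deliberately left out: this shows the chaining only.

```python
import copy
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: Dict) -> str:
    # Canonical JSON so the same content always hashes the same way.
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(ledger: List[Dict], payload: Dict) -> None:
    """Add an entry whose hash covers its payload and the previous hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev_hash,
                   "payload": payload,
                   "hash": _digest(prev_hash, payload)})


def verify(ledger: List[Dict]) -> bool:
    """Recompute every link; needs nothing except the exported entries."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative entries with hypothetical actors and events.
ledger: List[Dict] = []
append(ledger, {"actor": "agent:drafter", "event": "draft_report"})
append(ledger, {"actor": "human:qp.chen", "event": "approve",
                "reason": "Independent check complete."})

# Edit one record in a copy of the export and re-verify.
tampered = copy.deepcopy(ledger)
tampered[0]["payload"]["event"] = "approve_report"

print(verify(ledger))
print(verify(tampered))
```

The intact ledger verifies and the edited copy does not. A hash chain on its own does not detect someone dropping the most recent entries, or rebuilding the whole chain from scratch. Signing the latest hash with a key the agent cannot reach, which this sketch omits, is what covers those cases.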
| Pilot question | What unblocks it |
|---|---|
| Who authorized this? | Versioned, deny-by-default grants |
| Could it approve itself? | Structural segregation of duties |
| Did a human really sign? | n-of-m approval gates with recorded meaning |
| Is the record intact? | Hash-chained, offline-verifiable audit |
Why no one is waiting for a rulebook
The accountability question is not waiting for a regulator to define it, so neither should you. There is no agent-controls template to build against: in April 2026 the Federal Reserve issued SR 26-2, replacing the old model-risk guidance and explicitly scoping agentic AI out, and the EU AI Act's high-risk obligations were deferred to December 2027. But a missing rulebook changes nothing about who answers for the decision. The rules that govern what a human in that seat must do never moved, and the question they create — prove what the actor was allowed to do, prove the record is intact — outlives any guidance cycle.
So the exposure is the same whether or not the template arrives. Examiners still ask, discovery still demands the trail, and a personally liable officer still has to account for the call. Shipping an agent without a control layer does not put you ahead. It accumulates undocumented decisions someone will have to answer for, with no record to do it, which is exactly the position the bank middle office keeps finding itself in.
The firms moving agents into production are not the ones with the best model. They are the ones that solved accountability first — so that when the room asks who answers for it, the answer is already written down, signed, and provable. You do not have to choose between getting agents live and keeping control. You have to put the control somewhere the agent cannot touch.