The pilot worked. The agent read the case file, drafted the report, flagged the exception, and did it in seconds instead of the half-day a person needs. Everyone in the room agreed it was good. Then someone asked the question that kills agent projects in regulated firms: if we put this in production and it gets one wrong, who answers for it? The room went quiet, and the project went back into the sandbox.

This is the pattern. Across pharmacovigilance, medical-device complaint handling, quality assurance, and the GMP shop floor, agent pilots are not failing on accuracy. They are failing on accountability. The agent can do the work. Nobody can prove what it was allowed to do, who signed off, or that the record of it is intact. In an industry where a person has to put their name to decisions, an actor that cannot be held to account does not graduate from pilot. It cannot.

The real reason pilots stall

It is tempting to blame model quality, integration complexity, or budget. Those are real, but they are not what stops the green-lit pilot from going live. The stop is governance, and it shows up as a set of questions a compliance lead is personally on the hook to answer.

Who authorized this agent to take this action, on this date?
Could the same agent that prepared the work also approve it?
When a human signed off, can you prove it was a human, and which one?
If a regulator subpoenas the trail, can you show it was not edited after the fact?

In a manual process, these answers are baked into how the work is organized. A maker prepares; a checker reviews; a quality unit signs; the records sit in a validated system. The control is the org chart. When you drop an agent into that process, the org chart no longer describes who did what, and the answers evaporate. The pilot stalls not because the agent is bad, but because the firm has lost the thing that made the manual process defensible.

Why "we'll watch it closely" is not a plan

The usual bridge from pilot to production is a human in the loop. A person checks the agent's output before it counts. This feels safe, and for a single low-volume workflow it can be. But it does not scale, and it does not produce evidence.

A reviewer skimming a queue of agent outputs leaves no structured record of what they actually verified. There is no proof the reviewer was distinct from the process that generated the work, no version history of what the agent was permitted to do that week, and no tamper-evident trail when an examiner arrives eighteen months later. "We watched it closely" is a statement of intent. It is not the kind of answer that survives discovery.

The deeper problem: human-in-the-loop, done informally, recreates the manual bottleneck you bought the agent to remove. You get the cost of human review without the evidentiary strength of a real control. That is the worst of both, and it is why "let's just review everything for now" pilots quietly never ship.

The false trade-off: speed or control

The unstated assumption behind a stalled pilot is that you face a choice. Move fast and accept that the agent is unaccountable, or stay accountable and lose the speed that justified the project. Most regulated firms, sensibly, choose accountability, and so the agent never leaves the lab.

That trade-off is false. It exists only because the limits on the agent live inside the agent, in its prompt, its tool list, its training. Those are not controls. A prompt instruction is a request the agent can ignore, drift from, or be talked out of, and it produces no record an auditor can examine. As long as the boundary lives where the agent can edit it, you genuinely cannot have both speed and proof.

Move the boundary somewhere the agent cannot reach, and the trade-off disappears. The agent runs at full speed inside an enforced perimeter that records everything. Speed comes from the agent. Accountability comes from the layer around it. They stop competing.

What actually unblocks production

That layer is a control plane: a thin governance layer that sits between the agent and the real world, separate from the agent itself. The agent proposes an action; the control plane decides whether that actor is authorized, gates the action when a human must sign, and writes a tamper-evident record. It is the same idea as the org chart that made the manual process defensible, but enforced in software, on actors that are not people.

Four things turn a pilot into a production system an examiner can accept.

Named identity and deny-by-default authority. Every agent is a named principal holding one role, and that role can do only what it was explicitly granted, nothing else. The grants are versioned, so you can reconstruct exactly what the agent was permitted to do on any past date, and who approved each change. This is the answer to who authorized this action.
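
A minimal sketch of versioned, deny-by-default grants, to make the "any past date" claim concrete. Everything named here (`GrantVersion`, `HISTORY`, the role, the action strings, the approvers) is hypothetical and chosen for illustration; it is not a description of any particular product's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GrantVersion:
    effective: date          # date this version took effect
    approved_by: str         # human who approved the change
    actions: frozenset[str]  # everything the role may do; nothing else


# Hypothetical grant history for a single role.
HISTORY: dict[str, list[GrantVersion]] = {
    "case-drafter": [
        GrantVersion(date(2025, 1, 6), "j.ortiz", frozenset({"read_case"})),
        GrantVersion(date(2025, 3, 3), "m.chen",
                     frozenset({"read_case", "draft_report"})),
    ],
}


def grants_on(role: str, on: date) -> GrantVersion | None:
    """Return the grant version in force for `role` on `on`, if any."""
    in_force = [v for v in HISTORY.get(role, []) if v.effective <= on]
    return max(in_force, key=lambda v: v.effective) if in_force else None


def is_authorized(role: str, action: str, on: date) -> bool:
    """Deny by default: an unknown role, a date before the first
    version, or an action not listed in the version all return False."""
    version = grants_on(role, on)
    return version is not None and action in version.actions
```

Asking `is_authorized("case-drafter", "draft_report", date(2025, 2, 1))` returns `False`, because on that date only the first version was in force, and `grants_on` names who approved it.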

Structural segregation of duties. The agent that prepared a piece of work cannot be the one that approves it. That is not a written policy the agent is asked to follow; it is enforced inside the run. This is the maker-checker principle, the oldest control in quality and safety, applied to machines. It maps directly onto 21 CFR 211.22, which sets out the pharma quality unit's separation of duties, and onto the four-eyes standard behind a qualified person's batch release under EU GMP Annex 16.
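
One way to picture "enforced inside the run" is a work item that refuses approval from whoever prepared it. The sketch below assumes principals are identified by plain strings; `WorkItem` and `SegregationError` are hypothetical names, not part of any named system.

```python
from __future__ import annotations


class SegregationError(Exception):
    """Raised when the preparer of a work item tries to approve it."""


class WorkItem:
    def __init__(self, item_id: str, prepared_by: str) -> None:
        self.item_id = item_id
        self.prepared_by = prepared_by      # the maker
        self.approved_by: str | None = None  # the checker, once approved

    def approve(self, principal: str) -> None:
        # The check lives in the object that holds the state, so no
        # prompt or instruction to the agent can bypass it.
        if principal == self.prepared_by:
            raise SegregationError(
                f"{principal} prepared {self.item_id} and cannot approve it"
            )
        self.approved_by = principal
```

The point of the design is where the check lives: in the layer that records the approval, not in instructions given to the agent.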

Human approval gates that mean something. Some actions are one-way doors: releasing a batch, setting the seriousness call on an adverse-event case that starts the 15-day reporting clock, filing a device reportability determination. For those, the run parks and demands a named human signature before it proceeds. A real approval gate can require a quorum, bars the requester from approving their own request, and captures the signer's reason verbatim, so the signature carries its meaning, the way 21 CFR 11.50 has long required of electronic signatures.
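
The three properties named above (quorum, no self-approval, reason captured verbatim) can be sketched in a few lines. This is an illustration under assumptions: `ApprovalGate`, its fields, and the signer names are hypothetical, and proving that a signer is a human is left to an identity system outside the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    action: str
    requested_by: str
    eligible: frozenset[str]  # named humans allowed to sign
    quorum: int               # signatures required (the n in n-of-m)
    signatures: dict[str, str] = field(default_factory=dict)  # signer -> reason

    def sign(self, signer: str, reason: str) -> None:
        if signer == self.requested_by:
            raise PermissionError("requester cannot approve their own request")
        if signer not in self.eligible:
            raise PermissionError(f"{signer} is not an eligible signer")
        if signer in self.signatures:
            raise ValueError(f"{signer} has already signed")
        if not reason.strip():
            raise ValueError("a signature must state its meaning")
        self.signatures[signer] = reason  # stored exactly as written

    @property
    def released(self) -> bool:
        """The parked run may proceed only once the quorum is met."""
        return len(self.signatures) >= self.quorum
```

With a quorum of two, the run stays parked after the first signature and proceeds only after a second, distinct eligible signer records a reason.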

A tamper-evident, offline-verifiable trail. Every action, model call, and approval lands in an append-only, hash-chained, cryptographically signed ledger. Change one record and the chain visibly breaks. The export is an evidence bundle a third party can verify offline, with no access to your systems. It is the modern form of the tamper-evident audit trail 21 CFR 11.10(e) has demanded for decades.
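
A rough sketch of why changing one record breaks the chain: each entry's hash covers the previous entry's hash plus its own payload. Two assumptions to flag loudly: the field names and helper functions are hypothetical, and the signature here is an HMAC from the Python standard library, used only as a stand-in. HMAC needs a shared secret, so a ledger that outside parties verify without your keys would use an asymmetric signature scheme instead.

```python
from __future__ import annotations

import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def _sign(entry_hash: str, key: bytes) -> str:
    # Stand-in signature (HMAC-SHA256); see the assumption in the lead-in.
    return hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()


def append(ledger: list[dict], payload: dict, key: bytes) -> None:
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry_hash = _digest(prev_hash, payload)
    ledger.append({"payload": payload, "prev": prev_hash,
                   "hash": entry_hash, "sig": _sign(entry_hash, key)})


def verify(ledger: list[dict], key: bytes) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = _digest(prev_hash, entry["payload"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        if not hmac.compare_digest(entry["sig"], _sign(expected, key)):
            return False
        prev_hash = expected
    return True
```

Verification needs only the exported entries and the verification key, which is what makes an offline check possible.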

Pilot question	What unblocks it
Who authorized this?	Versioned, deny-by-default grants
Could it approve itself?	Structural segregation of duties
Did a human really sign?	n-of-m approval gates with recorded meaning
Is the record intact?	Hash-chained, offline-verifiable audit

Why no one is waiting for a rulebook

The accountability question is not waiting for a regulator to define an agent-specific template, so neither should you. There may be no purpose-built agent-controls rulebook yet, and the EU AI Act's high-risk obligations slip well into 2027. But the predicate rules that govern what a human in that seat must do never moved. 21 CFR Part 11 requires recorded-meaning signatures and intact audit trails; ICH-GCP reserves the eligibility determination for a named investigator; 21 CFR Part 803 puts a regulatory-affairs reviewer behind a device reportability call; EU GVP fixes the expedited-reporting clock on a safety physician's seriousness assessment. These stand on their own. The demand they create is simple: prove what the actor was allowed to do, and prove the record is intact. That demand outlives any guidance cycle.

So the exposure is the same whether or not an agent-specific template arrives. Inspectors still ask, discovery still demands the trail, and a named, accountable signer still has to answer for the call. Shipping an agent without a control layer does not put you ahead. It accumulates undocumented decisions that someone will have to answer for with no record to do it, which is exactly the position a regulated quality and safety function exists to prevent.

The firms moving agents into production are not the ones with the best model. They are the ones that solved accountability first, so that when the room asks who answers for it, the answer is already written down, signed, and provable. You do not have to choose between getting agents live and keeping control. You have to put the control somewhere the agent cannot touch.
