The Governance Gap in Agent-Stack Thinking
Addy Osmani published The Agent Stack Bet a little while ago and it's getting the attention it deserves. He names four infrastructure bets that teams building production agents need to place: dedicated agent identity, universal context integration, persistent durable execution, and purpose-built platform primitives over DIY plumbing. The framing is right, and his list is almost complete.
Almost. What Osmani describes is the infrastructure that lets agents operate. He's largely silent on the infrastructure that makes operating them safe. The gap shows in month three, not the first sprint - when someone has to account for what the agent did, not just whether it's running.
The fifth bet is runtime governance. It is also the bet almost nobody has placed: the Cloud Security Alliance found that only 16% of enterprises currently govern AI agent access to core business systems effectively.
What the four bets actually buy you
The four bets are real and the industry is under-invested in all of them. Agent identity matters because agents operating on shared credentials are impossible to audit and trivially compromised. Context integration matters because an agent reasoning from thin or stale information is worse than useless - it's confidently wrong. Persistent durable execution matters because multi-step workflows that can't survive a restart or a credential rotation can't do real work. Building on platform primitives rather than hand-rolling infrastructure is sound engineering at any scale.
But notice what those four bets describe: the agent as a machine. A machine with an identity, access to data, an ability to run for a long time, and a well-built chassis. They don't describe who operates that machine, what it's allowed to do under different conditions, how you inspect what it did, or who is responsible when it acts outside its intended scope.
Osmani's piece is about infrastructure for building agents. When teams move from "this works in staging" to "this is running in production on 40 workflows", they discover that infrastructure is necessary but not sufficient. The gap is operational.
What governance debt actually looks like
Osmani calls this "governance debt" - his phrase for the silent accumulation of security and audit risk that eventually forces a full rewrite, usually right after the first incident that reaches the CISO. The diagnosis is right.
An agent with the four bets in place can run cleanly for weeks across dozens of production workflows. Then it takes an action that shouldn't have happened. Maybe it escalated a ticket to an external partner using a template that was out of date. Maybe it triggered a deployment to a production environment during a freeze window because it didn't have visibility into the freeze state. Maybe it queried a data source that had recently been reclassified as sensitive.
The incident review happens. The question is simple: why did it do that?
Agents do produce decision traces. The model usually surfaces what it reasoned about, what it called, and what it tried. The problem at production scale isn't the absence of traces - it's that raw model traces aren't structured for accountability. Without an audit trail that captures what context the agent saw at runtime, what policy it operated under, and what decision pathway led to that specific action, you can't answer the question that actually matters: why was it allowed to do that?
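To make that concrete: a minimal sketch of the record such a trail might be built from, keyed to the three questions above - what context the agent saw, what policy it operated under, and what pathway led to the action. Every name here is hypothetical; the shape is the point, not the schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentAuditEvent:
    """One append-only record per agent action - structured for accountability."""
    agent_id: str                     # dedicated agent identity, not a shared credential
    action: str                       # e.g. "ticket.escalate" or "deploy.trigger"
    occurred_at: datetime
    context_snapshot_ids: list[str]   # the exact context records the agent saw at runtime
    policy_version: str               # the policy in force when the action was evaluated
    decision_path: list[str]          # ordered steps: tools called, checks passed or skipped
    approved_by: str | None = None    # the human reviewer, where a checkpoint required one

AUDIT_LOG: list[AgentAuditEvent] = []  # in practice an append-only store, not a list
```

When a record like this doesn't exist, the question of why the agent was allowed to act has no answer.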
That's governance debt coming due. It's a showstopper. Engineering leadership, legal, compliance - they don't care how impressive the efficiency ratio is. They care whether you can account for what the system did. When it arrives, the failure tends to follow a recognisable shape: the agent ran cleanly for weeks, then took one action nobody had authorised. The question of who was responsible stalled the rollout regardless of how well the infrastructure had performed.
The three things governance actually is
Runtime governance covers three distinct functions, and they have to work together.
Policy enforcement
An agent with a valid identity and access to correct context can still take actions outside its intended scope. The distinction matters: identity establishes who the agent is, policy establishes what it can do right now. Those are different questions with different infrastructure answers. Osmani correctly argues that policy should be enforced at the platform level, not in application middleware. But that principle needs to be cashed out operationally. Runtime governance means a policy layer that evaluates each agent action against current rules before executing it, not after. Not a system prompt saying "don't touch production". An infrastructure-level enforcement point that determines what the agent can do before it does it.
The policy needs to be dynamic, too. A deployment agent that has write access during normal operating hours should not have the same access during an active incident, or during a code freeze, or when the target service is in a degraded state. Static permission grants don't handle this. Runtime policy enforcement does.
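As a sketch of what that enforcement point might look like - hypothetical names throughout, deliberately simplified - the static grant and the runtime conditions are evaluated together, before dispatch:

```python
from dataclasses import dataclass

# Hypothetical static grants, keyed by agent identity.
GRANTS = {"deploy-agent": {"deploy.read", "deploy.write"}}

@dataclass
class RuntimeState:
    active_incident: bool
    freeze_window: bool
    target_degraded: bool

def evaluate_policy(agent_id: str, action: str, state: RuntimeState) -> bool:
    """Evaluate the action against current rules BEFORE executing it."""
    # Static layer: does this identity hold the permission at all?
    if action not in GRANTS.get(agent_id, set()):
        return False
    # Dynamic layer: the same grant narrows under operational conditions.
    if action == "deploy.write" and (
        state.active_incident or state.freeze_window or state.target_degraded
    ):
        return False
    return True

def execute(agent_id: str, action: str, state: RuntimeState) -> None:
    if not evaluate_policy(agent_id, action, state):
        raise PermissionError(f"{agent_id} denied {action} under current conditions")
    # ...dispatch to the actual tool only after the policy gate passes
```

The detail that matters is that the gate sits in front of the tool call, at the infrastructure level, so no prompt and no application code can route around it.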
Context quality standards
This is the one that surprises most teams. You've built the context layer. You've integrated your sources. The agent, it seems, has what it needs.
Context has a quality dimension that's separate from its existence. A deployment record from three weeks ago tells you less than one from three hours ago. An ownership record created before a reorg may point to a team that no longer owns the service. A runbook never validated against current infrastructure may be accurate, or may be subtly wrong in ways that only show up in edge cases.
Without provenance tracking - where did this fact come from, when was it last verified, how should conflicts between sources be handled - the agent consumes data of unknown reliability. At small scale that's manageable. At production scale, with agents acting on context across hundreds of services simultaneously, an untracked staleness problem propagates into dozens of decisions before anyone notices. Governance includes the standards that keep context trustworthy, not just the pipeline for ingesting it.
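A sketch of what provenance tracking might look like at the record level - assuming a fact store where every entry carries its source, verification time, and an authority rank for conflicts. All names are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ContextRecord:
    fact: str
    source: str              # where this fact came from
    last_verified: datetime  # when it was last checked against reality
    authority: int           # rank used to resolve conflicts between sources

def fresh_enough(record: ContextRecord, max_age: timedelta) -> bool:
    # A record from three weeks ago is not the same input as one from three hours ago.
    return datetime.now(timezone.utc) - record.last_verified <= max_age

def resolve(a: ContextRecord, b: ContextRecord) -> ContextRecord:
    # Higher-authority source wins; ties go to the more recently verified record.
    return max(a, b, key=lambda r: (r.authority, r.last_verified))
```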
The governance question here is accountability: not just who built the pipeline, but who signs off that the context an agent is about to act on is trustworthy enough for the action it's about to take. That accountability has to be explicit. If it isn't, it defaults to nobody, which means the agent is operating without a quality floor.
Designed human oversight
Osmani mentions human-in-the-loop approval gates as part of his persistent execution bet. Right call, but the framing can be tightened. Human-in-the-loop should be a governance design pattern, not a recovery mechanism you activate when something goes wrong.
The difference is architecture. Recovery-mode oversight says: pause the agent when it's about to do something catastrophic. For that to work, you need to know in advance what "catastrophic" looks like, and you need to have defined the triggers correctly. In production, you won't always know. The novel failure modes - the ones that damage trust - are usually the ones nobody anticipated.
Designed oversight says: at these specific points in the workflow, a human reviews the agent's proposed action before it runs. Not because you expect failure, but because the workflow has high enough stakes that human judgment belongs in the loop by design.
When the ratio of agent actions to human decisions reaches production scale, the humans aren't reviewing everything - and they shouldn't be. The whole point is to get humans out of the routine path. Governance determines what the humans do review: the decision points where errors compound, where actions are irreversible, where the agent is operating at the edge of its validated context. You have to design those checkpoints in advance, not discover the need for them afterwards.
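Designed in advance, those checkpoints can be as plain as a declared table. A sketch, with hypothetical action names, of routing the high-stakes decision points to a human while the routine path stays automatic:

```python
# Declared per workflow, in advance - not discovered after an incident.
CHECKPOINTS = {
    "deploy.production": "require_approval",  # errors compound, rollback is costly
    "data.delete":       "require_approval",  # irreversible
    "ticket.external":   "require_approval",  # acts at the edge of validated context
    "deploy.staging":    "auto",              # routine path stays out of the review queue
}

def request_human_approval(action: str, proposed: dict) -> bool:
    # Placeholder: in practice this posts to a review queue and blocks on a named reviewer.
    raise NotImplementedError

def gate(action: str, proposed: dict) -> bool:
    # Unknown actions default to review, not to auto - the safe failure mode.
    mode = CHECKPOINTS.get(action, "require_approval")
    return True if mode == "auto" else request_human_approval(action, proposed)
```

Note the default: an action nobody classified gets a reviewer, not a free pass.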
Why the IDP is the natural governance layer
The hard parts of governance infrastructure are largely already built - for humans.
A mature internal developer portal already governs what developers can do. It controls which scaffolding templates are available. It enforces which deployment targets a team can push to. It gates access to production systems. It tracks ownership, so every service has a named team accountable for it. It records the relationship between teams, services, APIs, and dependencies.
Extending that governance to agents is not starting from scratch. The portal already knows the ownership graph. It already has the policy model for what different teams can access and change. It already maintains the service topology that tells you what an agent is allowed to touch on behalf of which team.
The audit trail question - who ran this, from what state, and when? - is the same question the portal already answers for human actions. The infrastructure for answering it is the same infrastructure runtime governance for agents needs.
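One way to picture the reuse - a sketch against a hypothetical facade over the portal's existing data, not any particular product's API:

```python
from dataclasses import dataclass

@dataclass
class Portal:
    """Hypothetical facade over an IDP's existing catalogue and policy model."""
    owners: dict[str, str]                     # service -> owning team (the ownership graph)
    agent_teams: dict[str, str]                # agent identity -> team it acts on behalf of
    policies: dict[tuple[str, str], set[str]]  # (team, service) -> permitted actions

    def authorise(self, agent_id: str, service: str, action: str) -> bool:
        # The same lookups the portal already performs for human actions.
        team = self.owners.get(service)
        if team is None or self.agent_teams.get(agent_id) != team:
            return False
        return action in self.policies.get((team, service), set())
```

Nothing in that function is new infrastructure; every lookup is against data the portal already maintains for humans.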
I've argued before that the biggest mistake platform teams make is treating agent deployment as a technical problem when it's an organisational one. You can't measure deployment frequency across your organisation until you agree on what a deployment is. No tool can solve that alignment problem for you. The same logic applies to governance. You can't enforce what agents are allowed to do until you've agreed on what they should be allowed to do - and that agreement has to exist at the team level, the service level, and the environment level simultaneously. The portal is where those agreements already live, because it's where platform teams have spent years capturing them.
Platform teams are positioned to own the governance layer because they already own the hard parts. They understand what "context quality" means in their environment because they've spent years keeping the catalogue accurate for humans. The policy model already exists because they've spent years managing what developers are allowed to do. Runtime governance for agents extends that practice.
The fifth bet
Osmani asks what happens to teams that don't place the four bets. They stay trapped at the demo stage - agents that impress in staging and fail in production.
The governance gap creates a different but equally costly trap. Teams place the four bets correctly. They build something that genuinely works. They scale it to production. Then they get shut down after the first serious accountability failure - not because the infrastructure was wrong, but because there was no governance layer to make it auditable, policy-constrained, and safe to operate at the scale they'd reached. Gartner projects that over 40% of agentic AI projects will be cancelled by 2027 due to inadequate risk controls.
Build it early. Policy enforcement, context quality standards, and designed human oversight are much cheaper to add before an agent is running 40 production workflows than after you're trying to reconstruct why one of them did something wrong.
The fifth bet doesn't generate the conference talks. It generates the confidence to keep the programme running past month three.
