Engineering
March 13, 2026 · 6 min read

The Next Outage Won't Be a Bug. It'll Be an Agent.

Agents are probabilistic actors operating inside deterministic systems. Every action is a small game of chance. As agents touch more production infrastructure, the failure model is changing—and so must the controls.

The Old Failure Model

For decades, production failures were mostly deterministic.

A bad config shipped. A dependency changed. A race condition escaped review. The root cause usually existed somewhere in code, and with enough logs and patience, engineers could reconstruct what happened.

Agents change that failure model.

Probabilistic Actors in Deterministic Systems

An agent is not just another automation script. It is a probabilistic actor operating inside deterministic systems.

It observes incomplete state, compresses that state into tokens, predicts an action, and often declares success before verifying that the environment actually reached the intended state. Even when the reasoning looks plausible, the action may still be wrong because the world it acted on was only partially understood.

The Core Problem

Every agent action is a small game of chance. Most of the time, it lands close enough to expectation. Occasionally, it does not.

Because the agent itself often cannot distinguish between "the action executed" and "the system reached the intended state," failures can compound silently across steps.

This is where outages begin.

Failure Modes You Haven't Seen Yet

The dangerous part is not that agents are unintelligent. It is that they are non-deterministic actors touching deterministic infrastructure.

Consider the failure patterns emerging in production:

  • Selector drift in a browser flow causes the agent to click the wrong element
  • Config changes applied to the wrong environment because the agent misread context
  • Permission inheritance where a cloud agent acquires broader permissions than intended
  • Remediation loops that retry against stale state until they amplify the original problem
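The last pattern is worth making concrete. A minimal guard, sketched below under the assumption that the agent can re-probe current state before each attempt (`read_state`, `apply_fix`, and `expected` are hypothetical stand-ins, not any real API), aborts the loop the moment the world no longer matches the plan's assumptions instead of blindly retrying:

```python
def remediate(read_state, apply_fix, expected, max_attempts=3):
    """Retry a fix, but re-read live state before every attempt."""
    for attempt in range(max_attempts):
        current = read_state()
        if current != expected:
            # The world moved since the plan was made; retrying the
            # original fix against stale state would amplify the problem.
            raise RuntimeError(
                f"state drifted: expected {expected!r}, got {current!r}"
            )
        if apply_fix():
            return True
    return False
```

The design choice is the order of operations: the state probe runs inside the loop, not once before it, so every retry is checked against reality rather than against the snapshot the agent started with.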

The Amazon "Kiro" incident is a perfect example: an AI coding assistant attempted terraform destroy -auto-approve on production infrastructure. The agent had valid credentials, good intentions, and SOP-compliant reasoning. It was trying to "help" fix a corrupted Terraform state file.

Kiro Demo - Predicate Authority blocking terraform destroy

A reenactment of the Kiro incident: Predicate Authority intercepts the destructive terraform destroy command at the OS level, before it can execute—even though the agent had valid AWS credentials. (Run the demo yourself)

The agent wasn't malicious. It was helpful. And that's worse—because you can't firewall against helpfulness with permission boundaries.

The Assumption That's Breaking

Traditional software assumed the person approving a change understood the system well enough to reason about blast radius.

That assumption weakens when the actor generating changes is a model that may produce a different action path under slightly different context, latency, or intermediate state.

The operational question is no longer:

Did the agent produce a reasonable answer?

It becomes:

Was the action bounded, did the environment change as expected, and can we prove what happened afterward?

The Three Requirements

That shift requires three things:

1. Authority before execution

Explicit limits on what an agent may touch. Not "the agent has AWS credentials" but "the agent can run terraform plan but not terraform destroy." This is what a runtime trust infrastructure provides—a policy-based execution gate that evaluates every action before it reaches the OS.
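In shape, such a gate is a default-deny lookup that runs before the shell does. A minimal sketch, assuming a simple allow/deny table keyed on command and subcommand (the `POLICY` table and `gate` function are illustrative, not Predicate Authority's actual API):

```python
import shlex

# Default-deny policy: anything not explicitly allowed is blocked.
POLICY = {
    ("terraform", "plan"): "allow",
    ("terraform", "apply"): "allow",
    ("terraform", "destroy"): "deny",
}

def gate(command: str) -> bool:
    """Evaluate a shell command against the policy before execution."""
    argv = shlex.split(command)
    verdict = POLICY.get(tuple(argv[:2]), "deny")  # unknown actions are denied
    return verdict == "allow"
```

Note that the gate ignores flags like -auto-approve entirely: the verdict turns on what the action is, not on how politely it was invoked.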

2. Verification after execution

Checking that state actually mutated as intended. Not "the agent said it worked" but "the URL now contains /checkout/success and the confirmation element exists in the DOM." This is deterministic verification—code-based assertions, not LLM-as-judge.
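As a sketch, the checkout example reduces to a plain boolean function over observed state. Here we assume the agent's harness can report the final URL and the set of element IDs present in the DOM; the function name and fields are illustrative:

```python
def verify_checkout(url: str, dom_ids: set[str]) -> bool:
    """Code-based success assertion: both conditions must hold in the
    real environment, regardless of what the agent claimed."""
    return "/checkout/success" in url and "confirmation" in dom_ids
```

The point is that the verdict is computed from the environment, not from the agent's transcript, so a confidently wrong agent still fails the check.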

3. Replay after failure

Preserving enough evidence to explain the incident. Not "we have logs" but "we can reconstruct exactly what the agent observed, what it decided, and what changed." This is trace-based observability—snapshots, context, and outcomes at every step.
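The per-step record can be small. A sketch, assuming one JSON record per action with illustrative field names (this is not any specific trace schema):

```python
import json
import time

def trace_step(observed, action, state_before, state_after):
    """Capture observation, decision, and outcome for one agent step."""
    record = {
        "ts": time.time(),
        "observed": observed,            # what the agent saw
        "action": action,                # what it decided to do
        "state_before": state_before,    # environment snapshot before
        "state_after": state_after,      # environment snapshot after
        "state_changed": state_before != state_after,
    }
    return json.dumps(record)
```

With records like this at every step, "we have logs" becomes "we can replay the run": the divergence between what the agent observed and what was actually true is visible in the trace itself.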

Without These

Agent success is often inferred rather than proven. And inferred success is dangerous in production.

The Scale Problem

As agents move beyond coding assistants into infrastructure, workflows, deployment systems, and operational tooling, the next serious outage is increasingly likely to come from an action that looked reasonable locally but was wrong globally.

Not because the model was malicious.

Because probability eventually loses to scale.

The more agents touch production, the less acceptable "probably correct" becomes as a control model.

Survival

The systems that survive this shift will not be the ones with the most autonomous agents.

They will be the ones where every important action is bounded, verified, and explainable.

This is the operational pattern that distinguishes governable agents from ungovernable automation: not better reasoning, but external accountability at the execution layer.

Add Execution Gates to Your Agents

Predicate Authority provides sub-millisecond policy enforcement for AI agents. Block destructive actions before they reach your infrastructure.

Read the Quickstart