March 26, 2026 · 10 min read

Policy Answers 'Can This Execute?' Verification Answers 'Did Reality Move?'

Why authorization alone doesn't solve agent correctness—and what does.

If you're building agent runtimes, you've probably implemented some form of policy enforcement. Maybe it's a sidecar that gates tool calls. Maybe it's an allowlist of permitted actions. Maybe it's an LLM-as-judge that reviews requests before execution.

These are all reasonable. They're also insufficient.

Policy enforcement answers one question: can this action execute? But there's a second question that matters more for correctness: did reality actually change?

This post explains why the distinction matters and how to architect systems that answer both.


The Illusion of Healthy Execution

Consider a browser agent tasked with applying a discount filter on an e-commerce site. The execution trace looks clean:

```
[14:23:01] ACTION: click(selector="#filter-discount")
[14:23:01] POLICY: action=browser.click resource=https://shop.example.com → ALLOWED
[14:23:02] RESULT: HTTP 200, no exception
[14:23:02] STATUS: step completed
```

Policy allowed it. The browser didn't throw. The API returned 200.

But the filter didn't apply. The page reloaded with identical content. The agent proceeds to extract "discounted" items that aren't discounted.

This is what we call locally valid, globally wrong.

The action was permitted. The execution succeeded. The outcome was incorrect.


Why Policy Enforcement Doesn't Solve Correctness

Policy enforcement operates at the request boundary. It intercepts an action before execution and decides: should this proceed?

A well-designed policy engine evaluates rules against principal, action, and resource:

```
// Simplified policy evaluation
for rule in rules {
    if matches(rule, request.principal, request.action, request.resource) {
        match rule.effect {
            Deny => return DENY,
            Allow => return ALLOW_WITH_MANDATE,
        }
    }
}
return DENY  // default: fail-closed
```

This is essential for security. You don't want an agent making unauthorized HTTP requests or reading files outside its scope. Policy enforcement prevents execution of disallowed actions.
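
The fail-closed loop above can be sketched as runnable Python. The `Rule` shape, field names, and string effects here are illustrative, not a real policy engine's API:

```python
# A runnable analogue of the simplified policy loop: first matching rule
# wins, and no match means deny (fail-closed). Rule shape is illustrative.
from dataclasses import dataclass

@dataclass
class Rule:
    principal: str
    action: str
    resource_prefix: str
    effect: str  # "allow" or "deny"

def evaluate(rules, principal, action, resource):
    for rule in rules:
        if (rule.principal == principal
                and rule.action == action
                and resource.startswith(rule.resource_prefix)):
            return rule.effect
    return "deny"  # no rule matched: fail-closed

rules = [Rule("agent-7", "browser.click", "https://shop.example.com", "allow")]
assert evaluate(rules, "agent-7", "browser.click", "https://shop.example.com/cart") == "allow"
assert evaluate(rules, "agent-7", "fs.read", "/etc/passwd") == "deny"
```

Note that the default return is a denial: an empty rule set permits nothing, which is the safe failure mode for an authorization gate.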

But policy can't answer:

  • Did the click actually trigger the expected state change?
  • Did the form submission succeed, or did it fail silently?
  • Is the page now showing the expected content?

Policy operates on intent. Verification operates on outcome.


The Failure Modes Policy Misses

Here are real scenarios where policy passes but execution fails:

1. Silent No-Op

The agent clicks a button. The click event fires. But the button was disabled by CSS, or the click handler had an early return, or the element was occluded by an overlay.

```
Policy:  ALLOWED (browser.click on target URL)
Reality: Button click had no effect
```

2. Wrong Object Selected

The agent was supposed to select "Premium Plan" but selected "Basic Plan" due to a stale DOM reference or an element that shifted during render.

```
Policy:  ALLOWED (dropdown.select action)
Reality: Wrong option selected, agent doesn't notice
```

3. Navigation Without State Change

The agent navigates to a new page. The URL changes. But the page content is a soft 404, or a loading state that never resolves, or an error page styled to look normal.

```
Policy:  ALLOWED (browser.navigate action)
Reality: Page in error state, extraction will fail
```

4. Race Condition Wins

The agent submits a form. The server accepts it. But a concurrent process (another agent, a cron job, a human) modified the underlying data between read and write.

```
Policy:  ALLOWED (form.submit action)
Reality: Stale write, data inconsistency
```

In all these cases, the logs look healthy. The policy trace shows green. The failure surfaces later—often as corrupted data or a confused downstream system.


Deterministic Post-Execution Verification

The solution is verification: deterministic assertions that run after execution to confirm the expected state change occurred.

In our Predicate Runtime SDK, this looks like:

```python
# Execute the action
await runtime.click(selector="role=button text~'Apply Filter'")

# Verify the outcome
runtime.assert_(
    snapshot_changed(),
    label="filter_applied",
    required=True,
)
```

The snapshot_changed() predicate compares a content digest before and after the action. If the digest is identical, the assertion fails—regardless of whether the click "succeeded."
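
The digest comparison can be sketched in a few lines, hashing the serialized page content before and after the action (the actual SDK internals may differ):

```python
import hashlib

def content_digest(dom_html: str) -> str:
    """Hash the serialized page content so two snapshots compare cheaply."""
    return hashlib.sha256(dom_html.encode("utf-8")).hexdigest()

def snapshot_changed(before_html: str, after_html: str) -> bool:
    """True only if the page content actually differs after the action."""
    return content_digest(before_html) != content_digest(after_html)

# A click that silently no-ops leaves the digest identical:
page = "<ul><li>Item A</li></ul>"
assert snapshot_changed(page, page) is False
assert snapshot_changed(page, "<ul><li>Item A (10% off)</li></ul>") is True
```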

Built-in Verification Predicates

| Predicate | What It Checks |
|---|---|
| url_contains(substring) | Current URL includes expected path |
| url_matches(regex) | URL matches pattern (useful for dynamic IDs) |
| exists(selector) | Element is present in DOM |
| not_exists(selector) | Element is absent (modal dismissed, error cleared) |
| snapshot_changed() | Page content hash differs from before |
| is_enabled(selector) | Form element is interactive |
| value_equals(selector, expected) | Input field contains expected value |
| element_count(selector, min, max) | Element count within range |

These predicates are evaluated against a fresh snapshot—not cached state, not the LLM's belief about the page, but the actual DOM at verification time.
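
Conceptually, each predicate is a pure function over a freshly captured snapshot. A minimal sketch, with the snapshot represented as a plain dict (the SDK's snapshot type and capture mechanism are not shown):

```python
# Predicates as pure functions over a snapshot dict (illustrative shape).
def value_equals(selector, expected):
    def check(snapshot):
        return snapshot["values"].get(selector) == expected
    return check

def element_count(selector, minimum, maximum):
    def check(snapshot):
        count = snapshot["counts"].get(selector, 0)
        return minimum <= count <= maximum
    return check

# The snapshot is captured at verification time, never reused from before
# the action:
fresh = {
    "values": {"input#email": "user@example.com"},
    "counts": {"li.cart-item": 3},
}
assert value_equals("input#email", "user@example.com")(fresh) is True
assert element_count("li.cart-item", 1, 10)(fresh) is True
```

Because each check is a deterministic function of the snapshot, the same state always yields the same verdict.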

Compositional Verification

Predicates compose logically:

```python
# All conditions must pass
all_of(
    url_contains("/checkout"),
    exists("role=heading text~'Order Summary'"),
    not_exists("role=alert"),
)

# Any condition passing is sufficient
any_of(
    exists("text~'Success'"),
    exists("text~'Order Confirmed'"),
)
```

This enables precise, readable verification that matches the expected UI state.
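
The combinators themselves are small. A hedged sketch, assuming predicates are plain callables from a snapshot to bool (the snapshot shape and selector strings here are hypothetical, not the SDK's syntax):

```python
# Minimal predicate combinators: short-circuiting AND / OR over callables.
def all_of(*predicates):
    """Pass only if every predicate passes."""
    return lambda snapshot: all(p(snapshot) for p in predicates)

def any_of(*predicates):
    """Pass if at least one predicate passes."""
    return lambda snapshot: any(p(snapshot) for p in predicates)

def url_contains(substring):
    return lambda snapshot: substring in snapshot["url"]

def exists(selector):
    return lambda snapshot: selector in snapshot["elements"]

def not_exists(selector):
    return lambda snapshot: selector not in snapshot["elements"]

snapshot = {
    "url": "https://shop.example.com/checkout",
    "elements": {"heading:Order Summary", "button:Place Order"},
}
checkout_ok = all_of(
    url_contains("/checkout"),
    exists("heading:Order Summary"),
    not_exists("role:alert"),
)
assert checkout_ok(snapshot) is True
```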


Why Not LLM-as-Judge?

A common alternative is to ask an LLM to verify outcomes. After each action, screenshot the page and prompt: "Did this action succeed?"

This approach has three problems:

1. Non-Determinism

The same screenshot, evaluated twice, may produce different answers. This makes debugging impossible and introduces flakiness that compounds across multi-step tasks.

2. Cost and Latency

Vision model inference adds 1-5 seconds per verification step. For a 20-step task, that's 20-100 seconds of verification latency alone—plus the token cost.

3. Ungrounded Reasoning

LLMs hallucinate about UI state. They'll confidently assert that a button was clicked when it wasn't, or that a form was submitted when the page shows an error.

Deterministic predicates have none of these problems. url_contains("/cart") either passes or fails. There's no interpretation, no variability, no hallucination.


Why Logs Look Healthy While Runs Drift

The fundamental issue is that most agent observability is action-centric, not outcome-centric.

Standard logging captures:

  • What action was requested
  • Whether policy allowed it
  • Whether the action threw an exception
  • What the response code was

This tells you the action executed. It doesn't tell you the action worked.

Consider this trace:

```
Step 1: click("Add to Cart") → ALLOWED, 200, no error
Step 2: click("Checkout") → ALLOWED, 200, no error
Step 3: fill("Email", "user@example.com") → ALLOWED, 200, no error
Step 4: click("Place Order") → ALLOWED, 200, no error
```

Looks perfect. But what if:

  • Step 1 added the wrong item (element index off by one)
  • Step 3 filled the wrong field (two email inputs on the page)
  • Step 4 triggered a validation error (rendered as a styled div, not a thrown exception)

Without verification, you don't know until the user complains.

With verification:

```
Step 1: click("Add to Cart")
→ verify: exists("text~'Widget X'") in cart → FAIL
→ artifact: screenshot showing "Widget Y" in cart
```

The failure surfaces immediately, with evidence.


The Predicate Systems Model

We frame the problem as two distinct concerns:

Policy = Execution Permission

Policy enforcement answers: Is this action allowed to execute?

This is a gate at the request boundary. It evaluates rules, checks principals, validates resources, and issues cryptographic mandates (short-lived tokens proving authorization).

Policy is about security and scope control. It prevents agents from doing things they shouldn't.

Verification = State Correctness

Verification answers: Did the action achieve the expected outcome?

This is a check after execution. It evaluates predicates against actual state, captures evidence, and gates step completion.

Verification is about correctness and reliability. It catches silent failures that policy can't see.


Architectural Implications

If you're building agent infrastructure, here's how to apply this:

1. Separate the Concerns

Don't try to make your policy engine also handle verification. They operate at different points in the lifecycle and require different data.

Policy needs: principal identity, action type, target resource, context labels.

Verification needs: browser state, DOM snapshots, network responses, before/after comparison.

2. Make Verification Mandatory

Every state-changing action should have a verification predicate. If the planner doesn't specify one, inject conservative defaults:

```python
if step.action == "CLICK" and not step.verify:
    step.verify = [snapshot_changed()]  # At minimum, something should change
```

3. Capture Evidence on Failure

When verification fails, capture the state that caused failure. Screenshots, DOM snapshots, network logs. This transforms "it didn't work" into "here's exactly what happened."

4. Use Fresh State for Verification

Never verify against cached or predicted state. Always snapshot immediately before evaluating predicates. State can change between action and verification.
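
Points 3 through 5 can be combined in one small sketch: capture state inside the verification routine itself, and raise on any failed predicate so the run halts with a clear reason. All names here are illustrative, not the SDK's actual API:

```python
# Snapshot-at-verification-time in miniature: capture_snapshot is invoked
# inside verify_step, so predicates never see cached or predicted state.
class VerificationError(Exception):
    pass

def verify_step(predicates, capture_snapshot, label):
    snapshot = capture_snapshot()  # fresh state, captured just before evaluation
    failed = [p.__name__ for p in predicates if not p(snapshot)]
    if failed:
        # Fail fast, fail loud: halt instead of compounding silent errors.
        raise VerificationError(f"{label}: failed predicates {failed}")
    return True

def cart_has_items(snapshot):
    return snapshot["cart_items"] > 0

# Passing case: the freshly captured state satisfies the predicate.
assert verify_step([cart_has_items], lambda: {"cart_items": 2}, "cart_check") is True
```

In a real runtime, the `except VerificationError` handler is also where evidence capture (screenshots, DOM dumps) would hook in.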

5. Fail Fast, Fail Loud

Verification failures should halt execution and surface clearly. The worst outcome is an agent that continues after silent failure, compounding errors.


Closing

Policy enforcement is necessary but not sufficient for agent correctness. It prevents unauthorized execution. It doesn't guarantee correct outcomes.

Verification fills the gap. Deterministic predicates, evaluated against actual state, provide the ground truth that policy can't.

If your agent logs look healthy but your outcomes are wrong, you have a verification problem. Policy answered "can this execute?" but nobody asked "did reality move?"

At Predicate Systems, we build infrastructure for both. Our Authority sidecar handles policy with cryptographic mandates. Our Runtime SDK handles verification with deterministic predicates.

Because "allowed to run" and "ran correctly" are different questions—and production agents need answers to both.