Policy Answers 'Can This Execute?' Verification Answers 'Did Reality Move?'
Why authorization alone doesn't solve agent correctness—and what does.
If you're building agent runtimes, you've probably implemented some form of policy enforcement. Maybe it's a sidecar that gates tool calls. Maybe it's an allowlist of permitted actions. Maybe it's an LLM-as-judge that reviews requests before execution.
These are all reasonable. They're also insufficient.
Policy enforcement answers one question: can this action execute? But there's a second question that matters more for correctness: did reality actually change?
This post explains why the distinction matters and how to architect systems that answer both.
The Illusion of Healthy Execution
Consider a browser agent tasked with applying a discount filter on an e-commerce site. The execution trace looks clean:
```
[14:23:01] ACTION: click(selector="#filter-discount")
[14:23:01] POLICY: action=browser.click resource=https://shop.example.com → ALLOWED
[14:23:02] RESULT: HTTP 200, no exception
[14:23:02] STATUS: step completed
```

Policy allowed it. The browser didn't throw. The API returned 200.
But the filter didn't apply. The page reloaded with identical content. The agent proceeds to extract "discounted" items that aren't discounted.
This is what we call locally valid, globally wrong.
The action was permitted. The execution succeeded. The outcome was incorrect.
Why Policy Enforcement Doesn't Solve Correctness
Policy enforcement operates at the request boundary. It intercepts an action before execution and decides: should this proceed?
A well-designed policy engine evaluates rules against principal, action, and resource:
```rust
// Simplified policy evaluation
for rule in rules {
    if matches(rule, request.principal, request.action, request.resource) {
        match rule.effect {
            Deny => return DENY,
            Allow => return ALLOW_WITH_MANDATE,
        }
    }
}
return DENY // default: fail-closed
```

This is essential for security. You don't want an agent making unauthorized HTTP requests or reading files outside its scope. Policy enforcement prevents execution of disallowed actions.
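The same fail-closed loop can be sketched as runnable Python; the rule shape and matching logic here are illustrative assumptions, not the actual engine's:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    principal: str  # e.g. "agent:shopper"
    action: str     # e.g. "browser.click"
    resource: str   # resource prefix this rule covers
    effect: str     # "allow" or "deny"

def matches(rule: Rule, principal: str, action: str, resource: str) -> bool:
    # Illustrative matcher: exact principal and action, prefix match on resource
    return (rule.principal == principal
            and rule.action == action
            and resource.startswith(rule.resource))

def evaluate(rules: list[Rule], principal: str, action: str, resource: str) -> str:
    for rule in rules:
        if matches(rule, principal, action, resource):
            return "DENY" if rule.effect == "deny" else "ALLOW_WITH_MANDATE"
    return "DENY"  # default: fail-closed

rules = [Rule("agent:shopper", "browser.click", "https://shop.example.com", "allow")]
print(evaluate(rules, "agent:shopper", "browser.click", "https://shop.example.com/cart"))
# → ALLOW_WITH_MANDATE
print(evaluate(rules, "agent:shopper", "fs.read", "/etc/passwd"))
# → DENY: no rule matched, so the fail-closed default applies
```

The first matching rule wins; anything unmatched is denied by default.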
But policy can't answer:
- Did the click actually trigger the expected state change?
- Did the form submission succeed, or did it fail silently?
- Is the page now showing the expected content?
Policy operates on intent. Verification operates on outcome.
The Failure Modes Policy Misses
Here are real scenarios where policy passes but execution fails:
1. Silent No-Op
The agent clicks a button. The click event fires. But the button was disabled by CSS, or the click handler had an early return, or the element was occluded by an overlay.
```
Policy: ALLOWED (browser.click on target URL)
Reality: Button click had no effect
```

2. Wrong Object Selected
The agent was supposed to select "Premium Plan" but selected "Basic Plan" due to a stale DOM reference or an element that shifted during render.
```
Policy: ALLOWED (dropdown.select action)
Reality: Wrong option selected, agent doesn't notice
```

3. Navigation Without State Change
The agent navigates to a new page. The URL changes. But the page content is a soft 404, or a loading state that never resolves, or an error page styled to look normal.
```
Policy: ALLOWED (browser.navigate action)
Reality: Page in error state, extraction will fail
```

4. Race Condition Wins
The agent submits a form. The server accepts it. But a concurrent process (another agent, a cron job, a human) modified the underlying data between read and write.
```
Policy: ALLOWED (form.submit action)
Reality: Stale write, data inconsistency
```

In all these cases, the logs look healthy. The policy trace shows green. The failure surfaces later—often as corrupted data or a confused downstream system.
Deterministic Post-Execution Verification
The solution is verification: deterministic assertions that run after execution to confirm the expected state change occurred.
In our Predicate Runtime SDK, this looks like:
```python
# Execute the action
await runtime.click(selector="role=button text~'Apply Filter'")

# Verify the outcome
runtime.assert_(
    snapshot_changed(),
    label="filter_applied",
    required=True
)
```

The snapshot_changed() predicate compares a content digest before and after the action. If the digest is identical, the assertion fails—regardless of whether the click "succeeded."
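A minimal sketch of how such a digest comparison might work. Unlike the SDK's snapshot_changed(), which captures snapshots itself, this standalone version takes the serialized content explicitly, and the helper names are assumptions:

```python
import hashlib

def digest(content: str) -> str:
    # Fingerprint a serialized page snapshot for cheap equality checks
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def content_changed(before: str, after: str) -> bool:
    # Identical digests mean the action was a no-op at the content level
    return digest(before) != digest(after)

before = "<ul><li>Item A $10</li><li>Item B $20</li></ul>"
after_noop = "<ul><li>Item A $10</li><li>Item B $20</li></ul>"
after_real = "<ul><li>Item A $8 (20% off)</li></ul>"

print(content_changed(before, after_noop))  # → False: the click changed nothing
print(content_changed(before, after_real))  # → True: the filter actually applied
```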
Built-in Verification Predicates
| Predicate | What It Checks |
|---|---|
| url_contains(substring) | Current URL includes expected path |
| url_matches(regex) | URL matches pattern (useful for dynamic IDs) |
| exists(selector) | Element is present in DOM |
| not_exists(selector) | Element is absent (modal dismissed, error cleared) |
| snapshot_changed() | Page content hash differs from before |
| is_enabled(selector) | Form element is interactive |
| value_equals(selector, expected) | Input field contains expected value |
| element_count(selector, min, max) | Element count within range |
These predicates are evaluated against a fresh snapshot—not cached state, not the LLM's belief about the page, but the actual DOM at verification time.
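A few of these predicates can be sketched as pure functions over a snapshot. In the SDK the runtime binds the fresh snapshot itself; here it is passed explicitly, and the snapshot shape is an assumption for illustration:

```python
import re

def url_contains(snapshot: dict, substring: str) -> bool:
    return substring in snapshot["url"]

def url_matches(snapshot: dict, pattern: str) -> bool:
    return re.search(pattern, snapshot["url"]) is not None

def exists(snapshot: dict, selector: str) -> bool:
    return selector in snapshot["elements"]

def not_exists(snapshot: dict, selector: str) -> bool:
    return selector not in snapshot["elements"]

# A freshly captured snapshot, reduced to the fields these checks need
snapshot = {
    "url": "https://shop.example.com/checkout?step=2",
    "elements": {"role=heading text~'Order Summary'", "#place-order"},
}
print(url_contains(snapshot, "/checkout"))   # → True
print(url_matches(snapshot, r"step=\d+"))    # → True
print(not_exists(snapshot, "role=alert"))    # → True: no error banner present
```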
Compositional Verification
Predicates compose logically:
```python
# All conditions must pass
all_of(
    url_contains("/checkout"),
    exists("role=heading text~'Order Summary'"),
    not_exists("role=alert")
)

# Any condition passing is sufficient
any_of(
    exists("text~'Success'"),
    exists("text~'Order Confirmed'")
)
```

This enables precise, readable verification that matches the expected UI state.
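The combinators themselves are simple. Here is a sketch of all_of and any_of as closures over zero-argument predicates (a simplification of whatever the SDK does internally):

```python
from typing import Callable

Predicate = Callable[[], bool]

def all_of(*preds: Predicate) -> Predicate:
    # Passes only if every child predicate passes
    return lambda: all(p() for p in preds)

def any_of(*preds: Predicate) -> Predicate:
    # Passes if at least one child predicate passes
    return lambda: any(p() for p in preds)

# Stand-ins for bound predicates like url_contains("/checkout")
on_checkout = lambda: True
has_summary = lambda: True
has_alert = lambda: False

check = all_of(on_checkout, has_summary, lambda: not has_alert())
print(check())  # → True: all three conditions hold
```

Because combinators return predicates, composites nest to arbitrary depth while staying deterministic.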
Why Not LLM-as-Judge?
A common alternative is to ask an LLM to verify outcomes. After each action, screenshot the page and prompt: "Did this action succeed?"
This approach has three problems:
1. Non-Determinism
The same screenshot, evaluated twice, may produce different answers. This makes debugging impossible and introduces flakiness that compounds across multi-step tasks.
2. Cost and Latency
Vision model inference adds 1-5 seconds per verification step. For a 20-step task, that's 20-100 seconds of verification latency alone—plus the token cost.
3. Ungrounded Reasoning
LLMs hallucinate about UI state. They'll confidently assert that a button was clicked when it wasn't, or that a form was submitted when the page shows an error.
Deterministic predicates have none of these problems. url_contains("/cart") either passes or fails. There's no interpretation, no variability, no hallucination.
Why Logs Look Healthy While Runs Drift
The fundamental issue is that most agent observability is action-centric, not outcome-centric.
Standard logging captures:
- What action was requested
- Whether policy allowed it
- Whether the action threw an exception
- What the response code was
This tells you the action executed. It doesn't tell you the action worked.
Consider this trace:
```
Step 1: click("Add to Cart") → ALLOWED, 200, no error
Step 2: click("Checkout") → ALLOWED, 200, no error
Step 3: fill("Email", "user@example.com") → ALLOWED, 200, no error
Step 4: click("Place Order") → ALLOWED, 200, no error
```

Looks perfect. But what if:
- Step 1 added the wrong item (element index off by one)
- Step 3 filled the wrong field (two email inputs on page)
- Step 4 triggered a validation error (rendered as a styled div, not a thrown exception)
Without verification, you don't know until the user complains.
With verification:
```
Step 1: click("Add to Cart")
→ verify: exists("text~'Widget X'") in cart → FAIL
→ artifact: screenshot showing "Widget Y" in cart
```

The failure surfaces immediately, with evidence.
The Predicate Systems Model
We frame the problem as two distinct concerns:
Policy = Execution Permission
Policy enforcement answers: Is this action allowed to execute?
This is a gate at the request boundary. It evaluates rules, checks principals, validates resources, and issues cryptographic mandates (short-lived tokens proving authorization).
Policy is about security and scope control. It prevents agents from doing things they shouldn't.
Verification = State Correctness
Verification answers: Did the action achieve the expected outcome?
This is a check after execution. It evaluates predicates against actual state, captures evidence, and gates step completion.
Verification is about correctness and reliability. It catches silent failures that policy can't see.
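Put together, the two concerns bracket execution: a policy gate before the action and a verification gate after it. A minimal lifecycle sketch, with every name hypothetical:

```python
class StepFailure(Exception):
    pass

def run_step(action: str, policy_allows, execute, verify) -> str:
    # 1. Policy: is this action allowed to execute?
    if not policy_allows(action):
        raise StepFailure(f"policy denied: {action}")
    # 2. Execute against the real environment
    execute(action)
    # 3. Verification: did reality actually move?
    if not verify():
        raise StepFailure(f"verification failed after: {action}")
    return "completed"

# An action that executes cleanly and verifies
print(run_step("browser.click", lambda a: True, lambda a: None, lambda: True))
# → completed

# An action that "succeeds" but changes nothing still fails the step,
# because verification gates completion on outcome, not execution
try:
    run_step("browser.click", lambda a: True, lambda a: None, lambda: False)
except StepFailure as exc:
    print(exc)  # → verification failed after: browser.click
```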
Architectural Implications
If you're building agent infrastructure, here's how to apply this:
1. Separate the Concerns
Don't try to make your policy engine also handle verification. They operate at different points in the lifecycle and require different data.
Policy needs: principal identity, action type, target resource, context labels.
Verification needs: browser state, DOM snapshots, network responses, before/after comparison.
2. Make Verification Mandatory
Every state-changing action should have a verification predicate. If the planner doesn't specify one, inject conservative defaults:
```python
if step.action == "CLICK" and not step.verify:
    step.verify = [snapshot_changed()]  # At minimum, something should change
```

3. Capture Evidence on Failure
When verification fails, capture the state that caused failure. Screenshots, DOM snapshots, network logs. This transforms "it didn't work" into "here's exactly what happened."
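Evidence capture can be as simple as bundling the failing check's context into a structured artifact; a sketch with hypothetical field names:

```python
import json
import time

def capture_evidence(label: str, snapshot: dict, note: str = "") -> str:
    # Bundle failure context into a structured, storable artifact
    artifact = {
        "label": label,
        "captured_at": time.time(),
        "url": snapshot.get("url"),
        "content_digest": snapshot.get("digest"),
        "note": note,
    }
    return json.dumps(artifact, indent=2)

snapshot = {"url": "https://shop.example.com/cart", "digest": "ab12cd34"}
print(capture_evidence("filter_applied", snapshot, note="snapshot unchanged after click"))
```

A screenshot or full DOM dump can be attached the same way; the point is that the artifact names the failed check, not just the failed step.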
4. Use Fresh State for Verification
Never verify against cached or predicted state. Always snapshot immediately before evaluating predicates. State can change between action and verification.
5. Fail Fast, Fail Loud
Verification failures should halt execution and surface clearly. The worst outcome is an agent that continues after silent failure, compounding errors.
Closing
Policy enforcement is necessary but not sufficient for agent correctness. It prevents unauthorized execution. It doesn't guarantee correct outcomes.
Verification fills the gap. Deterministic predicates, evaluated against actual state, provide the ground truth that policy can't.
If your agent logs look healthy but your outcomes are wrong, you have a verification problem. Policy answered "can this execute?" but nobody asked "did reality move?"
At Predicate Systems, we build infrastructure for both. Our Authority sidecar handles policy with cryptographic mandates. Our Runtime SDK handles verification with deterministic predicates.
Because "allowed to run" and "ran correctly" are different questions—and production agents need answers to both.