Operational Runbook

Step-by-step procedures for operating and troubleshooting the OpenClaw Predicate-Claw Plugin in production environments.

Keep this runbook handy for incident response and routine operations.


Quick Reference

Incident Type           Severity   First Response
Circuit breaker open    P1         Check sidecar health
Elevated deny rate      P2         Compare to policy changes
High latency            P3         Check sidecar resources
Audit export failures   P4         Check control plane connectivity

Prerequisites

Before using this runbook, ensure you have:


Incident Response Procedures

P1: Circuit Breaker Stuck Open

Symptoms:

Diagnosis Steps:

  1. Check sidecar health

    curl -s http://localhost:8787/health | jq .

    Expected: {"status": "healthy"}

  2. Check sidecar logs for errors

    journalctl -u predicate-authorityd -n 100 --no-pager
    # or
    docker logs predicate-authorityd --tail 100
  3. Verify network connectivity

    curl -w "@curl-format.txt" -s -o /dev/null http://localhost:8787/health
  4. Check control plane sync status

    curl -s http://localhost:8787/v1/sync/status | jq .
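
Step 3 reads a write-out template from curl-format.txt, which this runbook does not define. A minimal sketch of such a file follows. The file name comes from the command in step 3; the choice and labelling of fields is an assumption, using curl's standard --write-out timing variables.

```shell
# Create a write-out template for curl in the current directory.
# The field selection below is an assumption, not the plugin's official template.
# The heredoc delimiter is quoted so that %{...} and \n are written literally;
# curl itself turns each \n escape into a newline when printing.
cat > curl-format.txt <<'EOF'
   dns_lookup:  %{time_namelookup}s\n
  tcp_connect:  %{time_connect}s\n
   first_byte:  %{time_starttransfer}s\n
        total:  %{time_total}s\n
    http_code:  %{http_code}\n
EOF
```

Run the step 3 command from the directory that contains this file. A large gap between tcp_connect and first_byte would suggest the sidecar is slow to respond rather than unreachable.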

Resolution Steps:

  1. If sidecar is unhealthy:

    # Restart sidecar
    systemctl restart predicate-authorityd
    # or
    docker restart predicate-authorityd
  2. If sidecar is healthy but circuit is still open:

    • Circuit will auto-recover after resetTimeoutMs (default: 30s)
    • For immediate recovery, restart the provider process
  3. If control plane sync is failing:

    • Check control plane endpoint accessibility
    • Verify API credentials are valid
    • Check for control plane service incidents
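
After a restart, it can help to wait for the sidecar to answer before declaring the incident resolved. The sketch below polls an arbitrary check command; the helper name wait_for_healthy and the attempt and delay values in the example are illustrative assumptions, not part of the plugin.

```shell
# Poll a check command until it succeeds or the attempts run out.
# Usage: wait_for_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_healthy() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    # Success of the check command ends the wait immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    # Sleep only if another attempt will follow.
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (values are assumptions): restart, then wait up to about 30 seconds.
# systemctl restart predicate-authorityd
# wait_for_healthy 15 2 curl -fsS http://localhost:8787/health
```

Note that curl -f only checks the HTTP status code. It does not confirm that the body reports "status": "healthy", so follow up with the health check from the diagnosis steps.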

Escalation:


P2: Elevated Deny Rate

Symptoms:

Diagnosis Steps:

  1. Check deny rate trend

    # Query recent deny events
    curl -s "http://localhost:8787/v1/audit/decisions?outcome=deny&limit=50" | jq .
  2. Compare to recent policy changes

    • Check control plane for recent policy deployments
    • Review policy version in metrics
  3. Identify affected actions/resources

    # Group denials by action
    curl -s "http://localhost:8787/v1/audit/decisions?outcome=deny" | \
      jq -r '.items | group_by(.action) | map({action: .[0].action, count: length})'
  4. Check for attack patterns

    • Look for repeated denials from same principal
    • Check for unusual resource patterns (path traversal, etc.)
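
To spot repeated denials from the same principal (step 4), the denial list can be reduced to a frequency table. This is a sketch only: the helper name top_offenders is hypothetical, and the example assumes each audit item carries a .principal field, which this runbook does not confirm.

```shell
# Count repeated values read from stdin (one per line) and print the N most
# frequent as "COUNT VALUE", highest count first. N defaults to 10.
top_offenders() {
  limit=${1:-10}
  awk 'NF { count[$0]++ } END { for (k in count) print count[k], k }' |
    sort -k1,1nr -k2 |
    head -n "$limit"
}

# Example (the .principal field name is an assumption):
# curl -s "http://localhost:8787/v1/audit/decisions?outcome=deny" |
#   jq -r '.items[].principal' | top_offenders 5
```

A single principal dominating the table points toward a misconfigured client or a probing attempt; an even spread is more consistent with a policy change.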

Resolution Steps:

  1. If caused by policy change:

    • Roll back to the previous policy version via the control plane
    • Or fix the policy and redeploy
  2. If attack attempt:

    • Document attack patterns
    • Consider adding rate limiting
    • Report to security team
  3. If false positives:

    • Review policy rules for overly broad denials
    • Add specific allow rules for legitimate use cases

Escalation:


P3: High Authorization Latency

Symptoms:

Diagnosis Steps:

  1. Check current latency percentiles

    curl -s http://localhost:8787/metrics | grep predicate_auth_latency
  2. Check sidecar resource usage

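    # CPU and memory. pgrep needs -f here because the process name exceeds
    # the 15 characters it matches by default; -d, joins multiple PIDs for top.
    top -p "$(pgrep -d, -f predicate-authorityd)"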
    # or
    docker stats predicate-authorityd --no-stream
  3. Check control plane sync load

    curl -s http://localhost:8787/v1/sync/status | jq '.last_sync_duration_ms'
  4. Check concurrent request volume

    curl -s http://localhost:8787/metrics | grep predicate_auth_concurrent
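
The grep commands above print every matching sample. To surface only the samples above a threshold, a small filter over the Prometheus text format can help. This is a sketch under assumptions: the helper name samples_over is hypothetical, the exact metric names and units exposed by the sidecar are not stated in this runbook, and the filter assumes no trailing timestamps and no spaces inside label values.

```shell
# Print "NAME VALUE" for every sample whose metric name starts with PREFIX
# and whose value is greater than THRESHOLD.
# Reads Prometheus text exposition format on stdin; comment lines are skipped.
# Usage: samples_over PREFIX THRESHOLD
samples_over() {
  awk -v prefix="$1" -v limit="$2" '
    /^#/ { next }
    index($1, prefix) == 1 && ($2 + 0) > (limit + 0) { print $1, $2 }
  '
}

# Example (the 0.1 threshold and its unit are assumptions):
# curl -s http://localhost:8787/metrics | samples_over predicate_auth_latency 0.1
```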

Resolution Steps:

  1. If sidecar CPU is high:

    • Check for runaway policy evaluation
    • Consider scaling sidecar resources
    • Review policy complexity
  2. If sync is slow:

    • Check control plane latency
    • Consider increasing sync interval
    • Review policy size
  3. If high concurrent load:

    • Consider horizontal scaling
    • Review request batching options
    • Check for retry storms

Escalation:


P4: Audit Export Failures

Symptoms:

Diagnosis Steps:

  1. Check export error logs

    grep "audit.*error" /var/log/openclaw-provider/provider.log | tail -20
  2. Verify control plane connectivity

    curl -s https://control-plane.example.com/health
  3. Check export queue depth

    curl -s http://localhost:8787/metrics | grep predicate_audit_queue

Resolution Steps:

  1. If control plane unreachable:

    • Check network/firewall rules
    • Verify TLS certificates
    • Check for control plane incidents
  2. If queue is backed up:

    • Audit export is best-effort; authorization continues to work
    • Queued events are retried automatically
    • Check disk space for local buffer
  3. If credentials expired:

    • Rotate API credentials
    • Update provider configuration
    • Restart provider
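
For the "Verify TLS certificates" check, the expiry date of the certificate served by the control plane can be read with openssl. The sketch below is illustrative: the helper names are hypothetical, and control-plane.example.com is the placeholder host already used in the diagnosis steps.

```shell
# Extract the expiry date from "openssl x509 -noout -enddate" output,
# which has the form "notAfter=<date>".
parse_not_after() {
  sed -n 's/^notAfter=//p'
}

# Print the expiry date of the certificate served by HOST:PORT (port defaults to 443).
cert_not_after() {
  host=$1
  port=${2:-443}
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -enddate |
    parse_not_after
}

# Example (placeholder host from this runbook):
# cert_not_after control-plane.example.com
```

An empty result usually means the TLS handshake itself failed, which points back to the network and firewall checks.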

Escalation:


Routine Operations

Restarting the Provider

# Graceful restart (allows in-flight requests to complete; requires the unit to define ExecReload)
systemctl reload openclaw-provider

# Full restart
systemctl restart openclaw-provider

Rotating Credentials

  1. Generate new credentials in control plane
  2. Update provider configuration
  3. Restart provider
  4. Verify connectivity
  5. Revoke old credentials

Updating Policy

  1. Deploy new policy to control plane
  2. Monitor sync status on sidecars
  3. Watch deny rate for anomalies
  4. Roll back if issues are detected
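
Step 2 can be scripted by polling each sidecar until it reports the newly deployed version. This is a sketch under assumptions: the helper name wait_for_version is hypothetical, the version string v1.2.4 is made up for the example, and the response format of /v1/policy/version is not documented here, so the check is a plain substring match.

```shell
# Poll until the output of FETCH_CMD contains EXPECTED, or give up after ATTEMPTS tries.
# Usage: wait_for_version EXPECTED ATTEMPTS DELAY_SECONDS FETCH_CMD [ARGS...]
wait_for_version() {
  expected=$1
  tries=$2
  pause=$3
  shift 3
  while [ "$tries" -gt 0 ]; do
    # Substring match: "v1.2.3" would also match "v1.2.30", so pass a
    # string that is specific enough for your versioning scheme.
    case "$("$@" 2>/dev/null)" in
      *"$expected"*) return 0 ;;
    esac
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$pause"
    fi
  done
  return 1
}

# Example (version string is an assumption):
# wait_for_version v1.2.4 30 2 curl -s http://localhost:8787/v1/policy/version
```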

Scaling Sidecars

For high-load environments:

  1. Deploy additional sidecar instances
  2. Configure load balancer
  3. Update provider baseUrl to load balancer
  4. Verify even distribution
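
One way to verify even distribution (step 4) is to compare per-instance request counts against their mean. The sketch below assumes you have already collected "INSTANCE COUNT" lines, for example from load balancer statistics; how to obtain those counts is not covered by this runbook, and the helper name and the 20 percent tolerance are illustrative.

```shell
# Read "INSTANCE COUNT" lines on stdin and print the instances whose count
# deviates from the mean by more than PCT percent (default 20).
uneven_instances() {
  awk -v pct="${1:-20}" '
    NF >= 2 { name[NR] = $1; count[NR] = $2; total += $2; n++ }
    END {
      if (n == 0) exit
      mean = total / n
      for (i = 1; i <= NR; i++) {
        if (!(i in name)) continue
        dev = count[i] - mean
        if (dev < 0) dev = -dev
        if (mean > 0 && dev * 100 / mean > pct)
          printf "%s %d\n", name[i], count[i]
      }
    }
  '
}

# Example with made-up instance names and counts:
# printf 'sidecar-1 100\nsidecar-2 100\nsidecar-3 40\n' | uneven_instances 20
```

No output means every instance is within the tolerance.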

Health Checks

Provider Health

# Local provider health
curl -s http://localhost:3000/health

# Expected response
{
  "status": "healthy",
  "sidecar": "connected",
  "circuit": "closed"
}

Sidecar Health

# Sidecar health
curl -s http://localhost:8787/health

# Expected response
{
  "status": "healthy",
  "policy_version": "v1.2.3",
  "last_sync": "2026-02-20T12:00:00Z"
}

End-to-End Check

# Test authorization flow
curl -X POST http://localhost:8787/v1/authorize \
  -H "Content-Type: application/json" \
  -d '{
    "principal": "test:health-check",
    "action": "health.check",
    "resource": "system"
  }'

# Expected: allow decision for health check action
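
The provider and sidecar checks above can be combined into one pass that reports every failing endpoint. This is a minimal sketch: the helper names fetch and check_endpoints and the 5-second timeout are assumptions, and success here means only that the endpoint answered with an HTTP status below 400.

```shell
# Fetch a URL, failing on connection errors and HTTP status >= 400.
fetch() {
  curl -fsS --max-time 5 "$1"
}

# Check each URL in turn, report the result for each one,
# and return non-zero if any check failed.
check_endpoints() {
  failed=0
  for url in "$@"; do
    if fetch "$url" >/dev/null 2>&1; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example, using the provider and sidecar health endpoints from this section:
# check_endpoints http://localhost:3000/health http://localhost:8787/health
```

The non-zero return code makes the function usable from cron jobs or deployment scripts that should stop when a component is down.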

Monitoring Checklist

Daily

Weekly

Monthly


Contact Information

Role                   Contact
On-call engineer       PagerDuty: predicate-oncall
Platform team          Slack: #predicate-platform
Security team          Slack: #security-incidents
Control plane status   https://status.predicatesystems.ai

Appendix

Useful Commands

# View real-time logs
journalctl -u predicate-authorityd -f

# Check process status
systemctl status predicate-authorityd

# View metrics
curl -s http://localhost:8787/metrics

# Force policy sync
curl -X POST http://localhost:8787/v1/sync/trigger

# Get current policy version
curl -s http://localhost:8787/v1/policy/version

Log Locations

Component      Log Path
Provider       /var/log/openclaw-provider/provider.log
Sidecar        /var/log/predicate-authorityd/sidecar.log
Audit events   /var/log/predicate-authorityd/audit.jsonl

Configuration Files

Component   Config Path
Provider    /etc/openclaw-provider/config.yaml
Sidecar     /etc/predicate-authorityd/config.yaml
Policy      Managed via control plane