Operational Runbook
Step-by-step procedures for operating and troubleshooting the OpenClaw Predicate-Claw Plugin in production environments.
Keep this runbook handy for incident response and routine operations.
Table of Contents
- Quick Reference
- Prerequisites
- Incident Response Procedures
- Routine Operations
- Health Checks
- Monitoring Checklist
- Contact Information
- Appendix
Quick Reference
| Incident Type | Severity | First Response |
|---|---|---|
| Circuit breaker open | P1 | Check sidecar health |
| Elevated deny rate | P2 | Compare to policy changes |
| High latency | P3 | Check sidecar resources |
| Audit export failures | P4 | Check control plane connectivity |
Prerequisites
Before using this runbook, ensure you have:
- Access to provider logs and metrics dashboards
- Access to sidecar logs (predicate-authorityd)
- Ability to restart provider/sidecar processes
- Contact information for on-call escalation
Incident Response Procedures
P1: Circuit Breaker Stuck Open
Symptoms:
- All authorization requests failing immediately
- CircuitOpenError in provider logs
- Metrics showing predicate_circuit_state = open
Diagnosis Steps:
- Check sidecar health
  curl -s http://localhost:8787/health | jq .
  Expected: {"status": "healthy"}
- Check sidecar logs for errors
  journalctl -u predicate-authorityd -n 100 --no-pager
  # or
  docker logs predicate-authorityd --tail 100
- Verify network connectivity
  # Requires a curl-format.txt write-out template in the current directory
  curl -w "@curl-format.txt" -s -o /dev/null http://localhost:8787/health
- Check control plane sync status
  curl -s http://localhost:8787/v1/sync/status | jq .
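The sidecar health check in the first diagnosis step can be turned into a pass/fail exit code, which is easier to act on during an incident than reading JSON by eye. The following is a minimal sketch, not part of the plugin: the helper names json_field_is and sidecar_is_healthy are hypothetical, and the pattern match assumes the health endpoint returns a flat JSON object like the expected response shown in this runbook.

```shell
# json_field_is JSON KEY VALUE
# Succeeds when the JSON text contains "KEY": "VALUE".
# Hypothetical helper: a pattern match, not a full JSON parser.
json_field_is() {
  printf '%s' "$1" | grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*\"$3\""
}

# sidecar_is_healthy [URL]
# Fetches the health endpoint (default: the sidecar URL used in this runbook)
# and succeeds only when the request works and status is "healthy".
sidecar_is_healthy() {
  body=$(curl -s --max-time 5 "${1:-http://localhost:8787/health}") || return 1
  json_field_is "$body" status healthy
}
```

For example, sidecar_is_healthy || echo "sidecar unhealthy, see Resolution Steps" prints the message when the request fails or the status is anything other than healthy.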
Resolution Steps:
- If sidecar is unhealthy:
  # Restart sidecar
  systemctl restart predicate-authorityd
  # or
  docker restart predicate-authorityd
- If sidecar is healthy but circuit is still open:
  - Circuit will auto-recover after resetTimeoutMs (default: 30s)
  - For immediate recovery, restart the provider process
- If control plane sync is failing:
  - Check control plane endpoint accessibility
  - Verify API credentials are valid
  - Check for control plane service incidents
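Because the circuit is expected to recover on its own after resetTimeoutMs, it can help to poll for recovery before restarting the provider. The sketch below shows one way to do that. The helper names retry_until and circuit_is_closed are hypothetical, and the probe assumes the provider health endpoint and the "circuit": "closed" field shown under Health Checks in this runbook.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Succeeds as soon as COMMAND succeeds; fails if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# circuit_is_closed
# Assumption: the provider health response contains "circuit": "closed"
# once the breaker has recovered.
circuit_is_closed() {
  curl -s --max-time 5 http://localhost:3000/health |
    grep -Eq '"circuit"[[:space:]]*:[[:space:]]*"closed"'
}
```

With the default 30s reset timeout, retry_until 8 5 circuit_is_closed polls for about 35 seconds; if it fails, restarting the provider is the next step.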
Escalation:
- If not resolved in 5 minutes, page on-call engineer
- If sidecar restart doesn't help, escalate to platform team
P2: Elevated Deny Rate
Symptoms:
- Sudden increase in deny decisions (>2x baseline)
- User reports of blocked actions
- denied_by_policy reason code spike
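The ">2x baseline" symptom can be checked mechanically once a current and a baseline deny rate are available. The following is a minimal sketch; the function name deny_rate_elevated is hypothetical, and how the two rates are obtained (dashboard, metrics query, or audit log count) is left to the operator.

```shell
# deny_rate_elevated CURRENT BASELINE [FACTOR]
# Succeeds when CURRENT is strictly greater than FACTOR x BASELINE.
# FACTOR defaults to 2, matching the ">2x baseline" symptom.
# awk is used so that fractional rates (e.g. denies per second) work.
deny_rate_elevated() {
  awk -v cur="$1" -v base="$2" -v factor="${3:-2}" \
    'BEGIN { exit !(cur + 0 > factor * base) }'
}
```

For example, deny_rate_elevated 25 10 succeeds (25 is more than twice 10), while deny_rate_elevated 20 10 does not, because the comparison is strict.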
Diagnosis Steps:
- Check deny rate trend
  # Query recent deny events
  curl -s "http://localhost:8787/v1/audit/decisions?outcome=deny&limit=50" | jq .
- Compare to recent policy changes
  - Check control plane for recent policy deployments
  - Review policy version in metrics
- Identify affected actions/resources
  # Group denials by action
  curl -s "http://localhost:8787/v1/audit/decisions?outcome=deny" | \
    jq -r '.items | group_by(.action) | map({action: .[0].action, count: length})'
- Check for attack patterns
  - Look for repeated denials from same principal
  - Check for unusual resource patterns (path traversal, etc.)
Resolution Steps:
- If caused by policy change:
  - Rollback to previous policy version via control plane
  - Or fix policy and redeploy
- If attack attempt:
  - Document attack patterns
  - Consider adding rate limiting
  - Report to security team
- If false positives:
  - Review policy rules for overly broad denials
  - Add specific allow rules for legitimate use cases
Escalation:
- If attack suspected, notify security team immediately
- If policy rollback needed, coordinate with policy owners
P3: High Authorization Latency
Symptoms:
- p95 latency > 150ms
- Slow tool execution reported by users
- Timeout errors in logs
Diagnosis Steps:
- Check current latency percentiles
  curl -s http://localhost:8787/metrics | grep predicate_auth_latency
- Check sidecar resource usage
  # CPU and memory (-f is needed because the process name exceeds pgrep's 15-character match limit)
  top -p "$(pgrep -d, -f predicate-authorityd)"
  # or
  docker stats predicate-authorityd --no-stream
- Check control plane sync load
  curl -s http://localhost:8787/v1/sync/status | jq '.last_sync_duration_ms'
- Check concurrent request volume
  curl -s http://localhost:8787/metrics | grep predicate_auth_concurrent
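The grep commands above print raw metric lines; for alerting or scripted checks it is often more useful to extract the values and compare them to a threshold. The sketch below assumes the /metrics endpoint returns Prometheus-style text ("name{labels} value"), which this runbook does not confirm. The helper names metric_samples and any_sample_above are hypothetical.

```shell
# metric_samples NAME
# Reads Prometheus-style exposition text on stdin and prints "series value"
# for every sample whose metric name equals NAME exactly.
# Assumption: label values contain no spaces.
metric_samples() {
  awk -v name="$1" '
    /^#/ || NF < 2 { next }       # skip HELP/TYPE comments and blank lines
    {
      base = $1
      sub(/[{].*/, "", base)      # drop the {label="..."} suffix
      if (base == name) print $1, $2
    }'
}

# any_sample_above THRESHOLD
# Reads "series value" lines on stdin; succeeds if any value exceeds THRESHOLD.
any_sample_above() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { found = 1 } END { exit !found }'
}
```

A possible use, where predicate_auth_latency_ms is a placeholder metric name and the unit is assumed to be milliseconds:

curl -s http://localhost:8787/metrics | metric_samples predicate_auth_latency_ms | any_sample_above 150 && echo "latency above 150ms"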
Resolution Steps:
- If sidecar CPU is high:
  - Check for runaway policy evaluation
  - Consider scaling sidecar resources
  - Review policy complexity
- If sync is slow:
  - Check control plane latency
  - Consider increasing sync interval
  - Review policy size
- If high concurrent load:
  - Consider horizontal scaling
  - Review request batching options
  - Check for retry storms
Escalation:
- If resources are maxed, request capacity increase
- If policy is too complex, work with policy team to optimize
P4: Audit Export Failures
Symptoms:
- Missing audit events in control plane
- audit_export_failure in logs
- Non-zero predicate_audit_failures counter
Diagnosis Steps:
- Check export error logs
  # Provider log path as listed under Log Locations
  grep -Ei "audit.*(error|failure)" /var/log/openclaw-provider/provider.log | tail -20
- Verify control plane connectivity
  curl -s https://control-plane.example.com/health
- Check export queue depth
  curl -s http://localhost:8787/metrics | grep predicate_audit_queue
Resolution Steps:
- If control plane unreachable:
  - Check network/firewall rules
  - Verify TLS certificates
  - Check for control plane incidents
- If queue is backed up:
  - Audit export is best-effort; auth continues working
  - Events will retry automatically
  - Check disk space for local buffer
- If credentials expired:
  - Rotate API credentials
  - Update provider configuration
  - Restart provider
Escalation:
- Audit failures are P4 (non-blocking)
- Escalate only if prolonged (>1 hour) or compliance-critical
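The "prolonged (>1 hour)" rule can be expressed as a small check on the time of the first observed export failure. This is a minimal sketch; the function name audit_outage_needs_escalation is hypothetical, and the operator is assumed to supply the first-failure time as a Unix epoch (for example, from the earliest audit_export_failure log entry).

```shell
# audit_outage_needs_escalation FIRST_FAILURE_EPOCH [NOW_EPOCH] [LIMIT_SECONDS]
# Succeeds when the outage has lasted strictly longer than LIMIT_SECONDS.
# NOW_EPOCH defaults to the current time; LIMIT_SECONDS defaults to 3600 (1 hour).
audit_outage_needs_escalation() {
  first=$1
  now=${2:-$(date +%s)}
  limit=${3:-3600}
  [ $((now - first)) -gt "$limit" ]
}
```

On systems with GNU date, the first-failure time can be derived from a timestamp, for example audit_outage_needs_escalation "$(date -d '2026-02-20T12:00:00Z' +%s)". Compliance-critical outages should still be escalated regardless of duration.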
Routine Operations
Restarting the Provider
# Graceful restart (allows in-flight requests to complete)
systemctl reload openclaw-provider
# Full restart
systemctl restart openclaw-provider
Rotating Credentials
- Generate new credentials in control plane
- Update provider configuration
- Restart provider
- Verify connectivity
- Revoke old credentials
Updating Policy
- Deploy new policy to control plane
- Monitor sync status on sidecars
- Watch deny rate for anomalies
- Rollback if issues detected
Scaling Sidecars
For high-load environments:
- Deploy additional sidecar instances
- Configure load balancer
- Update provider baseUrl to load balancer
- Verify even distribution
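One way to verify even distribution is to compare per-instance request counts against their mean. The sketch below is illustrative only: the function name load_is_even is hypothetical, and gathering one request count per sidecar instance (for example, from each instance's metrics endpoint) is left to the operator.

```shell
# load_is_even TOLERANCE_PERCENT
# Reads one request count per line on stdin, one line per sidecar instance.
# Succeeds when every count is within TOLERANCE_PERCENT of the mean.
# Fails on empty input.
load_is_even() {
  awk -v tol="$1" '
    NF { n++; sum += $1; v[n] = $1 }
    END {
      if (n == 0) exit 1
      mean = sum / n
      for (i = 1; i <= n; i++) {
        d = v[i] - mean
        if (d < 0) d = -d
        # Compare |count - mean| to tol% of the mean without dividing.
        if (d * 100 > tol * mean) exit 1
      }
      exit 0
    }'
}
```

For example, printf '100\n110\n90\n' | load_is_even 15 succeeds, while printf '100\n200\n' | load_is_even 15 fails because both instances are more than 15% away from the mean of 150.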
Health Checks
Provider Health
# Local provider health
curl -s http://localhost:3000/health
# Expected response
{
"status": "healthy",
"sidecar": "connected",
"circuit": "closed"
}
Sidecar Health
# Sidecar health
curl -s http://localhost:8787/health
# Expected response
{
"status": "healthy",
"policy_version": "v1.2.3",
"last_sync": "2026-02-20T12:00:00Z"
}
End-to-End Check
# Test authorization flow
curl -X POST http://localhost:8787/v1/authorize \
-H "Content-Type: application/json" \
-d '{
"principal": "test:health-check",
"action": "health.check",
"resource": "system"
}'
# Expected: allow decision for health check action
Monitoring Checklist
Daily
- Review deny rate trends
- Check circuit breaker state
- Verify audit export completeness
Weekly
- Review latency percentiles
- Check policy sync freshness
- Audit access logs
Monthly
- Review and update SLO thresholds
- Test incident response procedures
- Update runbook with learnings
Contact Information
| Role | Contact |
|---|---|
| On-call engineer | PagerDuty: predicate-oncall |
| Platform team | Slack: #predicate-platform |
| Security team | Slack: #security-incidents |
| Control plane status | https://status.predicatesystems.ai |
Appendix
Useful Commands
# View real-time logs
journalctl -u predicate-authorityd -f
# Check process status
systemctl status predicate-authorityd
# View metrics
curl -s http://localhost:8787/metrics
# Force policy sync
curl -X POST http://localhost:8787/v1/sync/trigger
# Get current policy version
curl -s http://localhost:8787/v1/policy/version
Log Locations
| Component | Log Path |
|---|---|
| Provider | /var/log/openclaw-provider/provider.log |
| Sidecar | /var/log/predicate-authorityd/sidecar.log |
| Audit events | /var/log/predicate-authorityd/audit.jsonl |
Configuration Files
| Component | Config Path |
|---|---|
| Provider | /etc/openclaw-provider/config.yaml |
| Sidecar | /etc/predicate-authorityd/config.yaml |
| Policy | Managed via control plane |