This guide shows how to monitor OpenClaw agents end to end: heartbeat health, memory reliability, incident loops, and operator dashboards you can run every day.
Revision 2.
| KPI | Why it matters | Target |
|---|---|---|
| Heartbeat success rate | Shows whether recurring supervision is running as scheduled. | > 99% |
| Incident closure rate (24h) | Measures how many incidents are fully resolved within 24 hours, not just detected. | > 90% |
| Mean time to recovery (MTTR) | Primary speed metric for operational resilience. | < 15 min |
| Repeat incident rate (7d) | Signals whether fixes are durable or temporary. | < 10% |
| Memory recall hit rate | Confirms that saved context is reused in later cycles. | > 85% |
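
The KPIs in the table can be computed from plain incident and heartbeat records. The following is a minimal sketch, not an OpenClaw API: the `Incident` record, its `key` fingerprint field, and the exact definitions used (for example, counting an incident as a repeat when the same `key` reappears within 7 days) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    key: str                                  # fingerprint of the failure type
    opened_at: datetime
    resolved_at: Optional[datetime] = None    # None while still open


def heartbeat_success_rate(results: list[bool]) -> float:
    """Fraction of heartbeat runs that succeeded."""
    return sum(results) / len(results) if results else 0.0


def closure_rate_24h(incidents: list[Incident]) -> float:
    """Fraction of incidents resolved within 24 hours of opening."""
    if not incidents:
        return 1.0
    closed = sum(
        1 for i in incidents
        if i.resolved_at is not None
        and i.resolved_at - i.opened_at <= timedelta(hours=24)
    )
    return closed / len(incidents)


def mttr_minutes(incidents: list[Incident]) -> Optional[float]:
    """Mean minutes from open to resolution, over resolved incidents only."""
    durations = [
        (i.resolved_at - i.opened_at).total_seconds() / 60
        for i in incidents
        if i.resolved_at is not None
    ]
    return sum(durations) / len(durations) if durations else None


def repeat_incident_rate_7d(incidents: list[Incident]) -> float:
    """Fraction of incidents whose key was already seen in the prior 7 days."""
    ordered = sorted(incidents, key=lambda i: i.opened_at)
    last_seen: dict[str, datetime] = {}
    repeats = 0
    for inc in ordered:
        previous = last_seen.get(inc.key)
        if previous is not None and inc.opened_at - previous <= timedelta(days=7):
            repeats += 1
        last_seen[inc.key] = inc.opened_at
    return repeats / len(ordered) if ordered else 0.0


def memory_recall_hit_rate(hits: int, attempts: int) -> float:
    """Fraction of recall attempts that returned previously saved context."""
    return hits / attempts if attempts else 0.0
```

Each function returns a ratio (or minutes for MTTR), so a dashboard can compare the result directly against the target column, for example `heartbeat_success_rate(results) > 0.99`.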
How often should I refresh this runbook? Weekly for active projects, and immediately after any significant cluster of incidents.
Can this run without human intervention? Yes, but keep approval checkpoints for risk-sensitive actions and strategy changes.
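
One way to picture an approval checkpoint is a small gate in front of action execution. This is a sketch under stated assumptions: the action names in `RISK_SENSITIVE_ACTIONS` and the `execute` and `approve` callbacks are hypothetical and stand in for whatever your agent runtime and approval channel provide.

```python
from typing import Callable

# Hypothetical action names; replace with the actions your agents can take.
RISK_SENSITIVE_ACTIONS = frozenset(
    {"change_strategy", "purge_memory", "rotate_credentials"}
)


def run_with_checkpoint(
    action: str,
    execute: Callable[[str], None],
    approve: Callable[[str], bool],
) -> str:
    """Run routine actions directly; ask a human before risk-sensitive ones."""
    if action in RISK_SENSITIVE_ACTIONS and not approve(action):
        return "blocked"
    execute(action)
    return "executed"
```

Routine actions never call `approve`, so the loop stays autonomous for everyday work and only pauses where a human decision matters.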
What should I alert on first: every error or only critical ones? Start with critical-path failures (gateway down, memory failure, queue stuck), then add lower-severity alerts gradually.
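
The "critical first, then widen" approach can be sketched as a severity threshold that starts at critical and is lowered over time. The event names and the `Severity` levels below are assumptions for illustration, not identifiers from OpenClaw.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical event names for the critical-path failures listed above.
CRITICAL_PATH_EVENTS = frozenset({"gateway_down", "memory_failure", "queue_stuck"})


def classify(event: str) -> Severity:
    """Treat critical-path failures as CRITICAL and everything else as WARNING."""
    return Severity.CRITICAL if event in CRITICAL_PATH_EVENTS else Severity.WARNING


def should_alert(event: str, threshold: Severity = Severity.CRITICAL) -> bool:
    """Alert only when the event's severity meets the current threshold."""
    return classify(event) >= threshold
```

With the default threshold only critical-path events page an operator; passing `threshold=Severity.WARNING` later widens alerting without changing the classification logic.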