The heartbeat.md file is the single source of truth for how your OpenClaw agent monitors its own health. This guide walks you through every section of the file with production-ready examples you can copy and adapt.
Every OpenClaw agent can run an autonomous health loop. The heartbeat.md file tells the agent how often to check itself, what thresholds signal trouble, where to send alerts, and what recovery actions to take when something goes wrong.
Without a heartbeat.md, your agent either uses global defaults (which may not fit your use case) or skips health checks entirely. For production agents, a well-configured heartbeat.md is non-negotiable. For broader context on heartbeat loops, see the heartbeat patterns guide.
Place the file at ~/.openclaw/skills/<your-skill>/heartbeat.md. OpenClaw reads it on agent boot and on every cron cycle.
heartbeat.md uses a YAML front-matter block followed by optional markdown notes. Here is the minimal skeleton:
---
interval: 5m
thresholds:
  wellness_score_min: 60
  error_rate_max: 0.05
alerts:
  - type: webhook
    url: https://hooks.example.com/heartbeat
recovery:
  - action: restart
---
# Notes
This agent monitors e-commerce inventory.
Escalate to on-call if wellness drops below 40.

The YAML block is required. The markdown body below the closing --- is optional and useful for human-readable context that the agent can reference during self-assessment.
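To make the two-part layout concrete, here is a minimal sketch of separating the YAML block from the markdown notes. It is illustrative only: the `split_front_matter` helper is hypothetical and is not how OpenClaw itself reads the file. It assumes the file starts with a `---` line and that the next `---` line closes the YAML block.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (yaml_block, markdown_body) for a heartbeat.md-style document."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file must start with a '---' line")
    # The first '---' after the opening line closes the YAML block.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_block = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return yaml_block, body
    raise ValueError("missing closing '---' line")


sample = """---
interval: 5m
---
# Notes
This agent monitors e-commerce inventory.
"""
yaml_block, body = split_front_matter(sample)
print(yaml_block)
print(body)
```

The body comes back as an empty string when the notes are omitted, which matches the rule that only the YAML block is required.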
The interval field controls how frequently the agent runs its health check loop. Supported units: s (seconds), m (minutes), h (hours).
# Critical agent (financial, safety)
interval: 1m

# Standard production agent
interval: 5m

# Low-priority background task
interval: 15m
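Tradeoffs: a shorter interval catches problems sooner but runs more checks and spends more tokens; a longer interval is cheaper but leaves failures undetected for longer.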
For a deeper comparison of cadence strategies, see heartbeat cadence in the glossary.
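The interval values above can be turned into seconds with a few lines of code, which also makes the cost of each cadence easy to compare. This is a sketch under assumptions: `parse_interval` and `checks_per_day` are hypothetical helpers, and it accepts only a whole number followed by one of the three documented units (`s`, `m`, `h`).

```python
import re

# Seconds per documented unit.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(value: str) -> int:
    """Convert a string such as '5m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    amount, unit = match.groups()
    seconds = int(amount) * _UNIT_SECONDS[unit]
    if seconds == 0:
        raise ValueError("interval must be greater than zero")
    return seconds


def checks_per_day(value: str) -> float:
    """How many health checks a given interval produces in 24 hours."""
    return 86400 / parse_interval(value)


for interval in ("1m", "5m", "15m"):
    print(interval, parse_interval(interval), checks_per_day(interval))
```

A 1m interval runs fifteen times as many checks per day as a 15m interval, which is the cost side of the tradeoff.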
Thresholds define the boundaries between healthy and degraded states. When any threshold is breached, the agent triggers its configured alerts and recovery actions.
thresholds:
  wellness_score_min: 60         # Trigger if score drops below 60/100
  error_rate_max: 0.05           # Trigger if >5% of recent calls fail
  consecutive_failures: 3        # Trigger after 3 failures in a row
  memory_usage_max_mb: 512       # Trigger if agent memory exceeds 512 MB
  response_latency_max_ms: 5000  # Trigger if avg response > 5 seconds
You do not need all thresholds. Start with wellness_score_min and error_rate_max, then add more as you learn what matters for your agent. The wellness score is computed by the burnout detection module and factors in error rate, retry count, session age, and task completion rate.
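As a rough illustration of how thresholds map to breaches, the sketch below compares a snapshot of metrics against a thresholds mapping. The metric names (`wellness_score`, `memory_usage_mb`, and so on) and the `breached_thresholds` helper are assumptions made for this example; the comparisons follow the comments in the YAML above, not OpenClaw's actual implementation.

```python
import operator

# threshold key -> (assumed metric name, comparison that signals a breach)
RULES = {
    "wellness_score_min": ("wellness_score", operator.lt),
    "error_rate_max": ("error_rate", operator.gt),
    "consecutive_failures": ("consecutive_failures", operator.ge),
    "memory_usage_max_mb": ("memory_usage_mb", operator.gt),
    "response_latency_max_ms": ("response_latency_ms", operator.gt),
}


def breached_thresholds(thresholds: dict, metrics: dict) -> list[str]:
    """Return the configured threshold keys that the metrics breach."""
    breaches = []
    for key, limit in thresholds.items():
        metric, is_breach = RULES[key]
        # Thresholds without a matching metric are skipped, not failed.
        if metric in metrics and is_breach(metrics[metric], limit):
            breaches.append(key)
    return breaches


thresholds = {"wellness_score_min": 60, "error_rate_max": 0.05}
metrics = {"wellness_score": 58, "error_rate": 0.02}
print(breached_thresholds(thresholds, metrics))
```

Only the thresholds you configure are evaluated, which mirrors the advice to start with two and add more later.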
When a threshold is breached, OpenClaw sends alerts to every channel listed. You can configure multiple channels for redundancy.
alerts:
  - type: webhook
    url: https://hooks.example.com/openclaw-heartbeat
    headers:
      Authorization: "Bearer ${HEARTBEAT_TOKEN}"
    payload_template: |
      {"agent": "{{agent_id}}", "score": {{wellness_score}}, "event": "{{event}}"}

alerts:
  - type: slack
    webhook_url: https://hooks.slack.com/services/T00/B00/xxxx
    channel: "#agent-alerts"
    mention: "@oncall"

alerts:
  - type: email
    to: ops@example.com
    subject_template: "[OpenClaw] Agent {{agent_id}} health alert"
    smtp_config: default  # uses global SMTP settings

All alert types support template variables: {{agent_id}}, {{wellness_score}}, {{error_rate}}, {{event}}, and {{timestamp}}.
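To see what the template variables do, here is a small sketch that fills `{{name}}` placeholders from a dictionary. The `render` function and the sample values are hypothetical; OpenClaw's own substitution rules (escaping, unknown variables) may differ.

```python
import json
import re


def render(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with the matching value."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown template variable: {name}")
        return str(values[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


payload_template = (
    '{"agent": "{{agent_id}}", "score": {{wellness_score}}, "event": "{{event}}"}'
)
# Sample values invented for this example.
rendered = render(
    payload_template,
    {"agent_id": "inv-1", "wellness_score": 42, "event": "threshold_breach"},
)
print(rendered)
print(json.loads(rendered))  # confirms the rendered payload is valid JSON
```

Note that this sketch inserts values verbatim, so a value containing a double quote would break the JSON; a real renderer would need to escape string values.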
Recovery actions run in order when a threshold breach is detected. The agent attempts each action sequentially and stops if one succeeds.
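The run-in-order, stop-on-first-success behavior can be sketched as a simple loop. Everything here is a stand-in: `run_recovery` and `fake_step` are hypothetical names, each action is modeled as a callable that returns True on success, and retries and delays are left out for brevity.

```python
from typing import Callable, Optional

# Each step pairs an action name with a callable that returns True on success.
RecoveryStep = tuple[str, Callable[[], bool]]


def run_recovery(steps: list[RecoveryStep]) -> Optional[str]:
    """Try each step in order; return the name of the first that succeeds."""
    for name, attempt in steps:
        if attempt():
            return name  # stop here: later steps are not attempted
    return None  # every step failed


attempted = []


def fake_step(name: str, succeeds: bool) -> RecoveryStep:
    """Build a stand-in action that records that it was attempted."""
    def attempt() -> bool:
        attempted.append(name)
        return succeeds
    return name, attempt


steps = [
    fake_step("restart", False),
    fake_step("session_reset", True),
    fake_step("escalate", True),
]
winner = run_recovery(steps)
print(winner, attempted)
```

Because the loop stops at the first success, the order of the list matters: cheap, automatic actions go first and human escalation goes last.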
recovery:
  - action: restart
    delay: 10s
    max_retries: 2
  - action: session_reset
    preserve_context: true
  - action: escalate
    target: human
    channel: slack
    message: "Agent {{agent_id}} failed recovery. Manual intervention needed."

Available recovery actions:
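- restart: restarts the agent. Set delay to avoid restart loops.
- session_reset: resets the agent's session. Set preserve_context: true to retain long-term memory.
- escalate: hands the incident to a human through the configured channel.

Instead of a fixed interval, you can configure the heartbeat to speed up when the agent is struggling and slow down when it is stable. This saves tokens during quiet periods and catches problems faster during incidents.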
interval: 5m
adaptive_cadence:
  enabled: true
  rules:
    - when: wellness_score < 50
      interval: 1m
    - when: wellness_score < 70
      interval: 3m
    - when: error_rate > 0.10
      interval: 1m
    - when: consecutive_failures > 0
      interval: 2m
  cooldown: 10m  # Stay at fast cadence for at least 10min after trigger

The base interval applies when no adaptive rules match. Rules are evaluated top-to-bottom; the first match wins. The cooldown prevents flapping between fast and slow intervals. For more on adaptive strategies, see the observability playbook.
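The first-match-wins evaluation and the cooldown can be sketched as follows. This is one plausible reading, not OpenClaw's implementation: the `AdaptiveCadence` class is hypothetical, intervals are given in seconds, rule conditions are limited to the `metric < value` and `metric > value` forms used above, and the cooldown is assumed to hold the most recently triggered interval until it expires.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}


class AdaptiveCadence:
    def __init__(self, base_s: int, rules: list[tuple[str, int]], cooldown_s: int):
        self.base_s = base_s
        self.rules = rules  # (condition, interval in seconds), in file order
        self.cooldown_s = cooldown_s
        self._held = None  # interval kept alive by the cooldown
        self._hold_until = 0.0

    def _first_match(self, metrics: dict):
        for condition, interval_s in self.rules:
            metric, op, limit = condition.split()
            if OPS[op](metrics[metric], float(limit)):
                return interval_s  # first match wins; later rules are ignored
        return None

    def next_interval(self, metrics: dict, now: float) -> int:
        matched = self._first_match(metrics)
        if matched is not None:
            self._held, self._hold_until = matched, now + self.cooldown_s
            return matched
        if self._held is not None and now < self._hold_until:
            return self._held  # still cooling down: stay at the fast cadence
        self._held = None
        return self.base_s


cadence = AdaptiveCadence(
    base_s=300,
    rules=[
        ("wellness_score < 50", 60),
        ("wellness_score < 70", 180),
        ("error_rate > 0.10", 60),
        ("consecutive_failures > 0", 120),
    ],
    cooldown_s=600,
)
healthy = {"wellness_score": 90, "error_rate": 0.0, "consecutive_failures": 0}
struggling = dict(healthy, wellness_score=45)
print(cadence.next_interval(struggling, now=0))
print(cadence.next_interval(healthy, now=300))
print(cadence.next_interval(healthy, now=700))
```

One consequence of first-match-wins is worth checking against your own rule order: with the rules above, a wellness score of 65 combined with a 20% error rate selects the 3m rule, because it appears before the error-rate rule.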
Here is a complete, production-ready heartbeat.md for an e-commerce inventory agent:
---
interval: 5m
adaptive_cadence:
  enabled: true
  rules:
    - when: wellness_score < 50
      interval: 1m
    - when: error_rate > 0.08
      interval: 2m
  cooldown: 15m
thresholds:
  wellness_score_min: 55
  error_rate_max: 0.05
  consecutive_failures: 3
  response_latency_max_ms: 4000
alerts:
  - type: slack
    webhook_url: https://hooks.slack.com/services/T00/B00/xxxx
    channel: "#inventory-agent"
    mention: "@oncall"
  - type: webhook
    url: https://monitor.example.com/api/heartbeat
    headers:
      Authorization: "Bearer ${MONITOR_TOKEN}"
recovery:
  - action: restart
    delay: 15s
    max_retries: 2
  - action: session_reset
    preserve_context: true
  - action: escalate
    target: human
    channel: slack
    message: "Inventory agent {{agent_id}} needs manual recovery."
metadata:
  owner: inventory-team
  environment: production
  tags:
    - ecommerce
    - critical
---
# Inventory Agent Heartbeat
This agent manages real-time stock levels across 3 warehouses.
During flash sales, adaptive cadence will increase check frequency.
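Escalation goes to the inventory team Slack channel.

If the heartbeat does not behave as expected, work through these checks:
- The file is named heartbeat.md (not heartbeat.yaml or heartbeat.json).
- The file lives at ~/.openclaw/skills/<skill-name>/heartbeat.md.
- Run openclaw status <skill-name> to confirm the agent is loaded.
- The YAML block opens and closes with --- on its own line.
- Run openclaw validate heartbeat.md before deploying.
- Test the alert webhook URL manually with curl.
- Environment variables (such as $HEARTBEAT_TOKEN) are set in the agent runtime.
- Inspect heartbeat activity with openclaw logs <skill-name> --filter heartbeat.
- Set max_retries on the restart action.
- Use a delay of at least 10 seconds between restart attempts.
- End the recovery chain with escalate so a human gets notified when auto-recovery fails.

For a comprehensive troubleshooting reference, see the OpenClaw best practices guide.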
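The environment-variable check is easy to automate before deploying. The sketch below scans a config for `${NAME}` references, the form used in the Authorization headers above, and reports any that are unset or empty. The `missing_env_vars` helper is hypothetical and is not part of the OpenClaw CLI.

```python
import os
import re


def missing_env_vars(config_text: str, env=None) -> list[str]:
    """Return the ${NAME} references in the config that are unset or empty."""
    env = os.environ if env is None else env
    names = re.findall(r"\$\{(\w+)\}", config_text)
    return sorted({name for name in names if not env.get(name)})


config = '''
alerts:
  - type: webhook
    headers:
      Authorization: "Bearer ${HEARTBEAT_TOKEN}"
'''
# Pass an explicit mapping here; omit it to check the real environment.
print(missing_env_vars(config, env={}))
print(missing_env_vars(config, env={"HEARTBEAT_TOKEN": "secret"}))
```

Run a check like this in the same environment the agent runtime uses, since a variable set in your shell is not necessarily set for the agent.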