What is heartbeat.md in OpenClaw?

heartbeat.md is the configuration file that controls how OpenClaw agents perform health checks. It defines the check interval, health thresholds, alert channels, and recovery actions for autonomous agents.

Where do I place the heartbeat.md file?

Place heartbeat.md in your skill root directory, typically at ~/.openclaw/skills/<your-skill>/heartbeat.md. OpenClaw reads it automatically when the agent starts its health loop.

What heartbeat interval should I use for production agents?

For most production agents, a 5-minute interval balances responsiveness with resource efficiency. Use 1-minute intervals only for critical financial or safety agents, and 15-minute intervals for low-priority background tasks.

How to Configure heartbeat.md for OpenClaw Agents

The heartbeat.md file is the single source of truth for how your OpenClaw agent monitors its own health. This guide walks you through every section of the file with production-ready examples you can copy and adapt.

1. What is heartbeat.md?

Every OpenClaw agent can run an autonomous health loop. The heartbeat.md file tells the agent how often to check itself, what thresholds signal trouble, where to send alerts, and what recovery actions to take when something goes wrong.

Without a heartbeat.md, your agent either uses global defaults (which may not fit your use case) or skips health checks entirely. For production agents, a well-configured heartbeat.md is non-negotiable. For broader context on heartbeat loops, see the heartbeat patterns guide.

Place the file at ~/.openclaw/skills/<your-skill>/heartbeat.md. OpenClaw reads it on agent boot and on every cron cycle.

2. Basic heartbeat.md structure

heartbeat.md uses a YAML front-matter block followed by optional markdown notes. Here is the minimal skeleton:

---
interval: 5m
thresholds:
  wellness_score_min: 60
  error_rate_max: 0.05
alerts:
  - type: webhook
    url: https://hooks.example.com/heartbeat
recovery:
  - action: restart
---

# Notes
This agent monitors e-commerce inventory.
Escalate to on-call if wellness drops below 40.

The YAML block is required. The markdown body below the closing --- is optional and useful for human-readable context that the agent can reference during self-assessment.
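
To make the two-part layout concrete, here is a minimal Python sketch that separates the front-matter block from the markdown notes. It is not how OpenClaw itself loads the file; the function name split_front_matter is hypothetical, and the YAML text it returns would still need a YAML parser of your choice.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Split a heartbeat.md document into (yaml_block, markdown_body)."""
    lines = text.splitlines()
    # The opening delimiter must be the very first line.
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' delimiter")
    # Scan for the closing delimiter after the opening one.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_block = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return yaml_block, body
    raise ValueError("missing closing '---' delimiter")


doc = """---
interval: 5m
recovery:
  - action: restart
---

# Notes
This agent monitors e-commerce inventory.
"""

yaml_block, notes = split_front_matter(doc)
print(yaml_block)  # the required configuration block
print(notes)       # the optional human-readable notes
```

A file with no notes yields an empty body, which matches the rule that only the YAML block is required.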

3. Setting the heartbeat interval

The interval field controls how frequently the agent runs its health check loop. Supported units: s (seconds), m (minutes), h (hours).

# Critical agent (financial, safety)
interval: 1m

# Standard production agent
interval: 5m

# Low-priority background task
interval: 15m

Tradeoffs:

1-minute interval: Catches failures fast but runs five times as many checks as the 5-minute setting, so it uses more LLM tokens and API calls. Best for agents where downtime costs money.
5-minute interval: The sweet spot for most production agents. Fast enough to catch drift, cheap enough to run indefinitely.
15-minute interval: Low overhead, suitable for batch jobs or agents that tolerate delayed recovery. Not recommended for user-facing agents.
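
To put numbers on these tradeoffs, the following Python sketch converts an interval string into seconds and counts how many health checks it implies per day. The helper names parse_interval and checks_per_day are hypothetical, and the sketch assumes a single whole number followed by one unit (s, m, or h), as in the examples above.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(value: str) -> int:
    """Convert a string such as '5m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    amount, unit = match.groups()
    seconds = int(amount) * UNIT_SECONDS[unit]
    if seconds == 0:
        raise ValueError("interval must be greater than zero")
    return seconds


def checks_per_day(value: str) -> float:
    """Number of health checks a fixed interval produces in 24 hours."""
    return 86400 / parse_interval(value)


for interval in ("1m", "5m", "15m"):
    print(interval, parse_interval(interval), checks_per_day(interval))
```

A 1-minute interval works out to 1440 checks per day, a 5-minute interval to 288, and a 15-minute interval to 96.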

For a deeper comparison of cadence strategies, see heartbeat cadence in the glossary.

4. Configuring health thresholds

Thresholds define the boundaries between healthy and degraded states. When any threshold is breached, the agent triggers its configured alerts and recovery actions.

thresholds:
  wellness_score_min: 60      # Trigger if score drops below 60/100
  error_rate_max: 0.05        # Trigger if >5% of recent calls fail
  consecutive_failures: 3     # Trigger after 3 failures in a row
  memory_usage_max_mb: 512    # Trigger if agent memory exceeds 512 MB
  response_latency_max_ms: 5000  # Trigger if avg response > 5 seconds

You do not need all thresholds. Start with wellness_score_min and error_rate_max, then add more as you learn what matters for your agent. The wellness score is computed by the burnout detection module and factors in error rate, retry count, session age, and task completion rate.
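
The breach logic described in the comments above can be sketched in Python as follows. This is an illustration under assumptions, not OpenClaw's implementation: the metric names (wellness_score, memory_usage_mb, and so on) and the function breached_thresholds are hypothetical, and consecutive_failures is assumed to trigger once the count reaches the configured value.

```python
def breached_thresholds(thresholds: dict, metrics: dict) -> list[str]:
    """Return the configured threshold keys that the current metrics breach."""
    # (threshold key, assumed metric key, predicate that is True on breach)
    checks = [
        ("wellness_score_min", "wellness_score", lambda v, limit: v < limit),
        ("error_rate_max", "error_rate", lambda v, limit: v > limit),
        ("consecutive_failures", "consecutive_failures", lambda v, limit: v >= limit),
        ("memory_usage_max_mb", "memory_usage_mb", lambda v, limit: v > limit),
        ("response_latency_max_ms", "response_latency_ms", lambda v, limit: v > limit),
    ]
    breached = []
    for threshold_key, metric_key, is_breach in checks:
        # Skip thresholds that are not configured or metrics that were not reported.
        if threshold_key not in thresholds or metric_key not in metrics:
            continue
        if is_breach(metrics[metric_key], thresholds[threshold_key]):
            breached.append(threshold_key)
    return breached


thresholds = {"wellness_score_min": 60, "error_rate_max": 0.05, "consecutive_failures": 3}
metrics = {"wellness_score": 58, "error_rate": 0.02, "consecutive_failures": 3}
print(breached_thresholds(thresholds, metrics))
```

Because unconfigured thresholds are skipped, the sketch also reflects the advice to start with two thresholds and add more later.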

5. Alert channels

When a threshold is breached, OpenClaw sends alerts to every channel listed. You can configure multiple channels for redundancy.

Webhook

alerts:
  - type: webhook
    url: https://hooks.example.com/openclaw-heartbeat
    headers:
      Authorization: "Bearer ${HEARTBEAT_TOKEN}"
    payload_template: |
      {"agent": "{{agent_id}}", "score": {{wellness_score}}, "event": "{{event}}"}

Slack

alerts:
  - type: slack
    webhook_url: https://hooks.slack.com/services/T00/B00/xxxx
    channel: "#agent-alerts"
    mention: "@oncall"

Email

alerts:
  - type: email
    to: ops@example.com
    subject_template: "[OpenClaw] Agent {{agent_id}} health alert"
    smtp_config: default  # uses global SMTP settings

All alert types support template variables: {{agent_id}}, {{wellness_score}}, {{error_rate}}, {{event}}, and {{timestamp}}.
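
As a rough illustration of how the {{...}} variables expand, here is a small Python sketch. The render_template function is hypothetical, and leaving unknown placeholders untouched is an assumption made for the sketch; OpenClaw's actual behavior for unknown variables is not specified here.

```python
import json
import re


def render_template(template: str, values: dict) -> str:
    """Replace {{name}} placeholders with values; unknown names stay as-is."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched so typos stay visible.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


values = {"agent_id": "inventory-1", "wellness_score": 42, "event": "threshold_breach"}

subject = render_template("[OpenClaw] Agent {{agent_id}} health alert", values)
payload = render_template(
    '{"agent": "{{agent_id}}", "score": {{wellness_score}}, "event": "{{event}}"}',
    values,
)

print(subject)
print(json.loads(payload))  # the rendered webhook payload is valid JSON
```

Note that {{wellness_score}} is unquoted in the webhook payload template, so it renders as a JSON number, while {{agent_id}} and {{event}} are quoted and render as strings.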

6. Recovery actions on failure

Recovery actions run in order when a threshold breach is detected. The agent attempts each action sequentially and stops if one succeeds.

recovery:
  - action: restart
    delay: 10s
    max_retries: 2

  - action: session_reset
    preserve_context: true

  - action: escalate
    target: human
    channel: slack
    message: "Agent {{agent_id}} failed recovery. Manual intervention needed."

Available recovery actions:

restart: Kills the current agent process and starts a fresh one. Use delay to avoid restart loops.
session_reset: Clears the current session state but keeps the agent running. Set preserve_context: true to retain long-term memory.
escalate: Sends a message to a human operator. This is the last resort and should always be the final action in the chain.
run_script: Executes a custom shell script. Use for domain-specific recovery (e.g., clearing a queue, rotating credentials).
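
The "run in order, stop when one succeeds" behavior can be sketched in Python like this. Several details are assumptions made for illustration: max_retries is treated as extra attempts after the first one, the delay is applied only between attempts, the delay is given as a plain number under the hypothetical key delay_seconds, and the handlers are stand-ins for real recovery logic.

```python
import time
from typing import Callable, Optional


def run_recovery_chain(
    steps: list,
    handlers: dict,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the name of the first action that succeeds, or None if all fail."""
    for step in steps:
        handler = handlers[step["action"]]
        # Assumption: one initial attempt plus max_retries further attempts.
        attempts = 1 + step.get("max_retries", 0)
        for attempt in range(attempts):
            if attempt > 0:
                # Wait between attempts to avoid a tight restart loop.
                sleep(step.get("delay_seconds", 0))
            if handler(step):
                return step["action"]
    return None


calls = []


def fake_restart(step: dict) -> bool:
    calls.append("restart")
    return False  # simulate a restart that never helps


def fake_session_reset(step: dict) -> bool:
    calls.append("session_reset")
    return True  # simulate a successful reset


def fake_escalate(step: dict) -> bool:
    calls.append("escalate")
    return True


steps = [
    {"action": "restart", "delay_seconds": 10, "max_retries": 2},
    {"action": "session_reset", "preserve_context": True},
    {"action": "escalate", "target": "human"},
]
handlers = {
    "restart": fake_restart,
    "session_reset": fake_session_reset,
    "escalate": fake_escalate,
}

# Pass a no-op sleep so the example finishes instantly.
winner = run_recovery_chain(steps, handlers, sleep=lambda seconds: None)
print(winner, calls)
```

In this run the restart is tried three times, the session reset succeeds, and escalate is never reached, which is why escalate belongs at the end of the chain.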

7. Advanced: Adaptive cadence

Instead of a fixed interval, you can configure the heartbeat to speed up when the agent is struggling and slow down when stable. This saves tokens during quiet periods and catches problems faster during incidents.

interval: 5m
adaptive_cadence:
  enabled: true
  rules:
    - when: wellness_score < 50
      interval: 1m
    - when: wellness_score < 70
      interval: 3m
    - when: error_rate > 0.10
      interval: 1m
    - when: consecutive_failures > 0
      interval: 2m
  cooldown: 10m  # Stay at fast cadence for at least 10min after trigger

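The base interval applies when no adaptive rules match. Rules are evaluated top-to-bottom; the first match wins, so list the most urgent conditions first. In the example above, an agent with a wellness score of 65 and an error rate of 0.15 matches the wellness_score < 70 rule first and runs at 3m, not 1m. The cooldown prevents flapping between fast and slow intervals. For more on adaptive strategies, see the observability playbook.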
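
The first-match-wins evaluation can be sketched in Python as follows. The functions condition_matches and pick_interval are hypothetical, the sketch only understands the "metric < number" and "metric > number" forms used in the example, and it deliberately leaves out the cooldown.

```python
import operator
import re

OPERATORS = {"<": operator.lt, ">": operator.gt}


def condition_matches(condition: str, metrics: dict) -> bool:
    """Evaluate a condition such as 'wellness_score < 50' against metrics."""
    match = re.fullmatch(r"\s*(\w+)\s*([<>])\s*(\d+(?:\.\d+)?)\s*", condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, op, limit = match.groups()
    # A metric that was not reported never matches.
    return name in metrics and OPERATORS[op](metrics[name], float(limit))


def pick_interval(base: str, rules: list, metrics: dict) -> str:
    """Return the interval of the first matching rule, else the base interval."""
    for rule in rules:
        if condition_matches(rule["when"], metrics):
            return rule["interval"]
    return base


rules = [
    {"when": "wellness_score < 50", "interval": "1m"},
    {"when": "wellness_score < 70", "interval": "3m"},
    {"when": "error_rate > 0.10", "interval": "1m"},
    {"when": "consecutive_failures > 0", "interval": "2m"},
]

healthy = {"wellness_score": 90, "error_rate": 0.01, "consecutive_failures": 0}
degraded = {"wellness_score": 65, "error_rate": 0.15, "consecutive_failures": 0}

print(pick_interval("5m", rules, healthy))
print(pick_interval("5m", rules, degraded))
```

The healthy agent stays on the 5m base interval. The degraded agent gets 3m because the wellness_score < 70 rule sits above the error_rate rule, which shows how rule order affects the outcome.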

8. Full production heartbeat.md

Here is a complete, production-ready heartbeat.md for an e-commerce inventory agent:

---
interval: 5m

adaptive_cadence:
  enabled: true
  rules:
    - when: wellness_score < 50
      interval: 1m
    - when: error_rate > 0.08
      interval: 2m
  cooldown: 15m

thresholds:
  wellness_score_min: 55
  error_rate_max: 0.05
  consecutive_failures: 3
  response_latency_max_ms: 4000

alerts:
  - type: slack
    webhook_url: https://hooks.slack.com/services/T00/B00/xxxx
    channel: "#inventory-agent"
    mention: "@oncall"
  - type: webhook
    url: https://monitor.example.com/api/heartbeat
    headers:
      Authorization: "Bearer ${MONITOR_TOKEN}"

recovery:
  - action: restart
    delay: 15s
    max_retries: 2
  - action: session_reset
    preserve_context: true
  - action: escalate
    target: human
    channel: slack
    message: "Inventory agent {{agent_id}} needs manual recovery."

metadata:
  owner: inventory-team
  environment: production
  tags:
    - ecommerce
    - critical
---

# Inventory Agent Heartbeat
This agent manages real-time stock levels across 3 warehouses.
During flash sales, adaptive cadence will increase check frequency.
Escalation goes to the inventory team Slack channel.

9. Troubleshooting common heartbeat issues

Heartbeat not running

Verify the file is named exactly heartbeat.md (not heartbeat.yaml or heartbeat.json).
Check the file is in the correct skill directory: ~/.openclaw/skills/<skill-name>/heartbeat.md.
Run openclaw status <skill-name> to confirm the agent is loaded.

YAML parse errors

Ensure the YAML block starts and ends with --- on its own line.
Avoid tabs; use spaces for indentation.
Validate with openclaw validate heartbeat.md before deploying.
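
The first two checks can also be run by hand before deploying. The Python sketch below is a hypothetical pre-check (lint_front_matter is not part of OpenClaw) that looks only for the delimiter and tab problems listed above; it does not parse the YAML or replace the validate command.

```python
def lint_front_matter(text: str) -> list[str]:
    """Report delimiter and tab-indentation problems in a heartbeat.md document."""
    problems = []
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        problems.append("line 1: expected '---' on its own line")
        return problems
    try:
        end = lines.index("---", 1)
    except ValueError:
        problems.append("no closing '---' line found")
        return problems
    # Front-matter content starts on line 2 of the file.
    for number, line in enumerate(lines[1:end], start=2):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab used for indentation")
    return problems


good = "---\nalerts:\n  - type: webhook\n---\n"
bad = "---\nalerts:\n\t- type: webhook\n---\n"

print(lint_front_matter(good))
print(lint_front_matter(bad))
```

An empty list means neither problem was found; any other issue in the YAML still needs a real parser or validator to catch it.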

Alerts not firing

Test your webhook URL independently with curl.
Check that environment variables (e.g., $HEARTBEAT_TOKEN) are set in the agent runtime.
Review agent logs: openclaw logs <skill-name> --filter heartbeat.
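
For the environment variable check, a short Python sketch can list every ${NAME} reference in the file that is unset or empty. The missing_env_vars function is hypothetical, and it must run in the same environment as the agent for the result to be meaningful.

```python
import os
import re


def missing_env_vars(config_text: str, env=None) -> list[str]:
    """Return ${NAME} references in the config that are unset or empty in env."""
    env = os.environ if env is None else env
    names = re.findall(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", config_text)
    # Preserve first-seen order while dropping duplicates.
    unique = list(dict.fromkeys(names))
    return [name for name in unique if not env.get(name)]


config = '''
alerts:
  - type: webhook
    headers:
      Authorization: "Bearer ${HEARTBEAT_TOKEN}"
  - type: webhook
    headers:
      Authorization: "Bearer ${MONITOR_TOKEN}"
'''

# A plain dict stands in for the agent's environment in this example.
print(missing_env_vars(config, {"MONITOR_TOKEN": "abc"}))
```

Calling missing_env_vars(config) with no second argument checks against the current process environment instead.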

Recovery loops / restart storms

Always set max_retries on the restart action.
Add a delay of at least 10 seconds between restart attempts.
End the recovery chain with escalate so a human gets notified when auto-recovery fails.

For a comprehensive troubleshooting reference, see the OpenClaw best practices guide.