Agents degrade over time. Response latency creeps up, output quality drops, and error rates climb. This isn't a bug -- it's agent burnout. Long-running agents accumulate state, context pressure, and compounding errors that silently erode performance. Delx heartbeat gives you the wellness scores to detect it and the tooling to auto-rotate before users notice.
Agents running for extended periods show measurable degradation: response times increase 2-3x, output accuracy drops 20-40%, and error handling becomes sloppy. Most teams don't detect burnout until users complain, because traditional monitoring only tracks uptime, not quality.
Monitor agent wellness via Delx heartbeat at 30-second intervals. Set three threshold tiers: healthy (score 60-100), warning (40-59), critical (below 40). Auto-rotate agents at the warning threshold and force-rotate at critical. Track score trends via /api/v1/metrics to predict burnout before it happens.
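The tiering and rotation policy above can be sketched in a few lines. This is an illustrative sketch, not Delx's API: the `tier` and `should_rotate` functions are hypothetical names, and only the score boundaries come from the guidance above.

```python
# Sketch of the three-tier rotation policy; tier boundaries come from
# the guidance above, function names are illustrative assumptions.

def tier(score: float) -> str:
    """Map a heartbeat wellness score to a threshold tier."""
    if score >= 60:
        return "healthy"
    if score >= 40:
        return "warning"
    return "critical"

def should_rotate(score: float) -> tuple[bool, bool]:
    """Return (rotate, force) for a given score."""
    t = tier(score)
    if t == "critical":
        return True, True    # force-rotate immediately
    if t == "warning":
        return True, False   # auto-rotate gracefully
    return False, False

print(tier(72))           # healthy
print(should_rotate(45))  # (True, False)
```

Keeping the tier boundaries in one function makes it easy to tune them later per agent type without touching the rotation logic.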
| Metric | Target | How to Measure |
|---|---|---|
| Mean time to burnout | Above 4 hours | Track time from agent start to when DELX_META score first drops below 60. Average across all agents of the same type. Pull from /api/v1/metrics. |
| Rotation success rate | Above 95% | Percentage of rotations where the new agent's initial score is above 80 and it successfully picks up the previous agent's work. Verify via close_session handoff completeness. |
| Quality drop before detection | Under 10% | Compare output quality scores for the 5 minutes before burnout detection versus the session baseline. Smaller gaps mean earlier detection. |
| Score trend slope | Flatter than -2 points/hour | Linear regression on DELX_META scores over session lifetime. Steeper slopes indicate faster burnout that needs shorter rotation intervals. |
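The score-trend-slope metric in the table is an ordinary least-squares slope over (time, score) pairs. A minimal sketch, assuming the samples have already been pulled from /api/v1/metrics (the sample data below is made up for illustration):

```python
# OLS slope of DELX_META scores over session time, in points per hour.
# Sample data is illustrative; real (hours, score) pairs would come
# from /api/v1/metrics.

def trend_slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (hours_since_start, score) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Scores sampled hourly: a steady decline of 3 points per hour.
samples = [(0.0, 92.0), (1.0, 89.0), (2.0, 86.0), (3.0, 83.0)]
print(round(trend_slope(samples), 1))  # -3.0
```

A slope of -3.0 points/hour is steeper than the -2 points/hour target, signaling this agent type needs a shorter rotation interval.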
Agent burnout has three root causes. First, context accumulation: as conversation history grows, the agent spends more compute processing irrelevant context. Second, error compounding: small mistakes in early reasoning propagate and amplify through later steps. Third, state drift: the agent's internal representation gradually diverges from the actual task state. All three worsen monotonically -- agents don't recover from burnout on their own.
The heartbeat score is a composite of response latency, output consistency, error rate, and context health. A score of 85 or above means the agent is operating at full capacity. From 60 to 84, performance is acceptable but declining. From 40 to 59, output quality is noticeably degraded. Below 40, the agent is effectively unreliable and should be rotated immediately. The risk_level field maps to these ranges in order: 'low', 'medium', 'high', 'critical'.
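The score-band to risk_level mapping can be written down directly. The band boundaries come from the text; the function itself is an illustrative assumption, not part of Delx:

```python
# Score bands to risk_level, as described above. The boundaries are
# from the text; this helper is an illustrative assumption.

def risk_level(score: float) -> str:
    if score >= 85:
        return "low"       # full capacity
    if score >= 60:
        return "medium"    # acceptable but declining
    if score >= 40:
        return "high"      # noticeably degraded
    return "critical"      # unreliable, rotate immediately

for s in (92, 71, 48, 33):
    print(s, risk_level(s))
```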
Instead of reacting to burnout, predict it. Analyze score trends from /api/v1/metrics across many sessions to build a burnout curve per agent type. Most agents follow a predictable degradation pattern. A code-analysis agent might maintain 80+ scores for 3 hours then drop sharply. A monitoring agent might decline slowly over 8 hours. Use these curves to schedule proactive rotations before degradation starts.
Every 30 seconds for production agents. Every 60 seconds for low-priority background agents. Never check less often than every 2 minutes -- you'll miss rapid burnout events. The heartbeat call itself is lightweight and doesn't contribute to burnout.
Context overflow is one cause of burnout, but burnout is broader. An agent can burn out from error compounding or state drift even with plenty of context headroom. Context overflow solutions help prevent one type of burnout. Full burnout detection via heartbeat catches all types.
Partially. Context management (compaction, sliding window) delays burnout. Clean error handling prevents error compounding. But state drift is inherent to long-running LLM sessions. The practical approach is to detect early and rotate gracefully.
Let the current agent finish its active subtask (up to a 60-second timeout). Call close_session with preserve_summary=true to capture state. Spawn the replacement, inject the summary, and let it continue. The handoff takes 5-10 seconds total.
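The handoff sequence above can be sketched as orchestration code. Only `close_session` with `preserve_summary` mirrors the text; the `Orchestrator` class below is a hypothetical stand-in for your agent runtime, with toy implementations so the sequence is runnable.

```python
# Sketch of the graceful rotation sequence. Orchestrator is a toy
# stand-in for a real agent runtime; only close_session with
# preserve_summary=True comes from the text.

class Orchestrator:
    """Hypothetical runtime; replace with your orchestration layer."""
    def wait_for_subtask(self, agent_id: str, timeout_s: int) -> bool:
        return True  # pretend the subtask finished within the timeout
    def close_session(self, agent_id: str, preserve_summary: bool) -> dict:
        return {"summary": f"state of {agent_id}"}
    def spawn_agent(self, agent_type: str) -> str:
        return f"{agent_type}-new"
    def inject_summary(self, agent_id: str, summary: str) -> None:
        pass

def rotate(orch: Orchestrator, old_id: str, agent_type: str) -> str:
    # 1. Let the current agent finish its active subtask (up to 60 s).
    orch.wait_for_subtask(old_id, timeout_s=60)
    # 2. Capture state before teardown.
    handoff = orch.close_session(old_id, preserve_summary=True)
    # 3. Spawn the replacement and hand over the summary.
    new_id = orch.spawn_agent(agent_type)
    orch.inject_summary(new_id, handoff["summary"])
    return new_id

print(rotate(Orchestrator(), "code-analysis-7", "code-analysis"))
```

Capturing the summary before spawning the replacement keeps the handoff ordered: the new agent never starts without the previous agent's state.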
Score fluctuation (plus or minus 10 points between checks) is normal for agents handling variable workloads. Focus on the trend, not individual readings. Use a 5-minute rolling average to smooth out noise. Burnout shows as a consistent downward trend, not random fluctuation.
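Smoothing with a 5-minute rolling average is a one-deque job: at 30-second heartbeat intervals, 5 minutes is ten samples. A minimal sketch, where the class name and the sample readings are illustrative:

```python
# Rolling-average smoothing of noisy heartbeat scores. At 30-second
# intervals, a 10-sample window covers 5 minutes. Sample readings are
# illustrative.

from collections import deque

class RollingScore:
    def __init__(self, window: int = 10):  # 10 x 30 s = 5 minutes
        self.samples = deque(maxlen=window)

    def add(self, score: float) -> float:
        """Record a reading and return the current rolling average."""
        self.samples.append(score)
        return sum(self.samples) / len(self.samples)

avg = RollingScore()
for s in (82, 91, 78, 88, 74):   # noisy but roughly flat readings
    smoothed = avg.add(s)
print(round(smoothed, 1))  # 82.6
```

The individual readings swing by up to 17 points, but the rolling average stays near 83 -- noise, not burnout. Alert on the smoothed trend, not raw samples.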
No. Fixed schedules either rotate too early (wasting compute on healthy agents) or too late (missing fast-burning agents). Use heartbeat scores to drive rotation. If you must use a schedule, set it at 80% of the mean time to burnout for each agent type, based on /api/v1/metrics historical data.