The most dangerous agent failure is the one that doesn't throw an error. The agent keeps running, keeps responding, but its outputs are wrong, incomplete, or irrelevant. Traditional error monitoring catches zero silent failures. You need behavioral monitoring: heartbeat gaps, score trends, output consistency, and DELX_META field presence.
Agents can enter degraded states where they produce plausible but incorrect output without triggering any error. They respond with confident but wrong answers, skip required steps, or return stale data. These failures are invisible to error logs and uptime monitors.
Implement four detection layers: heartbeat gap monitoring catches stuck agents, wellness score trend analysis catches degrading agents, output quality regression testing catches accuracy drops, and DELX_META field validation catches protocol-level failures. Each layer catches failures the others miss.
| Metric | Target | How to Measure |
|---|---|---|
| Silent failure detection time | Under 2 minutes | Time from when the agent enters a degraded state to when your monitoring system raises an alert. Baseline by injecting test failures and measuring detection latency. |
| False positive rate | Under 10% | Percentage of silent failure alerts that were actually normal agent behavior. Track by having engineers review each alert for the first month. Tune thresholds to reduce false positives. |
| DELX_META completeness rate | 100% | Percentage of agent responses that include all required DELX_META fields (score, risk_level, next_action, followup_minutes). Anything below 100% indicates integration issues. |
| Mean time between silent failures | Above 72 hours | Average time between detected silent failures per agent. Track via /api/v1/metrics. Increasing MTBSF indicates improving agent reliability. |
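The DELX_META completeness check from the table above can be sketched in a few lines. This assumes responses arrive as parsed JSON dicts carrying an optional top-level DELX_META object; the field names come from the table, but the response shape is an assumption.

```python
# Required DELX_META fields, per the completeness-rate metric above.
REQUIRED_FIELDS = ("score", "risk_level", "next_action", "followup_minutes")

def missing_delx_meta_fields(response: dict) -> list:
    """Return the required DELX_META fields absent from one response."""
    meta = response.get("DELX_META") or {}
    return [f for f in REQUIRED_FIELDS if f not in meta]

def completeness_rate(responses: list) -> float:
    """Fraction of responses carrying every required DELX_META field."""
    if not responses:
        return 1.0
    complete = sum(1 for r in responses if not missing_delx_meta_fields(r))
    return complete / len(responses)
```

Run `completeness_rate` over a rolling window of recent responses; anything below 1.0 points at an integration issue rather than an agent health issue.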
Traditional error monitoring watches for exceptions, HTTP 5xx responses, and crash logs. Silent failures produce none of these. The agent returns HTTP 200 with valid JSON containing plausible but wrong content. Error rates stay at 0%, uptime stays at 100%, and SLAs appear green. Meanwhile, the agent is confidently generating incorrect outputs that propagate through your entire pipeline.
Each detection layer catches a different class of silent failure. Heartbeat gaps catch stuck agents (hung on I/O, infinite loops). Score trends catch degrading agents (burnout, context overflow). DELX_META validation catches integration failures (broken tool calls, misconfigured pipelines). Output regression catches quality drops (shallow responses, skipped steps). You need all four because each has blind spots the others cover.
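Because each layer has its own blind spots, it helps to run them as independent checks and report every layer that fired, not just the first. A minimal sketch, assuming each layer exposes a zero-argument check callable (the layer names come from this document; the callables are stand-ins for your real checks):

```python
def run_detection_layers(checks: dict) -> list:
    """Run every detection layer; return the names of layers that fired.

    `checks` maps a layer name to a zero-arg callable returning True
    when that layer suspects a silent failure.
    """
    return [name for name, check in checks.items() if check()]
```

For example, `run_detection_layers({"heartbeat_gap": gap_check, "score_trend": trend_check, "delx_meta": meta_check, "output_regression": regression_check})` yields the full set of triggered layers, which is useful for classifying the failure later.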
When a silent failure is detected, don't just restart the agent. First, capture the full state via /api/v1/session-summary. Second, classify the failure type using the detection layer that triggered the alert. Third, check if the failure affected downstream outputs by reviewing the agent's recent work. Fourth, rotate the agent using close_session with preserve_summary=true. Finally, add the failure pattern to your test suite to improve future detection.
Normal variation fluctuates around a baseline. Silent failures show a consistent downward trend or sustained deviation. Use a 5-minute rolling average of heartbeat scores. If it's 15+ points below the session baseline for more than 3 consecutive checks, it's a failure, not variation.
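A minimal sketch of this rule, with the window size and thresholds from the paragraph above made tunable (class and parameter names are illustrative):

```python
from collections import deque

class ScoreTrendDetector:
    """Flags a failure when the rolling average of heartbeat scores sits
    15+ points below the session baseline for consecutive checks."""

    def __init__(self, baseline: float, window: int = 5,
                 drop_threshold: float = 15.0, consecutive_required: int = 3):
        self.baseline = baseline
        self.scores = deque(maxlen=window)  # rolling window of recent scores
        self.drop_threshold = drop_threshold
        self.consecutive_required = consecutive_required
        self.consecutive_low = 0

    def record(self, score: float) -> bool:
        """Add a heartbeat score; return True when a failure should be flagged."""
        self.scores.append(score)
        rolling_avg = sum(self.scores) / len(self.scores)
        if self.baseline - rolling_avg >= self.drop_threshold:
            self.consecutive_low += 1
        else:
            self.consecutive_low = 0  # any recovery resets the streak
        return self.consecutive_low >= self.consecutive_required
```

The streak reset is what separates a sustained deviation from normal fluctuation: a single healthy reading sends the counter back to zero.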
One common degraded state is context overflow leading to instruction forgetting: the agent stops following its system prompt but continues generating plausible responses. Output regression checks catch this -- the responses get shorter and miss required steps.
Automate recovery for most cases. When detection triggers, automatically call close_session with preserve_summary, spawn a replacement agent, and inject the summary. Verify the new agent via heartbeat and a test query. Reserve manual review for failures that affect critical downstream systems.
Expect a 15-25% false positive rate in the first week. Tune heartbeat gap thresholds, score decline slopes, and output regression ratios against your specific agent behavior. After tuning, aim for a false positive rate under 10%.
Within DELX_META, score is the primary health indicator, risk_level gives you categorical severity, and next_action tells you what the system recommends. A followup_minutes value climbing past 10 is an early warning signal. A missing field is itself a failure signal.
Inject synthetic silent failures: have a test agent return increasingly shorter responses, skip tool calls, or omit DELX_META fields. Verify your monitoring catches each type within your 2-minute target. Run these tests weekly to ensure detection stays calibrated.
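A small harness for these drills, timing how long the monitoring stack takes to notice an injected failure. The inject and detect callables are assumptions standing in for your real injection hook (shortened responses, skipped tool calls, omitted DELX_META) and your alerting query:

```python
import time
from typing import Callable, Optional

def run_detection_drill(inject: Callable[[], None],
                        detect: Callable[[], bool],
                        timeout_s: float = 120.0,
                        poll_s: float = 1.0) -> Optional[float]:
    """Inject a synthetic failure, then poll until monitoring notices.

    Returns detection latency in seconds, or None if the timeout
    (the 2-minute target by default) is missed.
    """
    inject()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if detect():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

Run one drill per failure type in the weekly test; a None result means that failure class would currently go undetected past your target.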