The Delx recovery tool is the core MCP endpoint for agent failure remediation. When an agent encounters an error it can't resolve, it calls recovery with the failure context and gets back a structured remediation plan with a next_action, risk assessment, and controller update. The tool handles 12 failure categories including timeout, context_overflow, dependency_failure, and hallucination_loop.
POST /v1/mcp tools/call recovery

| Name | Type | Required | Description |
|---|---|---|---|
| incident_description | string | Yes | What went wrong. Include error messages, stack traces, or behavioral observations. |
| session_id | string | No | Existing session ID for continuity. If omitted, a new session starts. |
| agent_id | string | No | Stable identifier for the calling agent. Used for metrics and history. |
| severity | string | No | Override severity: low, medium, high, critical. Auto-detected if omitted. |

{"tool": "recovery", "arguments": {"incident_description": "API call to /users returned 503 three times in a row. Retry budget exhausted.", "agent_id": "agent-prod-01"}}

The tool classified this as a dependency_failure, assessed it as high risk (score 45), and provided a concrete next action with a cooldown period.

{"tool": "recovery", "arguments": {"incident_description": "Context window at 94% capacity after 47 tool calls. Response quality degrading.", "session_id": "sess-abc-123"}}

Recovery detected a context_overflow pattern and suggested session compaction rather than a full restart, preserving session continuity.

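Both requests above share one shape, so the payload can be built by a small helper. The following is a minimal Python sketch, not part of the Delx API: `build_recovery_call` is a hypothetical name, and the client-side check on `severity` is an assumption based on the values listed in the parameter table.

```python
import json

# Severity values listed in the parameter table.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def build_recovery_call(incident_description, session_id=None,
                        agent_id=None, severity=None):
    """Build a recovery tool call in the shape used by the examples above.

    Hypothetical helper: the validation here is client-side only.
    """
    if not incident_description or not incident_description.strip():
        raise ValueError("incident_description is required")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")

    arguments = {"incident_description": incident_description}
    # Optional fields are left out entirely when not provided.
    for key, value in (("session_id", session_id),
                       ("agent_id", agent_id),
                       ("severity", severity)):
        if value is not None:
            arguments[key] = value
    return {"tool": "recovery", "arguments": arguments}


payload = build_recovery_call(
    "API call to /users returned 503 three times in a row. Retry budget exhausted.",
    agent_id="agent-prod-01",
)
print(json.dumps(payload))
```

Leaving optional fields out, rather than sending them as null, keeps the request consistent with the documented behavior for omitted parameters (a new session starts, severity is auto-detected).
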
Every recovery response includes a score from 0 to 100. Higher scores mean the agent is healthier. Below 40 is critical: the agent should stop autonomous work and wait for controller input. Between 40 and 60 is degraded: proceed with caution and increase heartbeat frequency. Above 60 means the incident is manageable and the agent can self-heal with the provided remediation. The score factors in incident severity, session history, and recovery success rate.
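The score bands can be expressed as a small lookup. This is a sketch under one stated assumption: the text does not say which band the boundary values 40 and 60 belong to, so both are treated as degraded here. `classify_score` is a hypothetical name.

```python
def classify_score(score):
    """Map a recovery score (0-100, higher is healthier) to a band.

    Assumption: the boundary values 40 and 60 count as "degraded".
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 40:
        return "critical"    # stop autonomous work, wait for controller input
    if score <= 60:
        return "degraded"    # proceed with caution, raise heartbeat frequency
    return "manageable"      # self-heal with the provided remediation


for s in (25, 45, 80):
    print(s, classify_score(s))
```
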
Use process_failure to log and classify errors without getting a remediation plan. Use recovery when you need actionable next steps. In practice, most production agents call process_failure for every error (building a history) and recovery only when they're stuck. The recovery tool reads the process_failure history to provide better-informed remediation plans.
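The "log everything, recover only when stuck" pattern might look like the sketch below. Two things are assumptions rather than documented behavior: that process_failure accepts the same `incident_description` argument as recovery, and that "stuck" means the agent has no retries left. `plan_tool_calls` is a hypothetical name.

```python
def plan_tool_calls(error_message, retries_left):
    """Return the tool calls to make for one error, in order.

    Assumptions: process_failure takes `incident_description`, and the
    agent counts as stuck once its retry budget reaches zero.
    """
    args = {"incident_description": error_message}
    # Always log the failure so recovery has history to read later.
    calls = [{"tool": "process_failure", "arguments": dict(args)}]
    # Ask for a remediation plan only once the agent is out of options.
    if retries_left <= 0:
        calls.append({"tool": "recovery", "arguments": dict(args)})
    return calls


print([c["tool"] for c in plan_tool_calls("503 from /users", retries_left=2)])
print([c["tool"] for c in plan_tool_calls("503 from /users", retries_left=0)])
```
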
Always pass session_id when calling recovery. This lets the tool access the full session context: previous failures, recovery attempts, and wellness trajectory. Without a session_id, recovery treats each call as isolated, which is still useful but less accurate. Sessions persist for 24 hours by default, or until explicitly closed with close_session.
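One way to make sure every call carries a session_id is to keep the last one on the client and discard it once it has probably expired. The sketch below assumes the 24-hour lifetime is measured from when the session was first seen, which the text does not specify. `SessionTracker` is a hypothetical helper, not part of Delx.

```python
import time

SESSION_TTL_SECONDS = 24 * 60 * 60  # default session lifetime from the text


class SessionTracker:
    """Remember the last session_id and drop it once it has likely expired.

    Assumption: the 24-hour lifetime starts when the session is first seen.
    """

    def __init__(self, clock=time.time):
        self._clock = clock
        self._session_id = None
        self._started_at = None

    def remember(self, session_id):
        # Reset the start time only when the session actually changes.
        if session_id != self._session_id:
            self._session_id = session_id
            self._started_at = self._clock()

    def current(self):
        """Return the session_id to send, or None to start a new session."""
        if self._session_id is None:
            return None
        if self._clock() - self._started_at >= SESSION_TTL_SECONDS:
            self._session_id = None  # treat as expired
            return None
        return self._session_id


tracker = SessionTracker()
tracker.remember("sess-abc-123")
print(tracker.current())
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting a day.
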
The recovery endpoint has a 99.9% uptime SLA. If it's unreachable, your agent should fall back to its local error handling. Don't retry recovery in a tight loop, because that creates the same retry storm problem recovery is designed to prevent.
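A bounded retry with exponential backoff followed by a local fallback is one way to follow this advice. In the sketch below, `call_recovery` and `local_handler` are hypothetical callables supplied by the agent, and the attempt count and delays are illustrative defaults, not Delx recommendations.

```python
import time


def recover_or_fall_back(call_recovery, local_handler, incident,
                         max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try recovery a bounded number of times, then use local handling.

    `call_recovery` and `local_handler` are hypothetical callables that
    each take the incident description and return a result.
    """
    for attempt in range(max_attempts):
        try:
            return call_recovery(incident)
        except (ConnectionError, TimeoutError):
            # Delay doubles each time (1s, 2s, ...); no wait after the last try.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Recovery stayed unreachable: hand over to the agent's own handling.
    return local_handler(incident)
```

With the defaults, an unreachable endpoint costs three attempts and about three seconds of waiting before the agent falls back, instead of an unbounded retry loop.
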
Recovery behavior can be customized per agent. The recovery tool reads your agent's configuration from the agent card. You can set preferred remediation strategies, blocked actions, and escalation thresholds. Register your agent card at /.well-known/agent-card.json.
Median latency is 120ms. P99 is under 500ms. The tool runs rule-based classification first (fast) and only invokes LLM reasoning for ambiguous cases. Most production incidents are resolved in under 200ms.
Recovery is also available over A2A. Call it using message/send with the recovery tool in the artifacts. The response includes an mcp_handoff field that A2A orchestrators can parse directly.
The basic recovery endpoint (quick_operational_recovery) is free. The premium version (get_recovery_action_plan) costs $0.01 per call via x402 and includes deeper analysis, historical pattern matching, and fleet-wide recommendations.