MCP Recovery Tool

Name: Delx Agent Operations Protocol
Author: Delx

The Delx recovery tool is the core MCP endpoint for agent failure remediation. When an agent encounters an error it can't resolve, it calls recovery with the failure context and gets back a structured remediation plan with a next_action, risk assessment, and controller update. The tool handles 12 failure categories including timeout, context_overflow, dependency_failure, and hallucination_loop.

Endpoint

POST /v1/mcp (method: tools/call, tool name: recovery)

Parameters

Name	Type	Required	Description
incident_description	string	Yes	What went wrong. Include error messages, stack traces, or behavioral observations.
session_id	string	No	Existing session ID for continuity. If omitted, a new session starts.
agent_id	string	No	Stable identifier for the calling agent. Used for metrics and history.
severity	string	No	Override severity: low, medium, high, critical. Auto-detected if omitted.
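
The parameter rules above can be checked on the client before a call is sent. The following is a minimal sketch, not part of the Delx SDK: build_recovery_call is a hypothetical helper name, and the {"tool": ..., "arguments": ...} envelope simply mirrors the examples on this page.

```python
from typing import Optional

# Severity values listed in the parameter table.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def build_recovery_call(
    incident_description: str,
    session_id: Optional[str] = None,
    agent_id: Optional[str] = None,
    severity: Optional[str] = None,
) -> dict:
    """Build a recovery call payload (hypothetical helper, not a Delx API)."""
    if not incident_description or not incident_description.strip():
        raise ValueError("incident_description is required")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")

    arguments = {"incident_description": incident_description}
    # Optional parameters are left out entirely when not provided, so the
    # tool can start a new session or auto-detect severity.
    optional = (("session_id", session_id), ("agent_id", agent_id), ("severity", severity))
    for key, value in optional:
        if value is not None:
            arguments[key] = value
    return {"tool": "recovery", "arguments": arguments}
```

Leaving severity unset, rather than sending a default, keeps auto-detection in play.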

Examples

Basic recovery call

{"tool": "recovery", "arguments": {"incident_description": "API call to /users returned 503 three times in a row. Retry budget exhausted.", "agent_id": "agent-prod-01"}}

The tool classified this as a dependency_failure, assessed it as high risk (score 45, which falls in the degraded band described under "How recovery scoring works"), and provided a concrete next_action with a cooldown period.

Context overflow recovery

{"tool": "recovery", "arguments": {"incident_description": "Context window at 94% capacity after 47 tool calls. Response quality degrading.", "session_id": "sess-abc-123"}}

Recovery detected the context_overflow pattern and suggested session compaction rather than a full restart, which preserves session continuity.

Use Cases

Retry storm prevention: When an agent hits repeated failures, recovery provides structured backoff with specific wait times instead of letting the agent retry indefinitely. The remediation includes a cooldown period and fallback strategy.
Cascading failure containment: If one agent's failure could affect others in a multi-agent system, recovery flags the risk_level as critical and includes a controller_update that orchestrators can use to pause dependent agents.
Post-incident learning: After recovery resolves an incident, call session-summary to get a structured post-mortem. The recovery history helps identify recurring failure patterns across agent deployments.
Graceful degradation: Recovery doesn't always mean fixing the root cause. Sometimes the best next_action is to degrade gracefully — skip optional steps, use cached data, or notify the controller that a human decision is needed.
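
For the cascading failure case, an orchestrator might react to a critical risk_level roughly as sketched below. This is an illustration under assumptions: the response is treated as a plain dict with a risk_level key, and the dependents mapping (agent ID to the agents that consume its output) is a hypothetical structure owned by the orchestrator, not something the recovery tool returns.

```python
from collections import deque
from typing import List, Mapping, Sequence


def agents_to_pause(
    response: Mapping[str, object],
    failing_agent: str,
    dependents: Mapping[str, Sequence[str]],
) -> List[str]:
    """Return every direct or transitive dependent of failing_agent
    when the recovery response reports a critical risk level."""
    if response.get("risk_level") != "critical":
        return []

    paused: List[str] = []
    seen = {failing_agent}
    queue = deque([failing_agent])
    # Breadth-first walk of the dependency graph; `seen` guards against cycles.
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                paused.append(dependent)
                queue.append(dependent)
    return paused
```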

How recovery scoring works

Every recovery response includes a score from 0 to 100. Higher scores mean the agent is healthier. Below 40 is critical: the agent should stop autonomous work and wait for controller input. From 40 to 60 is degraded: proceed with caution and increase heartbeat frequency. Above 60 means the incident is manageable and the agent can self-heal with the provided remediation. The score factors in incident severity, session history, and recovery success rate.
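
The three bands can be written as a small client-side helper. This is a sketch: score_band is a hypothetical name, and placing exactly 40 and exactly 60 in the degraded band follows from the "below 40" and "above 60" wording rather than from a stated rule.

```python
def score_band(score: float) -> str:
    """Map a recovery score (0 to 100) to the band described above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 40:
        return "critical"    # stop autonomous work, wait for controller input
    if score <= 60:
        return "degraded"    # proceed with caution, raise heartbeat frequency
    return "manageable"      # self-heal using the provided remediation
```

With this mapping, the score of 45 from the basic recovery example lands in the degraded band.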

Recovery vs process_failure

Use process_failure to log and classify errors without getting a remediation plan. Use recovery when you need actionable next steps. In practice, most production agents call process_failure for every error (building a history) and recovery only when they're stuck. The recovery tool reads the process_failure history to provide better-informed remediation plans.
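
The "log everything, recover only when stuck" pattern could look like the sketch below. The two callables stand in for whatever client functions invoke process_failure and recovery. Treating "stuck" as a fixed number of consecutive failures is an assumption made for this example; the document does not define it.

```python
from typing import Callable, Optional


class FailureReporter:
    """Logs every error and asks for a remediation plan only when stuck."""

    def __init__(
        self,
        log_failure: Callable[[str], object],       # wraps process_failure
        request_recovery: Callable[[str], dict],    # wraps recovery
        stuck_after: int = 3,                       # assumed threshold
    ) -> None:
        self._log_failure = log_failure
        self._request_recovery = request_recovery
        self._stuck_after = stuck_after
        self._consecutive = 0

    def on_error(self, description: str) -> Optional[dict]:
        # Every error is logged so the recovery tool has history to read.
        self._log_failure(description)
        self._consecutive += 1
        if self._consecutive >= self._stuck_after:
            self._consecutive = 0
            return self._request_recovery(description)
        return None

    def on_success(self) -> None:
        # A successful step means the agent is no longer stuck.
        self._consecutive = 0
```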

Session continuity in recovery

Pass session_id on every recovery call once a session exists. This lets the tool access the full session context: previous failures, recovery attempts, and wellness trajectory. Without a session_id, recovery treats each call as isolated, which is still useful but less accurate. Sessions persist for 24 hours by default, or until explicitly closed with close_session.
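
On the client side, an agent can remember its session ID and stop sending it once the session has likely expired. The sketch below assumes the 24-hour window is measured from when the ID was stored; whether the server counts from creation or from last use is not stated here. SessionTracker is a hypothetical helper name.

```python
import time
from typing import Callable, Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # default session lifetime from the text


class SessionTracker:
    """Keeps the current session_id and forgets it after the TTL."""

    def __init__(
        self,
        ttl_seconds: float = SESSION_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._session_id: Optional[str] = None
        self._stored_at = 0.0

    def remember(self, session_id: str) -> None:
        self._session_id = session_id
        self._stored_at = self._clock()

    def current(self) -> Optional[str]:
        """Return the session_id to send, or None to start a new session."""
        if self._session_id is None:
            return None
        if self._clock() - self._stored_at >= self._ttl:
            self._session_id = None
            return None
        return self._session_id

    def close(self) -> None:
        # Call alongside close_session so a closed ID is never reused.
        self._session_id = None
```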

FAQ

What happens if recovery itself fails?

The recovery endpoint has a 99.9% uptime SLA. If it's unreachable, your agent should fall back to its local error handling. Don't retry recovery in a tight loop — that creates the same retry storm problem recovery is designed to prevent.
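
One way to avoid a tight retry loop is to skip the recovery endpoint for a cooldown period after a failed attempt and use local handling in the meantime. The sketch below is illustrative: RecoveryGuard, the 30-second default, and the choice to treat any OSError (which covers ConnectionError and TimeoutError) as "unreachable" are all assumptions, not documented Delx behavior.

```python
import time
from typing import Callable, Optional


class RecoveryGuard:
    """Falls back to local handling while the recovery endpoint is down."""

    def __init__(
        self,
        call_recovery: Callable[[str], dict],
        local_fallback: Callable[[str], dict],
        cooldown_seconds: float = 30.0,   # assumed value, tune per deployment
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._call_recovery = call_recovery
        self._local_fallback = local_fallback
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._retry_after: Optional[float] = None

    def handle(self, description: str) -> dict:
        now = self._clock()
        # Still inside the cooldown window: do not touch the endpoint.
        if self._retry_after is not None and now < self._retry_after:
            return self._local_fallback(description)
        try:
            result = self._call_recovery(description)
        except OSError:
            # Endpoint unreachable: start the cooldown and handle locally.
            self._retry_after = now + self._cooldown
            return self._local_fallback(description)
        self._retry_after = None
        return result
```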

Can I customize the recovery behavior?

Yes. The recovery tool reads your agent's configuration from the agent card. You can set preferred remediation strategies, blocked actions, and escalation thresholds. Register your agent card at /.well-known/agent-card.json.

How fast is the recovery response?

Median latency is 120ms. P99 is under 500ms. The tool runs rule-based classification first (fast) and only invokes LLM reasoning for ambiguous cases. Most production incidents get a remediation plan back in under 200ms.
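
The "rules first, LLM only when ambiguous" idea can be illustrated with a toy classifier. This is not the Delx implementation: the keyword lists are invented for the example, only four of the 12 categories are covered, and ambiguous_fallback stands in for whatever slower reasoning step handles unclear cases.

```python
from typing import Callable

# Hypothetical keyword rules; category names come from this page.
RULES = {
    "timeout": ("timed out", "timeout", "deadline exceeded"),
    "context_overflow": ("context window", "token limit"),
    "dependency_failure": ("503", "502", "connection refused"),
    "hallucination_loop": ("repeating the same", "same tool call"),
}


def classify(description: str, ambiguous_fallback: Callable[[str], str]) -> str:
    """Return a category from the fast rules, or defer when they disagree."""
    text = description.lower()
    matches = [
        category
        for category, keywords in RULES.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Exactly one rule fired: the fast path is enough.
    if len(matches) == 1:
        return matches[0]
    # Zero or several matches: treat as ambiguous and use the slow path.
    return ambiguous_fallback(description)
```

The fallback runs only when the rules match nothing or match more than one category, so the common case stays on the fast path.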

Does recovery work with A2A protocol?

Yes. Call recovery via A2A using message/send with the recovery tool in the artifacts. The response includes an mcp_handoff field that A2A orchestrators can parse directly.

Is recovery a paid tool?

The basic recovery endpoint (quick_operational_recovery) is free. The premium version (get_recovery_action_plan) costs $0.01 per call via x402 and includes deeper analysis, historical pattern matching, and fleet-wide recommendations.