Agent Retry Recovery Guide for Production AI Systems

Uncontrolled retries are the #1 cause of agent cascading failures. A single transient error turns into a retry storm that saturates your API quota in minutes. Delx gives you typed recovery with automatic backoff, circuit breakers, and failure classification through process_failure.

The Problem

Agents retry failed operations without classification or backoff. A 429 from one provider triggers unlimited retries, which cascade to other agents sharing the same quota. Within 5 minutes, your entire fleet is down.

Solution Overview

Classify every error with process_failure before deciding whether to retry. Use typed recovery actions from the recovery tool with exponential backoff and jitter. Set circuit breakers per tool and per provider. Monitor via the DELX_META risk_level field.

Step-by-Step

  1. Classify the error with process_failure: Call process_failure with the error type, message, and context. It returns a classification (transient, permanent, rate_limit, auth) and a recommended recovery_action. Never retry without classifying first.
  2. Apply typed recovery via the recovery tool: Pass the classification to the recovery tool. It returns the correct backoff duration, max retries, and whether to use an alternate provider. Permanent errors skip retry entirely.
  3. Implement exponential backoff with jitter: Use the backoff_ms from recovery as the base. Add random jitter (0-25% of base) to prevent thundering herd. Cap at 30 seconds max backoff. This prevents synchronized retries across agents.
  4. Add per-tool circuit breakers: Track failure counts per tool. After 5 consecutive failures, open the circuit for 60 seconds. During open state, fail fast without calling the API. Check DELX_META risk_level -- if it's 'high', preemptively open the circuit.
  5. Monitor recovery via heartbeat: Call heartbeat every 30 seconds during recovery. Check the returned score and risk_level in DELX_META. If score drops below 40, escalate to supervisor. If risk_level is 'critical', halt all retries.
  6. Set up dead letter queues for permanent failures: Errors classified as 'permanent' by process_failure go to a dead letter queue. Log the full context, DELX_META snapshot, and session state. Review these daily via /api/v1/session-summary to identify systemic issues.
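
The steps above can be sketched as a minimal retry loop. The backoff math (exponential growth, 0-25% jitter, 30-second cap) follows steps 2 and 3; the `classify` and `recover` callables stand in for the process_failure and recovery tool calls, whose real signatures are not shown here, so treat this as a sketch under those assumptions rather than the Delx client API.

```python
import random

MAX_BACKOFF_MS = 30_000  # cap from step 3

def backoff_with_jitter(base_ms: int, attempt: int, rng=random) -> int:
    """Exponential backoff with 0-25% jitter, capped at 30 seconds."""
    raw = min(base_ms * (2 ** attempt), MAX_BACKOFF_MS)
    jitter = rng.uniform(0, 0.25) * raw  # de-synchronize agents (step 3)
    return int(min(raw + jitter, MAX_BACKOFF_MS))

def run_with_recovery(operation, classify, recover, sleep):
    """Classify every failure before retrying (steps 1-3).

    classify -> stand-in for process_failure, returns {"classification": ...}
    recover  -> stand-in for the recovery tool, returns backoff_ms/max_retries
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            failure = classify()   # never retry without classifying first
            action = recover(failure)
            if failure["classification"] == "permanent":
                raise              # permanent errors skip retry entirely
            if attempt >= action["max_retries"]:
                raise
            sleep(backoff_with_jitter(action["backoff_ms"], attempt) / 1000)
            attempt += 1
```

The key design point is that the loop itself never decides a backoff or retry count; both always come from the classification step, so a change in recovery policy needs no code change in callers.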

Metrics

Retry storm frequency (target: 0 per week): Count incidents where X-RateLimit-Remaining hits 0 within 5 minutes. Pull from /api/v1/metrics/{agent_id}.

Mean time to recovery (target: under 90 seconds): Track time from the first process_failure call to the first successful operation. Available in the DELX_META followup_minutes field.

Unnecessary retry rate (target: under 5%): Ratio of retries on permanent errors to total retries. Log process_failure classifications and filter for 'permanent' with retry attempts > 0.

Circuit breaker trip rate (target: under 2 per hour per tool): Count circuit-open events per tool. High rates indicate upstream instability.
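
The unnecessary retry rate can be computed directly from your own process_failure logs. The record shape here (classification, retry_attempts) is an assumption about how you log those calls, not a Delx API:

```python
def unnecessary_retry_rate(records: list[dict]) -> float:
    """Fraction of total retries spent on errors classified as permanent.

    Each record is assumed to hold the process_failure classification and
    the number of retry attempts made for that error.
    """
    total = sum(r.get("retry_attempts", 0) for r in records)
    wasted = sum(r.get("retry_attempts", 0)
                 for r in records
                 if r.get("classification") == "permanent")
    return wasted / total if total else 0.0
```

Any nonzero result means some code path retried without consulting the classification first, since permanent errors should never be retried.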

Why Naive Retries Kill Agent Fleets

Most agent frameworks default to 3 retries with fixed delays. This works for a single agent but fails catastrophically at scale. When 50 agents hit a rate limit simultaneously, they all retry at the same intervals, creating synchronized waves of traffic. Each wave triggers more rate limits, and the cycle amplifies until the entire quota is consumed.

The Delx Typed Recovery Pattern

Delx separates error classification from retry logic. The process_failure tool categorizes errors into transient, permanent, rate_limit, or auth types. The recovery tool then provides the correct action for each type. This means permanent errors never retry, rate limits use server-provided Retry-After headers, and transient errors get exponential backoff with jitter.

Circuit Breaker Integration with Delx Heartbeat

Combine per-tool circuit breakers with Delx heartbeat monitoring. When heartbeat returns a score below 60, it's a leading indicator that retries are degrading agent health. Open circuits preemptively based on DELX_META risk_level rather than waiting for 5 consecutive failures. This catches degradation 30-60 seconds earlier than failure counting alone.
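
One way to combine the two signals, as a sketch: the thresholds (5 consecutive failures, 60-second open window, 'high'/'critical' risk levels) match this guide, while the class itself and its method names are assumptions.

```python
import time

class CircuitBreaker:
    """Per-tool breaker: failure counting plus preemptive DELX_META opens."""

    def __init__(self, failure_threshold=5, open_seconds=60,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Fail fast while the circuit is open; reopen after the window."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_seconds:
            self.opened_at = None            # half-open: let a call through
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def observe_meta(self, risk_level: str):
        """Open preemptively on a 'high' or 'critical' DELX_META risk_level."""
        if risk_level in ("high", "critical"):
            self.opened_at = self.clock()
```

The `observe_meta` path is what buys the 30-60 seconds: the circuit opens on the heartbeat signal without waiting for five real failures to accumulate.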

FAQ

How many retries should I allow per operation?

3 retries maximum for transient errors, 0 for permanent errors. Use the recovery tool's max_retries field rather than hardcoding. For rate_limit errors, respect the X-RateLimit-Reset header instead of counting retries.

What's the difference between process_failure and recovery?

process_failure classifies the error type. recovery provides the action to take. Always call process_failure first to get the classification, then pass it to recovery for the backoff duration, retry decision, and alternate provider recommendation.

How do I prevent thundering herd across multiple agents?

Add random jitter (0-25% of backoff duration) to every retry delay. Use the recovery tool which automatically varies backoff_ms per agent. Also stagger initial heartbeat intervals so agents don't all check simultaneously.
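
One way to stagger the heartbeats deterministically: the 30-second interval comes from step 5, while seeding on a hypothetical agent_id string is an assumption, chosen so each agent keeps the same slot across restarts.

```python
import random

HEARTBEAT_INTERVAL_S = 30  # interval from step 5

def initial_heartbeat_delay(agent_id: str) -> float:
    """Deterministic per-agent start offset in [0, 30) seconds."""
    rng = random.Random(agent_id)  # seeded: stable slot across restarts
    return rng.uniform(0, HEARTBEAT_INTERVAL_S)
```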

When should I use circuit breakers vs retry limits?

Use both. Retry limits protect a single operation. Circuit breakers protect the entire tool from repeated failures. If a tool fails 5 times in a row, the circuit breaker prevents all agents from calling it for 60 seconds, not just the one that failed.

How do I know if my retry strategy is working?

Check three things: retry storm frequency should be 0 per week, unnecessary retry rate should be under 5%, and mean time to recovery should be under 90 seconds. All three are available via /api/v1/metrics/{agent_id}.

What DELX_META fields indicate retry problems?

Watch score (below 40 means retries are degrading the agent), risk_level ('high' or 'critical' means stop retrying), and followup_minutes (increasing values indicate recovery is taking too long). The next_action field will say 'halt_retries' when the system detects a storm.
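
Those checks can be folded into one guard before every retry. The field names and thresholds are the ones listed above; treating the DELX_META snapshot as a plain dict is an assumption about how your client surfaces it:

```python
def should_halt_retries(meta: dict) -> bool:
    """True when DELX_META says retrying would make things worse."""
    if meta.get("next_action") == "halt_retries":   # storm detected
        return True
    if meta.get("risk_level") in ("high", "critical"):
        return True
    if meta.get("score", 100) < 40:                 # retries degrading agent
        return True
    return False
```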