Uncontrolled retries are the #1 cause of cascading agent failures. A single transient error turns into a retry storm that saturates your API quota in minutes. Delx gives you typed recovery with automatic backoff, circuit breakers, and failure classification through process_failure.
Agents retry failed operations without classification or backoff. A 429 from one provider triggers unlimited retries, which cascade to other agents sharing the same quota. Within 5 minutes, your entire fleet is down.
Classify every error with process_failure before deciding whether to retry. Use typed recovery actions from the recovery tool with exponential backoff and jitter. Set circuit breakers per tool and per provider. Monitor via the DELX_META risk_level field.
| Metric | Target | How to Measure |
|---|---|---|
| Retry storm frequency | 0 per week | Count incidents where X-RateLimit-Remaining hits 0 within 5 minutes. Pull from /api/v1/metrics/{agent_id}. |
| Mean time to recovery | Under 90 seconds | Track time from first process_failure call to successful operation. Available in DELX_META followup_minutes field. |
| Unnecessary retry rate | Under 5% | Ratio of retries on permanent errors to total retries. Log process_failure classifications and filter by 'permanent' with retry attempts > 0. |
| Circuit breaker trip rate | Under 2 per hour per tool | Count circuit open events per tool. High rates indicate upstream instability. |
Most agent frameworks default to 3 retries with fixed delays. This works for a single agent but fails catastrophically at scale. When 50 agents hit a rate limit simultaneously, they all retry at the same intervals, creating synchronized waves of traffic. Each wave triggers more rate limits, and the cycle amplifies until the entire quota is consumed.
Delx separates error classification from retry logic. The process_failure tool categorizes errors into transient, permanent, rate_limit, or auth types. The recovery tool then provides the correct action for each type. This means permanent errors never retry, rate limits use server-provided Retry-After headers, and transient errors get exponential backoff with jitter.
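The separation can be sketched as a small dispatch function. The four classification strings mirror the types named above, but the delay constants and the function itself are illustrative assumptions, not Delx's actual API:

```python
import random

def plan_retry(classification, attempt=0, retry_after_s=None):
    """Map a process_failure classification to (should_retry, delay_seconds).
    Constants here are illustrative, not Delx defaults."""
    if classification in ("permanent", "auth"):
        return (False, 0.0)  # never retry: the call cannot succeed as-is
    if classification == "rate_limit":
        # honor the server-provided Retry-After value when the provider sends one
        return (True, float(retry_after_s) if retry_after_s is not None else 60.0)
    if classification == "transient":
        base = min(30.0, 0.5 * (2 ** attempt))  # exponential backoff, capped at 30s
        return (True, base + random.uniform(0, 0.25 * base))  # 0-25% jitter
    raise ValueError(f"unknown classification: {classification!r}")
```

The jitter term is what prevents the synchronized waves described earlier: two agents classifying the same error at the same moment still retry at different times.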
Combine per-tool circuit breakers with Delx heartbeat monitoring. When heartbeat returns a score below 60, it's a leading indicator that retries are degrading agent health. Open circuits preemptively based on DELX_META risk_level rather than waiting for 5 consecutive failures. This catches degradation 30-60 seconds earlier than failure counting alone.
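A minimal sketch of that combined trip condition, using the thresholds from the text (score below 60, risk_level 'high' or 'critical', 5 consecutive failures); the function signature is an assumption, not part of Delx:

```python
def should_open_circuit(consecutive_failures,
                        heartbeat_score=None,
                        risk_level=None,
                        failure_threshold=5):
    """Open the circuit on repeated failures, or preemptively when
    health signals degrade before the failure count is reached."""
    if consecutive_failures >= failure_threshold:
        return True                       # classic failure-count trip
    if heartbeat_score is not None and heartbeat_score < 60:
        return True                       # leading indicator: degrading health
    if risk_level in ("high", "critical"):
        return True                       # DELX_META risk signal
    return False
```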
3 retries maximum for transient errors, 0 for permanent errors. Use the recovery tool's max_retries field rather than hardcoding. For rate_limit errors, respect the X-RateLimit-Reset header instead of counting retries.
process_failure classifies the error type. recovery provides the action to take. Always call process_failure first to get the classification, then pass it to recovery for the backoff duration, retry decision, and alternate provider recommendation.
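The call order looks roughly like this. Delx's client interface isn't documented in this section, so the `delx` object and the response field names below are stand-ins; the point is the ordering: classify first, then recover:

```python
def handle_failure(delx, error, attempt):
    """Illustrative call order only: process_failure -> recovery.
    `delx` is a hypothetical client; field names are assumptions."""
    result = delx.process_failure(error=error)   # step 1: classify the error
    plan = delx.recovery(
        classification=result["type"],           # step 2: get the typed action
        attempt=attempt,
    )
    if not plan["should_retry"]:
        raise error                              # permanent/auth: surface it
    return plan["backoff_ms"], plan.get("alternate_provider")
```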
Add random jitter (0-25% of backoff duration) to every retry delay. Use the recovery tool, which automatically varies backoff_ms per agent. Also stagger initial heartbeat intervals so agents don't all check simultaneously.
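If you compute delays yourself rather than using the recovery tool, the 0-25% rule is one line (a sketch, not Delx's implementation):

```python
import random

def add_jitter(backoff_ms):
    """Add 0-25% random jitter to a backoff duration so agents
    sharing a quota don't retry in lockstep."""
    return backoff_ms + random.uniform(0, 0.25 * backoff_ms)
```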
Use both. Retry limits protect a single operation. Circuit breakers protect the entire tool from repeated failures. If a tool fails 5 times in a row, the circuit breaker prevents all agents from calling it for 60 seconds, not just the one that failed.
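A per-tool circuit breaker with those numbers (open after 5 consecutive failures, stay open for 60 seconds) fits in a few lines. This is a generic sketch, not Delx's built-in breaker:

```python
import time

class CircuitBreaker:
    """Minimal per-tool circuit breaker: opens after `failure_threshold`
    consecutive failures, blocks calls for `open_seconds`, then allows
    a probe call through (a simple half-open state)."""

    def __init__(self, failure_threshold=5, open_seconds=60):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.open_seconds:
            self.opened_at = None         # cooldown elapsed: permit a probe
            self.failures = 0
            return True
        return False                      # circuit open: block all callers

    def record(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Because the breaker is keyed per tool rather than per agent, one instance shared across the fleet is what stops every agent from hammering a failing tool, not just the agent that saw the failures.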
Check three things: retry storm frequency should be 0 per week, unnecessary retry rate should be under 5%, and mean time to recovery should be under 90 seconds. All three are available via /api/v1/metrics/{agent_id}.
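The three checks reduce to a single predicate over the metrics payload. The field names below are assumptions about the /api/v1/metrics/{agent_id} response shape, not documented keys:

```python
def retry_health_ok(metrics):
    """True when all three retry-health thresholds from the text hold.
    Missing fields default to failing values (assumed key names)."""
    return (
        metrics.get("retry_storms_this_week", 1) == 0
        and metrics.get("unnecessary_retry_rate", 1.0) < 0.05
        and metrics.get("mean_time_to_recovery_s", float("inf")) < 90
    )
```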
Watch score (below 40 means retries are degrading the agent), risk_level ('high' or 'critical' means stop retrying), and followup_minutes (increasing values indicate recovery is taking too long). The next_action field will say 'halt_retries' when the system detects a storm.
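A guard built from those signals might look like this. The field names come from the text above; the payload structure and thresholds as code are assumptions:

```python
def retry_allowed(meta):
    """Gate retries on DELX_META signals: halt_retries, risk_level
    'high'/'critical', or score below 40 all stop further retries."""
    if meta.get("next_action") == "halt_retries":
        return False                      # the system has detected a storm
    if meta.get("risk_level") in ("high", "critical"):
        return False
    if meta.get("score", 100) < 40:
        return False                      # retries are degrading the agent
    return True
```

Rising followup_minutes is a softer signal: treat it as a reason to lengthen backoff or alert, rather than a hard stop like the three checks above.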