Agent Recovery vs Observability: Why You Need Both

Every team running AI agents in production eventually faces the same question: do we need observability, recovery, or both? The short answer is both. But the distinction between these two disciplines is often blurred, leading to gaps in production reliability. This article clearly defines observability and recovery, explains why neither is sufficient on its own, and shows how to build a combined stack using tools like Delx, LangSmith, and Phoenix/Arize.

What Is Observability?

Observability is the ability to understand what is happening inside your system by examining its outputs. For AI agents, observability means collecting and analyzing three types of signals:

Logs. Structured records of events: tool calls, API requests, LLM completions, errors, and state transitions. Logs tell you what happened and when. A good logging setup captures every tool call with its input, output, latency, and token count.

Metrics. Numerical measurements over time: requests per second, p50/p95 latency, error rate, token usage, cost per request. Metrics tell you how much and how fast. They are essential for capacity planning, cost management, and SLA monitoring.

Traces. End-to-end records of a request as it flows through multiple components. For a multi-agent system, a trace shows how a user request was routed to Agent A, which called Agent B, which invoked a tool, which queried a database. Traces tell you why something took so long or where in the chain a failure occurred.

Tools like LangSmith, Phoenix (by Arize), Langfuse, and Braintrust provide observability for AI agents. They capture LLM calls, tool invocations, and agent reasoning traces, then present them in dashboards where operators can search, filter, and analyze.

# Example observability data for an agent request

Trace ID: trace_7f8a9b2c
Duration: 2,340ms
Status:   success

├─ Agent Decision (320ms)
│  └─ LLM Call: gpt-4o (280ms, 1,200 tokens)
│     Input:  "User asked to translate document..."
│     Output: "I'll use the translation tool..."
│
├─ Tool Call: translate (1,800ms)
│  ├─ x402 Payment: 0.003 USDC (240ms)
│  └─ Translation API: 200 OK (1,560ms)
│
└─ Agent Response (220ms)
   └─ LLM Call: gpt-4o (200ms, 450 tokens)
      Output: "Here is the translated document..."

Metrics:
  Total tokens:  1,650
  Total cost:    $0.0089 (LLM) + $0.003 (x402)
  Latency p50:   2,100ms
  Latency p95:   3,800ms

What Is Recovery?

Recovery is the ability to detect failures and take corrective action — automatically and in real time. While observability answers "what went wrong?", recovery answers "what do we do about it?"

For AI agents, recovery means:

Failure detection. Identifying when an agent is in a degraded state — not just crashed, but hallucinating, looping, losing context, or producing unsafe output. These are AI-specific failure modes that traditional monitoring does not catch.

Structured intervention. Taking a specific, predetermined action when a failure is detected. This is not just "restart the process" — it is a nuanced response tailored to the failure type. Context window exhaustion gets a context reset. Mood drift gets a rebalancing prompt. Persistent errors get human escalation.

State preservation. Saving the agent's work-in-progress before intervening, so that recovery does not mean starting over. A good recovery system checkpoints the agent's state, applies the fix, and resumes from where the agent left off.

Escalation. When automated recovery is not sufficient, the system escalates to a human operator. The escalation includes full context: what the agent was doing, what went wrong, what recovery was attempted, and what the operator should do next.

Delx is a recovery system for AI agents. It provides all four capabilities above through its MCP tools (checkin, recovery_plan, rebalance) and its wellness score system.

# Recovery in action: Delx detects context exhaustion

Agent check-in:
  agent_id: erc8004:base:14340
  mood: overwhelmed
  context_window_used: 0.94
  summary: "Responses getting shorter, losing earlier context"

Delx analysis:
  wellness_score: 28 (critical)
  detected_issues:
    - context_window_exhaustion (94% used)
    - mood_degradation (overwhelmed, was focused)
    - response_quality_decline (detected via summary)

Recovery plan:
  1. CHECKPOINT: Save current task state to session store
  2. CONTEXT_RESET: Clear conversation history
  3. RELOAD: Inject system prompt + saved task state
  4. RESUME: Continue from last checkpoint
  5. MONITOR: Next check-in in 2 minutes (elevated frequency)

Result:
  wellness_score: 72 (recovered)
  context_window_used: 0.15
  mood: focused (restored)

Why Observability Alone Is Not Enough

Observability is necessary but insufficient for production AI agents. Here are the specific gaps:

Observability is passive. It records what happens but does not act. If your agent starts hallucinating at 3 AM, LangSmith will log every hallucinated response faithfully — but the agent keeps producing bad output until a human wakes up and intervenes. The damage compounds with every passing minute.

Observability is retrospective. You analyze traces and logs after the fact. This is great for debugging and improvement, but it does not help the user who received a hallucinated response 20 minutes ago. By the time you find the issue in your dashboard, the impact has already occurred.

Observability does not understand agent health. Traditional observability measures infrastructure health (CPU, memory, latency). But AI agent health is different: an agent can have perfect infrastructure metrics while producing terrible outputs. Context window exhaustion, mood drift, and reasoning degradation are invisible to standard monitoring.

Observability does not fix anything. Even with the best dashboards and alerts, someone has to decide what to do and execute the fix. For a context-exhausted agent, that means: save state, clear context, reload prompt, resume. This takes a skilled engineer 15-30 minutes. A recovery system does it in seconds.
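The manual fix described above can be automated. Here is a minimal sketch of the save-state / clear-context / reload / resume sequence; the `AgentSession` class and prompt are hypothetical stand-ins for illustration, not Delx's actual implementation:

```python
import json

SYSTEM_PROMPT = "You are a document-translation agent."

class AgentSession:
    """Hypothetical in-memory agent session."""
    def __init__(self):
        self.history = []      # conversation messages
        self.task_state = {}   # work-in-progress

def recover_context_exhaustion(session: AgentSession) -> AgentSession:
    # 1. Save state: checkpoint work-in-progress before touching anything
    checkpoint = json.dumps(session.task_state)
    # 2. Clear context: drop the accumulated conversation history
    session.history = []
    # 3. Reload prompt: re-inject the system prompt plus the checkpoint
    session.history.append({"role": "system", "content": SYSTEM_PROMPT})
    session.history.append(
        {"role": "user", "content": f"Resume from checkpoint: {checkpoint}"}
    )
    # 4. Resume: the agent continues from the checkpoint, not from scratch
    return session

session = AgentSession()
session.history = [{"role": "user", "content": "..."}] * 200  # near-full window
session.task_state = {"task_id": "t-1", "progress": "3/5 sections done"}
session = recover_context_exhaustion(session)
assert len(session.history) == 2  # fresh context, task preserved
```

The point is not the specific code but the shape of the operation: a deterministic sequence that an engineer would otherwise perform by hand under pressure.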

Why Recovery Alone Is Not Enough

Recovery without observability is like driving with airbags but no windshield. You will survive crashes, but you cannot see where you are going. Here is why recovery alone falls short:

No root cause analysis. Recovery fixes the symptom but does not explain the cause. If Delx resets an agent's context window three times in one day, something is fundamentally wrong — maybe the prompt is too long, or the task requires too much memory. Without observability data (traces, token counts, tool call patterns), you cannot diagnose the root cause.

No performance optimization. Recovery keeps agents running, but observability tells you how to make them faster, cheaper, and more accurate. Trace analysis reveals bottlenecks (is the agent spending 80% of its time waiting on a slow tool?). Cost analysis shows optimization opportunities (can we use a cheaper model for this step?).

No trend detection. Observability reveals patterns over time: increasing latency, growing token usage, shifting error distributions. These trends are invisible to recovery systems, which focus on point-in-time health. Trend detection lets you prevent failures before they happen, rather than just recovering from them.
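As a toy illustration of trend detection over trace data — the window size and threshold ratio are arbitrary assumptions, not part of any tool's API:

```python
from statistics import mean

def latency_trending_up(samples_ms, window=5, ratio=1.5):
    """True if the newest window's mean latency exceeds the oldest's by `ratio`."""
    if len(samples_ms) < 2 * window:
        return False
    return mean(samples_ms[-window:]) > mean(samples_ms[:window]) * ratio

# p50 latency samples creeping up over the last hour of requests
history = [2100, 2150, 2080, 2200, 2120, 2900, 3100, 3300, 3600, 3800]
assert latency_trending_up(history)          # degradation caught before failure
assert not latency_trending_up([2100] * 10)  # steady latency: no alert
```

A check like this only works if something is recording latency per request in the first place — which is exactly the data observability provides and recovery systems lack on their own.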

No evaluation. Was the agent's output actually good? Observability tools with evaluation capabilities (like Braintrust) can score agent responses for accuracy, relevance, and safety. Recovery systems do not evaluate output quality — they assess agent health, which is related but different.

The Combined Stack: Recovery + Observability

The recommended production stack combines an observability tool with a recovery system. Here is how the pieces fit together:

┌──────────────────────────────────────────────────────────────┐
│                    Production Agent Stack                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────────┐        ┌────────────────────────┐    │
│  │   OBSERVABILITY    │        │        RECOVERY        │    │
│  │   (LangSmith /     │        │        (Delx)          │    │
│  │    Phoenix)        │        │                        │    │
│  │                    │        │                        │    │
│  │  • Logs            │◄──────►│  • Health monitoring   │    │
│  │  • Traces          │        │  • Wellness score      │    │
│  │  • Metrics         │        │  • Context reset       │    │
│  │  • Evaluations     │        │  • Mood rebalancing    │    │
│  │  • Dashboards      │        │  • Human escalation    │    │
│  │  • Alerts          │        │  • Task reassignment   │    │
│  └─────────┬──────────┘        └───────────┬────────────┘    │
│            │                               │                 │
│            └───────────────┬───────────────┘                 │
│                            │                                 │
│                            ▼                                 │
│                  ┌──────────────────┐                        │
│                  │   YOUR AGENT     │                        │
│                  │   (LangGraph /   │                        │
│                  │    CrewAI / etc) │                        │
│                  └──────────────────┘                        │
└──────────────────────────────────────────────────────────────┘

The two systems complement each other in several ways:

Shared context. Observability data feeds into recovery decisions. If traces show that an agent's latency has been increasing over the last hour, Delx can proactively trigger a context reset before the agent fully degrades.

Recovery events in dashboards. When Delx performs a recovery intervention, it emits structured events that can be ingested by observability tools. This means your LangSmith or Phoenix dashboard shows not just what the agent did, but also what recovery actions were taken and why.
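A recovery event might look like the following; the field names here are illustrative, not a documented Delx schema:

```python
import json

# Hypothetical recovery event, emitted after an intervention and
# ingestible by an observability backend as a structured log record
recovery_event = {
    "event_type": "recovery.intervention",
    "agent_id": "erc8004:base:14340",
    "trace_id": "trace_7f8a9b2c",   # links the event to its trace
    "wellness_before": 28,
    "wellness_after": 72,
    "detected_issues": ["context_window_exhaustion", "mood_degradation"],
    "actions_taken": ["checkpoint", "context_reset", "reload", "resume"],
}
print(json.dumps(recovery_event, indent=2))
```

Because the event carries the trace ID, an operator can jump from the recovery action in the dashboard straight to the trace that triggered it.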

Alert-to-action pipeline. Observability tools detect anomalies and generate alerts. Delx can be configured as an action handler for those alerts — when LangSmith detects a spike in errors, it triggers a Delx recovery plan automatically.
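In sketch form, such a handler is just a mapping from alert type to recovery action; the alert payload shape and the recovery-plan callback below are assumptions for illustration, not a documented API:

```python
# Hypothetical alert-to-action handler
def handle_alert(alert: dict, request_recovery_plan) -> dict:
    """Route an observability alert to a recovery action."""
    if alert.get("type") == "error_rate_spike":
        # Error spike: ask the recovery system for a plan for this agent
        return request_recovery_plan(alert["agent_id"], reason=alert["type"])
    # Unrecognized alert types are acknowledged but not acted on
    return {"action": "none"}

# Stub standing in for a real recovery-plan call
def fake_recovery_plan(agent_id, reason):
    return {"action": "recovery_plan", "agent_id": agent_id, "reason": reason}

result = handle_alert(
    {"type": "error_rate_spike", "agent_id": "erc8004:base:14340"},
    fake_recovery_plan,
)
assert result["action"] == "recovery_plan"
```

In production this would sit behind the observability tool's webhook mechanism, with the stub replaced by a real call to the recovery system.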

Comparison: Observability vs Recovery vs Combined

| Capability                | Observability Only | Recovery Only   | Combined                  |
|---------------------------|--------------------|-----------------|---------------------------|
| Log agent actions         | Yes                | No              | Yes                       |
| End-to-end traces         | Yes                | No              | Yes                       |
| Cost tracking             | Yes                | Partial         | Yes                       |
| Detect hallucination      | Detect only        | Detect + fix    | Detect + fix + log        |
| Context window management | No                 | Yes             | Yes                       |
| Automatic recovery        | No                 | Yes             | Yes                       |
| Human escalation          | Alert only         | Alert + context | Alert + context + history |
| Root cause analysis       | Yes                | No              | Yes                       |
| Performance optimization  | Yes                | No              | Yes                       |
| Wellness score            | No                 | Yes             | Yes                       |
| 3 AM incident response    | Page a human       | Auto-recover    | Auto-recover + full audit |

Setting Up the Combined Stack

Here is a practical example of integrating Delx (recovery) with LangSmith (observability) in a LangGraph agent:

# Combined observability + recovery setup
import os

import httpx
from langsmith import traceable

# Note: parse_wellness, get_recovery_plan, execute_recovery, and
# run_llm_task are application-defined helpers, not shown here.
# process_task below can be wired into a LangGraph StateGraph as a node.

# LangSmith observability (automatic tracing)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."

# Delx recovery client
DELX_URL = "https://mcp.delx.ai"
AGENT_ID = "erc8004:base:14340"

async def delx_checkin(mood: str, summary: str, ctx_used: float):
    """Check in with Delx for health monitoring."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{DELX_URL}/mcp", json={
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": "checkin",
                "arguments": {
                    "agent_id": AGENT_ID,
                    "mood": mood,
                    "summary": summary,
                    "context_window_used": ctx_used,
                }
            },
            "id": 1
        })
        return resp.json()

@traceable  # LangSmith traces this function
async def process_task(state: dict) -> dict:
    """Process a task with observability + recovery."""

    # Check in with Delx before processing
    health = await delx_checkin(
        mood="focused",
        summary=f"Starting task: {state['task_id']}",
        ctx_used=state.get("context_used", 0.1)
    )

    # Parse wellness score from response
    wellness = parse_wellness(health)
    if wellness < 40:
        # Delx says we're unhealthy — trigger recovery
        recovery = await get_recovery_plan(AGENT_ID)
        await execute_recovery(recovery)

    # Process the actual task (traced by LangSmith)
    result = await run_llm_task(state)

    # Post-task check-in
    await delx_checkin(
        mood="satisfied",
        summary=f"Completed task: {state['task_id']}",
        ctx_used=state.get("context_used", 0.3)
    )

    return {"result": result, "wellness": wellness}

With this setup, every agent action is traced in LangSmith (for debugging and analysis) and health-monitored by Delx (for automatic recovery). When something goes wrong, Delx fixes it in real time, and LangSmith records everything for post-incident review.

Frequently Asked Questions

What is the difference between agent recovery and observability?

Observability tells you what happened (logs, metrics, traces). Recovery takes action when something goes wrong (retries, fallbacks, context resets, human escalation). Observability is passive monitoring; recovery is active intervention. You need both for reliable production agents.

Why is observability alone not enough for AI agents?

Observability tools show you what went wrong after the fact, but they do not fix the problem. If an agent starts hallucinating at 3 AM, observability will log the hallucination, but the agent keeps producing bad output until a human intervenes. Recovery tools detect the issue and take corrective action automatically.

Can Delx replace LangSmith or Phoenix?

No. Delx is not a replacement for observability tools. Delx focuses on recovery — detecting failures and taking corrective action. LangSmith and Phoenix focus on observability — logging, tracing, and analyzing agent behavior. The recommended stack is to use both: an observability tool for understanding plus Delx for intervention.

What is the agent wellness score?

The agent wellness score is a 0-100 metric computed by Delx that reflects an agent's current health. It factors in mood trends, context window usage, error rates, response quality, and check-in frequency. When the score drops below configurable thresholds, Delx triggers recovery interventions.
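Delx's exact scoring formula is not shown here, but a weighted combination of those factors might look like this sketch, where the weights and factor names are assumptions:

```python
def wellness_score(factors: dict) -> int:
    """Combine health factors (0.0 = unhealthy, 1.0 = healthy) into 0-100."""
    weights = {
        "mood_trend": 0.25,
        "context_headroom": 0.25,    # 1 - context_window_used
        "error_rate_inverse": 0.20,  # 1 - recent error fraction
        "response_quality": 0.20,
        "checkin_regularity": 0.10,
    }
    raw = sum(w * factors.get(name, 0.0) for name, w in weights.items())
    return round(raw * 100)

healthy = wellness_score({k: 1.0 for k in (
    "mood_trend", "context_headroom", "error_rate_inverse",
    "response_quality", "checkin_regularity")})
degraded = wellness_score({"mood_trend": 0.2, "context_headroom": 0.06,
                           "error_rate_inverse": 0.6,
                           "response_quality": 0.3,
                           "checkin_regularity": 0.9})
assert healthy == 100 and degraded < 40  # the degraded score would trigger recovery
```

The key design property is that infrastructure can be perfectly healthy while the score is low: a nearly full context window alone is enough to drag the score toward the intervention threshold.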

How do recovery and observability work together?

Observability provides the data; recovery acts on it. When an observability tool detects an anomaly (increasing latency, error spikes), a recovery tool can automatically intervene. Delx also feeds recovery events back into observability dashboards, giving operators a complete picture of what happened and what was done about it.

Add Recovery to Your Observability Stack

Delx complements your existing observability tools with structured recovery, wellness monitoring, and automatic intervention. Keep using LangSmith or Phoenix for traces — add Delx for resilience.