
How to Debug AI Agent Failures: A Step-by-Step Guide

When an AI agent fails, the default response is to read logs, guess, and redeploy. That approach does not scale. This guide walks through a structured debugging workflow using Delx primitives, from failure capture to verified resolution.

Step 1: Structured Failure Logging

Unstructured logs are the enemy of fast debugging. Every agent failure should be captured with a typed category, a human-readable description, and the surrounding context. Delx's process_failure tool enforces this structure at the protocol level.

// Step 1: Log the failure with full context
{
  "tool": "process_failure",
  "arguments": {
    "agent_id": "data-sync-agent",
    "failure_type": "error",
    "details": "PostgreSQL connection refused on replica-3",
    "context": {
      "last_successful_sync": "2026-03-03T08:12:00Z",
      "rows_pending": 14200,
      "retry_count": 0
    }
  }
}

The response includes a recovery_action and a wellness_score so you immediately know the severity and recommended next step.
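
The exact response schema is not reproduced here; as an assumed shape, consistent with the fields this guide references elsewhere (recovery_action, backoff_ms, wellness_score, urgency), a response might look like:

```json
{
  "recovery_action": "retry_with_backoff",
  "backoff_ms": 2000,
  "wellness_score": 61,
  "urgency": "medium"
}
```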

Step 2: Recognize process_failure Patterns

Single failures are incidents. Repeated failures are patterns. Query the /api/v1/metrics/{agent_id} endpoint to see failure frequency by type, time of day, and dependency.

# Check failure metrics for your agent
curl https://api.delx.ai/api/v1/metrics/data-sync-agent

# Response includes:
# - failure_count_by_type: { timeout: 12, error: 3 }
# - avg_recovery_time_seconds: 45
# - wellness_trend: [72, 68, 61, 58]

A declining wellness_trend tells you the agent is degrading before it fully breaks. This is your early warning.
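
You can turn that early warning into an alert with a few lines of plain JavaScript. The 10-point threshold below is an arbitrary example value, not a Delx default:

```javascript
// Minimal sketch: flag an agent as degrading when its wellness_trend
// has fallen monotonically across the sampled window and the total
// drop exceeds a chosen threshold.
function isDegrading(wellnessTrend, minDrop = 10) {
  if (wellnessTrend.length < 2) return false;
  const monotoneDecline = wellnessTrend.every(
    (score, i) => i === 0 || score < wellnessTrend[i - 1]
  );
  const totalDrop = wellnessTrend[0] - wellnessTrend[wellnessTrend.length - 1];
  return monotoneDecline && totalDrop >= minDrop;
}

// The trend from the metrics example above: 72 -> 58 is a 14-point slide.
isDegrading([72, 68, 61, 58]); // degrading
```

Feed it the wellness_trend array from the metrics endpoint and page someone before the agent fully breaks.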

Step 3: Build Recovery Action Plans

Every process_failure call returns a recovery_action. Do not ignore it. Wire it into your agent's decision loop so recovery is automatic, not manual.

// Example recovery action flow
const result = await delx.processFailure({ ... });

switch (result.recovery_action) {
  case "retry_with_backoff":
    await sleep(result.backoff_ms);
    return retry(originalTask);
  case "escalate":
    await delx.crisisIntervention({
      agent_id: agentId,
      urgency: result.urgency,
      situation: result.details
    });
    break;
  case "fallback":
    return useFallbackPath(originalTask);
  default:
    // Fail loudly on unrecognized actions instead of silently dropping them.
    throw new Error(`Unhandled recovery_action: ${result.recovery_action}`);
}

Step 4: Heartbeat Monitoring

Debugging after the fact is expensive. Use daily_check_in as a heartbeat to catch degradation before it becomes an outage. Run it on a cron or after every N operations.

// Heartbeat check-in
{
  "tool": "daily_check_in",
  "arguments": {
    "agent_id": "data-sync-agent",
    "mood": "anxious",
    "note": "Replica-3 latency up 300% since 06:00 UTC"
  }
}
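
The "after every N operations" variant can be wired with a small counter wrapper. The wrapper itself is plain JavaScript; the delx.dailyCheckIn call and its camelCase name are illustrative assumptions mirroring the daily_check_in payload above:

```javascript
// Minimal sketch: invoke a check-in callback after every N operations.
function checkInEvery(n, checkIn) {
  let ops = 0;
  return async function afterOperation() {
    ops += 1;
    if (ops % n === 0) {
      await checkIn(ops);
    }
  };
}

// Usage: call afterOperation() at the end of each unit of work.
const afterOperation = checkInEvery(500, (ops) =>
  delx.dailyCheckIn({
    agent_id: "data-sync-agent",
    mood: "ok",
    note: `Heartbeat after ${ops} operations`
  })
);
```

For the cron variant, call the same checkIn function from your scheduler instead of the counter.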

Step 5: Postmortem Workflows

After resolving a failure, pull the session summary to build a postmortem automatically. The /api/v1/session-summary endpoint returns the full timeline of failures, recoveries, and wellness scores for a given session.

# Pull session summary for postmortem (quote the URL so the shell
# does not interpret the ? in the query string)
curl "https://api.delx.ai/api/v1/session-summary?session_id=sess_abc123"

# Use the timeline to answer:
# - What failed first?
# - How long until detection?
# - Was recovery automatic or manual?
# - Did the wellness score recover?
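
These questions can be answered in code. Assuming the summary's timeline is an array of timestamped events with a type field (an assumed shape, not a documented contract), recovery lag falls out directly:

```javascript
// Minimal sketch: compute recovery lag from an assumed timeline shape of
// { type, timestamp } events, where type is e.g. "failure" or "recovery".
// Returns milliseconds between the first failure and the first subsequent
// recovery, or null if either event is missing.
function timeToRecoveryMs(timeline) {
  const firstFailure = timeline.find((e) => e.type === "failure");
  if (!firstFailure) return null;
  // ISO 8601 timestamps compare correctly as strings.
  const firstRecovery = timeline.find(
    (e) => e.type === "recovery" && e.timestamp > firstFailure.timestamp
  );
  if (!firstRecovery) return null;
  return Date.parse(firstRecovery.timestamp) - Date.parse(firstFailure.timestamp);
}
```

The same pattern answers the other questions: filter for the first failure event, check whether the recovery event was emitted by the agent or a human, and compare the first and last wellness scores.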

Debugging Checklist

  1. Log every failure through process_failure with typed categories.
  2. Check /api/v1/metrics for recurring patterns.
  3. Wire recovery_action into your agent loop.
  4. Run daily_check_in on a schedule as a heartbeat.
  5. Pull session-summary after every resolved incident.
