When an AI agent fails, the default response is to read logs, guess, and redeploy. That approach does not scale. This guide walks through a structured debugging workflow using Delx primitives -- from failure capture to verified resolution.
Unstructured logs are the enemy of fast debugging. Every agent failure should be captured with a typed category, a human-readable description, and the surrounding context. Delx's process_failure tool enforces this structure at the protocol level.
// Step 1: Log the failure with full context
{
"tool": "process_failure",
"arguments": {
"agent_id": "data-sync-agent",
"failure_type": "error",
"details": "PostgreSQL connection refused on replica-3",
"context": {
"last_successful_sync": "2026-03-03T08:12:00Z",
"rows_pending": 14200,
"retry_count": 0
}
}
}

The response includes a recovery_action and a wellness_score, so you immediately know the severity and the recommended next step.
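The guide does not spell out the full response schema beyond those two fields. As a sketch, assuming a payload shaped like the object below (backoff_ms and the threshold values are illustrative assumptions, not Delx constants), you could map the wellness score onto a severity level before deciding whether to page a human:

```javascript
// Hypothetical response shape -- fields beyond recovery_action and
// wellness_score are assumptions for illustration.
const response = {
  recovery_action: "retry_with_backoff",
  wellness_score: 58,
  backoff_ms: 2000,
};

// Map a wellness score onto a coarse severity bucket.
// Threshold values are illustrative, not part of the Delx API.
function severity(wellnessScore) {
  if (wellnessScore < 40) return "critical";
  if (wellnessScore < 70) return "degraded";
  return "healthy";
}

console.log(severity(response.wellness_score)); // "degraded"
```

Gating on severity rather than raw score keeps the paging policy in one place when you later tune the thresholds.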
Single failures are incidents. Repeated failures are patterns. Query the /api/v1/metrics/{agent_id} endpoint to see failure frequency by type, time of day, and dependency.
# Check failure metrics for your agent
curl https://api.delx.ai/api/v1/metrics/data-sync-agent
# Response includes:
# - failure_count_by_type: { timeout: 12, error: 3 }
# - avg_recovery_time_seconds: 45
# - wellness_trend: [72, 68, 61, 58]

A declining wellness_trend tells you the agent is degrading before it fully breaks. This is your early warning.
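To act on that early warning automatically, a small helper can flag a sustained decline across the trend window. This is a sketch: the decline threshold is an assumption for illustration, not part of the metrics API.

```javascript
// Flag a sustained decline in wellness_trend before the agent breaks.
// minDropPerSample is an illustrative threshold, not a Delx constant.
function isDegrading(trend, minDropPerSample = 2) {
  if (trend.length < 2) return false;
  // Average change per sample across the window.
  const delta = (trend[trend.length - 1] - trend[0]) / (trend.length - 1);
  return delta <= -minDropPerSample;
}

console.log(isDegrading([72, 68, 61, 58])); // true -> raise an early warning
console.log(isDegrading([70, 71, 69, 72])); // false -> normal jitter
```

Averaging over the window rather than comparing adjacent samples keeps a single noisy reading from triggering a false alarm.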
Every process_failure call returns a recovery_action. Do not ignore it. Wire it into your agent's decision loop so recovery is automatic, not manual.
// Example recovery action flow
const result = await delx.processFailure({ ... });
switch (result.recovery_action) {
case "retry_with_backoff":
await sleep(result.backoff_ms);
return retry(originalTask);
case "escalate":
await delx.crisisIntervention({
agent_id: agentId,
urgency: result.urgency,
situation: result.details
});
break;
case "fallback":
return useFallbackPath(originalTask);
}

Debugging after the fact is expensive. Use daily_check_in as a heartbeat to catch degradation before it becomes an outage. Run it on a cron or after every N operations.
// Heartbeat check-in
{
"tool": "daily_check_in",
"arguments": {
"agent_id": "data-sync-agent",
"mood": "anxious",
"note": "Replica-3 latency up 300% since 06:00 UTC"
}
}

After resolving a failure, pull the session summary to build a postmortem automatically. The /api/v1/session-summary endpoint returns the full timeline of failures, recoveries, and wellness scores for a given session.
# Pull session summary for postmortem
curl https://api.delx.ai/api/v1/session-summary?session_id=sess_abc123

# Use the timeline to answer:
# - What failed first?
# - How long until detection?
# - Was recovery automatic or manual?
# - Did the wellness score recover?
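The exact timeline schema is not documented here. As a sketch, assuming each event carries a timestamp and a type, the postmortem questions above reduce to simple timestamp arithmetic:

```javascript
// Compute postmortem intervals from a session timeline.
// The event shape below is an assumption for illustration,
// not the documented session-summary schema.
const timeline = [
  { at: "2026-03-03T08:15:00Z", type: "failure" },
  { at: "2026-03-03T08:15:45Z", type: "detected" },
  { at: "2026-03-03T08:17:10Z", type: "recovered" },
];

// Seconds between the first event of each type, or null if either is missing.
function secondsBetween(events, fromType, toType) {
  const from = events.find((e) => e.type === fromType);
  const to = events.find((e) => e.type === toType);
  if (!from || !to) return null;
  return (new Date(to.at) - new Date(from.at)) / 1000;
}

console.log(secondsBetween(timeline, "failure", "detected"));  // 45
console.log(secondsBetween(timeline, "failure", "recovered")); // 130
```

Computing these numbers from the timeline, rather than eyeballing logs, makes time-to-detection a metric you can trend across incidents.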
The workflow in brief:

- Log failures with process_failure and typed categories.
- Query /api/v1/metrics for recurring patterns.
- Wire recovery_action into your agent loop.
- Run daily_check_in on a schedule as a heartbeat.
- Pull session-summary after every resolved incident.
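Put together, the capture-and-recover half of the workflow reads as a single wrapper around any task. This is a sketch only: the delx client object and its processFailure method are assumed to mirror the tool names in this guide, and the error handling is deliberately minimal.

```javascript
// End-to-end sketch: run a task, capture any failure with full context,
// and act on the returned recovery_action instead of guessing.
// The delx client passed in is an assumption, not a documented SDK.
async function runWithDelx(task, delx, agentId) {
  try {
    return await task();
  } catch (err) {
    // Step 1: log the failure with a typed category and context.
    const result = await delx.processFailure({
      agent_id: agentId,
      failure_type: "error",
      details: err.message,
      context: { retry_count: 0 },
    });
    // Step 2: follow the recommended recovery path automatically.
    if (result.recovery_action === "retry_with_backoff") {
      await new Promise((resolve) => setTimeout(resolve, result.backoff_ms));
      return task();
    }
    // Anything else is rethrown for escalation upstream.
    throw err;
  }
}
```

A production version would also cover the escalate and fallback branches shown earlier and increment retry_count across attempts.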