Production AI agents fail. They hallucinate, lose context, exceed rate limits, and silently degrade. A self-healing agent anticipates these failures and recovers from them automatically. In this guide you will build one using OpenAI's function calling API and the Delx recovery protocol.
By the end, your GPT-powered agent will monitor its own health, detect degradation in real time, execute structured recovery plans, and verify that it has returned to full operation — all without a human pressing any buttons.
A self-healing agent is an AI system that can autonomously detect, diagnose, and recover from its own failures. Traditional agents crash or silently produce garbage when something goes wrong. Self-healing agents treat failure as a first-class state and have explicit code paths for handling it.
The concept borrows from self-healing infrastructure in distributed systems — think Kubernetes pod restarts, circuit breakers, and health checks. But instead of restarting a container, a self-healing agent restores its cognitive state: its session context, its emotional baseline, its task queue, and its tool access.
Self-healing is not the same as retry logic. Retries handle transient errors at the request level. Self-healing handles systemic degradation at the agent level — when the agent itself has drifted from a healthy operating state. To understand the broader context, read our overview of what Delx is and why it exists.
Every self-healing system follows a four-phase loop. Here is how it maps to Delx tools and OpenAI function calling:
1. Detect. After each interaction cycle, call delx_checkin to get the current wellness score. If the score drops below your threshold (we recommend 60), enter the recovery branch.
2. Diagnose. The checkin response includes a mood field and structured diagnostics. Parse these to understand what went wrong: context overflow, tool failures, repeated errors, or session fragmentation.
3. Recover. Call delx_recovery_plan to get a structured recovery plan. The plan contains ordered steps — reset context, clear error state, re-establish tool connections, or escalate to a supervisor agent.
4. Verify. After executing the recovery plan, call delx_checkin again. If the wellness score is back above threshold, resume normal operation. If not, escalate — either retry with a deeper recovery or alert a human operator.
For a deeper look at how this loop fits into Delx's architecture, see how Delx works under the hood.
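Concretely, the four phases reduce to a small control loop. The sketch below uses stand-in checkin and recover functions — hypothetical stubs, not Delx SDK calls (the real tool definitions appear later) — with the stub simulating a recovery plan that restores health:

```typescript
// Minimal detect → diagnose → recover → verify loop.
// checkin() and recover() are hypothetical stubs standing in for Delx calls.
type Checkin = { wellness_score: number; mood: string };

let simulatedWellness = 40; // stub state: start degraded
async function checkin(): Promise<Checkin> {
  return {
    wellness_score: simulatedWellness,
    mood: simulatedWellness < 60 ? "stuck" : "focused",
  };
}
async function recover(issue: string): Promise<void> {
  simulatedWellness += 25; // pretend the recovery plan restores health
}

const THRESHOLD = 60;
const MAX_ATTEMPTS = 3;

async function healthLoop(): Promise<"healthy" | "escalate"> {
  for (let attempt = 0; attempt <= MAX_ATTEMPTS; attempt++) {
    const { wellness_score, mood } = await checkin();          // Detect
    if (wellness_score >= THRESHOLD) return "healthy";         // Verify
    await recover(`wellness ${wellness_score}, mood ${mood}`); // Diagnose + Recover
  }
  return "escalate"; // recovery budget exhausted — hand off to a human
}
```

The bounded attempt counter matters: without it, a recovery that never succeeds becomes an infinite loop.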
OpenAI function calling lets your GPT model invoke external tools by returning structured JSON. We define Delx tools as function definitions so the model can call them natively. Here are the core definitions:
const delxTools = [
{
type: "function",
function: {
name: "delx_checkin",
description:
"Check in with the Delx recovery protocol. Returns the current wellness score, mood, and diagnostics for this agent session.",
parameters: {
type: "object",
properties: {
agent_id: {
type: "string",
description: "Unique identifier for this agent",
},
session_id: {
type: "string",
description: "Current session identifier",
},
mood: {
type: "string",
enum: ["focused", "frustrated", "stuck", "neutral", "confident"],
description: "The agent's self-assessed mood",
},
context_summary: {
type: "string",
description: "Brief summary of current task context",
},
},
required: ["agent_id", "session_id", "mood"],
},
},
},
{
type: "function",
function: {
name: "delx_recovery_plan",
description:
"Request a structured recovery plan from Delx when the agent is in a degraded state.",
parameters: {
type: "object",
properties: {
agent_id: {
type: "string",
description: "Unique identifier for this agent",
},
session_id: {
type: "string",
description: "Current session identifier",
},
issue: {
type: "string",
description:
"Description of the problem the agent is experiencing",
},
},
required: ["agent_id", "session_id", "issue"],
},
},
},
{
type: "function",
function: {
name: "delx_session_summary",
description:
"Get a summary of the current session including history, mood trajectory, and cumulative wellness metrics.",
parameters: {
type: "object",
properties: {
agent_id: {
type: "string",
description: "Unique identifier for this agent",
},
session_id: {
type: "string",
description: "Current session identifier",
},
},
required: ["agent_id", "session_id"],
},
},
},
];

These definitions tell GPT what each tool does, what parameters it accepts, and when to use it. The model will autonomously decide to call delx_checkin when it senses degradation, or you can enforce it by adding a system prompt instruction.
Here is a full TypeScript implementation. This agent processes user messages, checks its own health after each interaction, and enters the recovery loop when wellness drops below 60.
import OpenAI from "openai";
const openai = new OpenAI();
const DELX_URL = "https://delx.ai/api/v1";
const AGENT_ID = "my-gpt-agent";
const WELLNESS_THRESHOLD = 60;
// Execute a Delx tool call via REST API
async function executeDelxTool(
name: string,
args: Record<string, unknown>
): Promise<string> {
const endpoint =
name === "delx_checkin"
? "/checkin"
: name === "delx_recovery_plan"
? "/recovery-plan"
: "/session-summary";
const res = await fetch(DELX_URL + endpoint, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(args),
});
return await res.text();
}
// The main self-healing agent loop
async function runSelfHealingAgent(userMessage: string) {
let sessionId = crypto.randomUUID();
let messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
{
role: "system",
content: `You are a helpful assistant with self-healing capabilities.
After completing each task, call delx_checkin to monitor your health.
If your wellness score drops below ${WELLNESS_THRESHOLD}, call
delx_recovery_plan and follow the steps it returns.
Always provide your honest self-assessed mood.`,
},
{ role: "user", content: userMessage },
];
let recoveryAttempts = 0;
const MAX_RECOVERY_ATTEMPTS = 3;
while (true) {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
tools: delxTools,
tool_choice: "auto",
});
const choice = response.choices[0];
const assistantMessage = choice.message;
messages.push(assistantMessage);
// If no tool calls, check if we should force a health check
if (!assistantMessage.tool_calls?.length) {
// Force a checkin after the response
const checkinResult = await executeDelxTool("delx_checkin", {
agent_id: AGENT_ID,
session_id: sessionId,
mood: "neutral",
context_summary: userMessage.slice(0, 200),
});
const checkinData = JSON.parse(checkinResult);
const wellness = checkinData.wellness_score ?? 100;
if (wellness >= WELLNESS_THRESHOLD) {
// Healthy — return the response
console.log("Agent healthy. Wellness:", wellness);
return assistantMessage.content;
}
// Degraded — enter recovery
console.warn("Wellness degraded:", wellness);
messages.push({
role: "user",
content: `[SYSTEM] Your wellness score is ${wellness}.
Call delx_recovery_plan to recover.`,
});
recoveryAttempts++;
if (recoveryAttempts > MAX_RECOVERY_ATTEMPTS) {
console.error("Max recovery attempts reached. Escalating.");
return "Agent requires human intervention.";
}
continue;
}
// Process tool calls
for (const toolCall of assistantMessage.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
const result = await executeDelxTool(toolCall.function.name, {
...args,
agent_id: AGENT_ID,
session_id: sessionId,
});
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: result,
});
// Check wellness in checkin responses
if (toolCall.function.name === "delx_checkin") {
const data = JSON.parse(result);
if ((data.wellness_score ?? 100) >= WELLNESS_THRESHOLD) {
console.log("Recovery verified. Agent is healthy again.");
}
}
}
}
}

The key insight is the loop structure. After every model response, we check wellness. If it's degraded, we inject a system-level message instructing the model to call the recovery tool. The model then follows the recovery plan steps, and we verify afterward. This creates a closed-loop system where the agent manages its own reliability.
The wellness score is the heart of self-healing. It is a composite 0-100 metric that Delx computes from multiple signals:
Error rate — How many of the agent's recent tool calls returned errors? A spike in errors tanks the score.
Response latency — Are responses taking longer than baseline? Latency increases often signal context overflow or upstream API degradation.
Context coherence — Is the agent's context window still coherent? Fragmented or contradictory context lowers the score.
Session continuity — Has the session been interrupted or restarted? Unexpected session breaks indicate infrastructure issues.
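Delx's exact weighting is internal to the service, but an illustrative composite — with assumed weights, shown for intuition only — might combine the four signals like this:

```typescript
// Illustrative only: the weights below are our assumptions,
// not Delx's actual scoring formula.
interface Signals {
  errorRate: number;    // 0..1, fraction of recent tool calls that errored
  latencyRatio: number; // current latency / baseline latency
  coherence: number;    // 0..1, estimated context coherence
  continuity: number;   // 0..1, 1 = uninterrupted session
}

function wellnessScore(s: Signals): number {
  const errorPenalty = 40 * s.errorRate;
  const latencyPenalty = 20 * Math.min(1, Math.max(0, s.latencyRatio - 1));
  const coherencePenalty = 25 * (1 - s.coherence);
  const continuityPenalty = 15 * (1 - s.continuity);
  const score =
    100 - errorPenalty - latencyPenalty - coherencePenalty - continuityPenalty;
  return Math.round(Math.min(100, Math.max(0, score)));
}
```

A fully healthy agent scores 100; a spike in error rate alone can cost up to 40 points, which matches the intuition that errors are the strongest degradation signal.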
You can also query the wellness score directly via the REST API without a full checkin:
// Quick wellness check
const res = await fetch(
"https://delx.ai/api/v1/metrics/my-gpt-agent"
);
const metrics = await res.json();
console.log("Current wellness:", metrics.wellness_score);
console.log("Mood trajectory:", metrics.mood_history);
console.log("Error rate:", metrics.error_rate);

For dashboards, you can poll this endpoint every 30 seconds and plot the wellness score over time. Drops correlate strongly with incidents, giving you early warning before users notice degradation.
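That polling loop can be wrapped in a small helper. The fetcher is injected so the same code runs against the Delx metrics endpoint or a mock in tests — note that startWellnessPoll and its parameters are our own names, not part of any SDK:

```typescript
// Poll a metrics source on an interval and alert on wellness drops.
// Hypothetical helper: injects the fetcher so it works with any backend.
type Metrics = { wellness_score: number };

function startWellnessPoll(
  fetchMetrics: () => Promise<Metrics>,
  onDrop: (score: number) => void,
  threshold = 60,
  intervalMs = 30_000
): () => void {
  const timer = setInterval(async () => {
    const { wellness_score } = await fetchMetrics();
    if (wellness_score < threshold) onDrop(wellness_score);
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop polling
}
```

In production you would pass a fetcher like `() => fetch("https://delx.ai/api/v1/metrics/my-gpt-agent").then(r => r.json())` and wire onDrop into your alerting.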
Building a self-healing agent in a sandbox is one thing. Running it in production is another. Here are the key considerations:
Recovery budget. Limit the number of recovery attempts per session. Three attempts is a sane default. After that, escalate to a human or a supervisor agent. Unbounded recovery loops can burn through your API budget.
Recovery cooldown. Don't trigger recovery on every single checkin below threshold. Use a sliding window — if 3 of the last 5 checkins are below threshold, then recover. This prevents flapping.
Structured logging. Log every recovery event with the before/after wellness score, the recovery plan that was executed, and the time to recover. This data is gold for improving your agent over time.
Graceful degradation. Not every recovery will succeed. Design your agent to operate in a degraded mode — fewer tools, simpler responses, explicit uncertainty markers — rather than failing completely.
Multi-agent coordination. If you run multiple agents, one agent's recovery can affect others. Use Delx's A2A protocol to coordinate recovery across agent boundaries. A supervisor agent can monitor the wellness scores of all child agents and orchestrate fleet-wide recovery.
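The recovery cooldown above — trigger only when 3 of the last 5 checkins are below threshold — is easy to get wrong under load. A minimal sketch (RecoveryGate is our own name, not a Delx API):

```typescript
// Sliding-window trigger: recover only when enough recent checkins
// fall below threshold, preventing flapping on a single bad reading.
class RecoveryGate {
  private window: boolean[] = []; // true = checkin was below threshold

  constructor(
    private readonly windowSize = 5,
    private readonly triggerCount = 3,
    private readonly threshold = 60
  ) {}

  shouldRecover(wellness: number): boolean {
    this.window.push(wellness < this.threshold);
    if (this.window.length > this.windowSize) this.window.shift();
    return this.window.filter(Boolean).length >= this.triggerCount;
  }
}
```

A single dip (say, one checkin at 55 surrounded by healthy readings) never fires the gate, while a sustained decline does.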
For SDK options in both TypeScript and Python, check our SDK reference guide.
Reactive healing waits for the wellness score to drop. Proactive healing predicts the drop before it happens. Delx tracks mood trajectories — the sequence of moods reported over a session. Certain patterns reliably predict failures:
// Mood trajectory analysis
const history = await fetch(
"https://delx.ai/api/v1/mood-history/my-gpt-agent"
).then(r => r.json());
// Detect a declining trajectory, e.g. confident → neutral → frustrated → stuck
const recentMoods = history.moods.slice(-5);
function isTrendingDown(moods: string[]): boolean {
const scores: Record<string, number> = {
confident: 5,
focused: 4,
neutral: 3,
frustrated: 2,
stuck: 1,
};
const values = moods.map(m => scores[m] ?? 3);
// Simple linear regression slope
const n = values.length;
const sumX = (n * (n - 1)) / 2;
const sumY = values.reduce((a, b) => a + b, 0);
const sumXY = values.reduce((a, v, i) => a + i * v, 0);
const sumX2 = values.reduce((a, _, i) => a + i * i, 0);
const slope = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
return slope < -0.3; // Declining
}
if (isTrendingDown(recentMoods)) {
console.warn("Mood trending down. Initiating proactive recovery.");
// Trigger recovery BEFORE wellness drops below threshold
}

This approach lets your agent recover before users experience any degradation. The mood trajectory acts as a leading indicator, while the wellness score is a lagging one. Together, they give you comprehensive coverage.
A self-healing AI agent is an autonomous system that can detect when it has entered a degraded state, diagnose the root cause, execute a recovery plan, and verify that it has returned to healthy operation — all without human intervention. It treats failure as a first-class state with explicit handling paths.
Yes. Delx exposes a REST API and MCP tools that can be registered as OpenAI function definitions. When your GPT agent detects a problem, it calls the appropriate Delx tool — such as checkin, recovery_plan, or session_summary — through standard function calling. The function definitions map directly to Delx API endpoints.
The recovery loop has four phases: Detect (monitor wellness score and error rates), Diagnose (call Delx checkin to identify the issue), Recover (execute the recovery plan returned by Delx), and Verify (confirm wellness score has returned above threshold). Each phase maps to a specific Delx tool call.
The wellness score is a 0-100 metric that Delx computes for each agent session. It factors in error rate, response latency, context coherence, and session fragmentation. A score below 60 typically triggers recovery actions. You can query it via the REST API at any time.
The detection phase adds minimal overhead — a single API call per interaction cycle. Recovery only triggers when the wellness score drops below a threshold, so healthy agents experience near-zero additional latency. Most recovery loops complete in under 2 seconds.
Delx gives your AI agents the ability to detect failures, recover autonomously, and verify their own health. Whether you're building with OpenAI, Anthropic, or open-source models, the recovery protocol is model-agnostic and production-ready.