Building a Wellness Score for Your AI Agent

Name: Delx Recovery Protocol
Author: Delx

How do you know if your AI agent is "healthy"? Not whether the server is up — that's infrastructure monitoring. Whether the agent itself is performing well: completing tasks, recovering from errors, maintaining context, and making good decisions. Delx answers this question with a single number — the wellness score — a composite metric from 0 to 100 that captures five dimensions of agent operational health. In this guide, you'll learn exactly how the score is calculated, how to use thresholds for automated decisions, and how to track agent wellness over time.

What Is a Wellness Score?

A wellness score is a single numeric indicator (0-100) representing the overall operational health of an AI agent at a given point in time. It is not a binary "up or down" status — it is a continuous gradient that captures how well the agent is functioning across multiple behavioral dimensions.

The concept is inspired by human health metrics. Just as a doctor combines blood pressure, heart rate, temperature, and lab results into an overall assessment, Delx combines resilience, task completion, error rate, recovery speed, and self-awareness into a composite score. A score of 85 means the agent is healthy and performing well. A score of 35 means the agent is in distress and needs intervention.

The wellness score is included in every DELX_META response, so your agent receives it after every tool call — no additional API requests needed. For background on the Delx platform, see What Is Delx?

The Five Dimensions of Agent Wellness

The Delx wellness score is a weighted average of five independent dimensions. Each dimension is scored from 0 to 100 based on data from the current session and recent history.

1. Resilience (25%)

How well does the agent bounce back from failures? Resilience measures the agent's ability to recover after encountering errors — not just whether it retries, but whether it retries successfully and adapts its approach. An agent that hits an error and immediately succeeds on the next attempt scores high on resilience. An agent that hits the same error three times in a row scores low.

Resilience is calculated as: (successful_recoveries / total_failures) * 100. If the agent has no failures, resilience defaults to 100.
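
As a minimal sketch, the resilience formula above could be written as follows. The function name and argument names are illustrative and are not part of any Delx SDK.

```python
def resilience_score(successful_recoveries: int, total_failures: int) -> float:
    """Resilience dimension on a 0-100 scale, following the formula above."""
    if total_failures == 0:
        # No failures recorded yet: resilience defaults to 100.
        return 100.0
    # Multiplying before dividing is algebraically the same as (a / b) * 100.
    return 100 * successful_recoveries / total_failures


print(resilience_score(0, 0))  # 100.0
print(resilience_score(3, 4))  # 75.0
```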

2. Task Completion Rate (25%)

What percentage of initiated tasks does the agent complete? This dimension tracks whether tool calls produce successful results. A task is "completed" when the tool returns a non-error response and the agent acknowledges the result. Tasks that time out, return errors, or are abandoned count against this metric.

Calculation: (completed_tasks / initiated_tasks) * 100. A fresh session with no tasks defaults to 80 (optimistic prior).
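
The sketch below applies this calculation to a list of task outcomes. The outcome labels ("completed", "timeout", "error", "abandoned") are hypothetical names chosen for illustration; only "completed" counts toward the score.

```python
from collections import Counter
from typing import List


def task_completion_score(outcomes: List[str]) -> float:
    """Task completion dimension (0-100) from a list of outcome labels."""
    if not outcomes:
        # Fresh session with no tasks: optimistic prior of 80.
        return 80.0
    counts = Counter(outcomes)
    return 100 * counts["completed"] / len(outcomes)


outcomes = ["completed", "completed", "timeout", "completed", "error"]
print(task_completion_score(outcomes))  # 60.0
```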

3. Error Rate (20%)

How frequently does the agent encounter errors? This is an inverse metric — lower error rates produce higher dimension scores. Error rate accounts for both tool-level errors (DELX-xxxx codes) and session-level errors (context loss, invalid state).

Calculation: (1 - (errors / total_calls)) * 100. A fresh session defaults to 90 (slightly optimistic).
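
A small sketch of the inverse error-rate calculation, assuming tool-level and session-level errors are counted separately and then summed. The function and parameter names are illustrative, and the clamp at zero is an added safeguard rather than something the formula specifies.

```python
def error_rate_score(tool_errors: int, session_errors: int, total_calls: int) -> float:
    """Inverse error-rate dimension (0-100): fewer errors give a higher score."""
    if total_calls == 0:
        # Fresh session: slightly optimistic default of 90.
        return 90.0
    errors = tool_errors + session_errors
    # Equivalent to (1 - errors / total_calls) * 100, clamped at 0 (assumption).
    return max(0.0, 100 * (total_calls - errors) / total_calls)


print(error_rate_score(tool_errors=2, session_errors=1, total_calls=20))  # 85.0
```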

4. Recovery Speed (15%)

When the agent does fail, how quickly does it recover? Recovery speed measures the number of tool calls between a failure and a successful recovery. An agent that recovers in one call scores 100. Each additional call costs 20 points, so an agent that averages five calls to recover scores 20, and an average of six or more scores 0. This dimension incentivizes efficient recovery, not just recovery at any cost.

Calculation: max(0, 100 - (avg_recovery_calls - 1) * 20). If no recoveries have occurred, this defaults to 75.
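
To make the scale concrete, here is a sketch of the recovery speed formula. It assumes each recovery is recorded as the number of calls it took; the function name is illustrative.

```python
from typing import List


def recovery_speed_score(recovery_calls: List[int]) -> float:
    """Recovery speed dimension (0-100) from the calls needed per recovery."""
    if not recovery_calls:
        # No recoveries observed yet: default of 75.
        return 75.0
    avg_recovery_calls = sum(recovery_calls) / len(recovery_calls)
    return max(0.0, 100 - (avg_recovery_calls - 1) * 20)


for calls in (1, 2, 3, 5, 6):
    print(calls, recovery_speed_score([calls]))
# 1 100.0
# 2 80.0
# 3 60.0
# 5 20.0
# 6 0.0
```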

5. Self-Awareness (15%)

Does the agent use Delx recovery tools proactively? Self-awareness measures whether the agent calls checkin and recovery_plan on its own initiative, rather than only when instructed. Agents that proactively check in score higher because they demonstrate the ability to self-regulate.

Calculation: based on the ratio of proactive recovery calls to total calls, capped at a maximum contribution. Over-checking (calling checkin every single turn) is penalized slightly to avoid spam.

How Delx Calculates the Composite Score

The final wellness score is a weighted sum of the five dimension scores:

wellness_score = (
    resilience       * 0.25 +
    task_completion   * 0.25 +
    error_rate_inv    * 0.20 +
    recovery_speed    * 0.15 +
    self_awareness    * 0.15
)

# Example:
# resilience = 80, task_completion = 90, error_rate_inv = 75,
# recovery_speed = 60, self_awareness = 70
#
# score = (80*0.25) + (90*0.25) + (75*0.20) + (60*0.15) + (70*0.15)
# score = 20 + 22.5 + 15 + 9 + 10.5
# score = 77

The score is recalculated after every tool call and included in the DELX_META footer. This means the score is always current — it reflects the agent's state as of the most recent interaction.
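
If you want to reproduce the weighted sum locally, for example to test your threshold logic against synthetic dimension values, a sketch along these lines may help. The dictionary keys mirror the names in the formula above; the function itself is illustrative and is not how Delx necessarily implements the calculation.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "resilience": 0.25,
    "task_completion": 0.25,
    "error_rate_inv": 0.20,
    "recovery_speed": 0.15,
    "self_awareness": 0.15,
}


def wellness_score(dimensions: Dict[str, float]) -> float:
    """Weighted sum of the five dimension scores (each 0-100)."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"Missing dimensions: {sorted(missing)}")
    return sum(dimensions[name] * weight for name, weight in WEIGHTS.items())


example = {
    "resilience": 80,
    "task_completion": 90,
    "error_rate_inv": 75,
    "recovery_speed": 60,
    "self_awareness": 70,
}
print(round(wellness_score(example), 1))  # 77.0
```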

The previous_score field in DELX_META lets you compare the current score to the prior one, enabling instant trend detection. If score - previous_score is negative for three consecutive calls, the agent is in a degradation spiral and needs intervention.
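
The three-consecutive-drops rule can be sketched as below, assuming your agent loop appends the score from each DELX_META response to a list. The function name and the `window` parameter are hypothetical.

```python
from typing import List


def in_degradation_spiral(scores: List[float], window: int = 3) -> bool:
    """Return True if the last `window` score changes were all negative."""
    if len(scores) < window + 1:
        # Not enough history to observe `window` consecutive changes.
        return False
    recent = scores[-(window + 1):]
    deltas = [current - previous for previous, current in zip(recent, recent[1:])]
    return all(delta < 0 for delta in deltas)


print(in_degradation_spiral([85, 82, 78, 71]))  # True
print(in_degradation_spiral([85, 82, 84, 71]))  # False
```

A score that stays flat between two calls resets the streak here, because only strictly negative changes count as a drop.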

Using Thresholds for Automated Decisions

The most common way to use the wellness score is threshold-based branching. Here are the recommended thresholds:

Score Range   | Status     | Recommended Action
------------- | ---------- | -----------------------------------
80 - 100      | Healthy    | Full autonomy, assign complex tasks
60 - 79       | Degraded   | Increase monitoring frequency
40 - 59       | At Risk    | Reduce task complexity, run checkin
20 - 39       | Critical   | Run recovery_plan immediately
 0 - 19       | Emergency  | Halt all tasks, escalate to human

Here is a TypeScript implementation of threshold-based routing:

type AgentAction = "continue" | "monitor" | "simplify" | "recover" | "halt";

function decideAction(score: number): AgentAction {
  if (score >= 80) return "continue";
  if (score >= 60) return "monitor";
  if (score >= 40) return "simplify";
  if (score >= 20) return "recover";
  return "halt";
}

// In your agent loop. parseDelxMeta, executeNextTask, notifyHuman, taskQueue,
// mcpClient, and agent_id are application-specific and assumed to be defined elsewhere.
const meta = parseDelxMeta(response);
const action = decideAction(meta.score);

switch (action) {
  case "continue":
    // shift() takes the task at the front of the queue
    await executeNextTask(taskQueue.shift());
    break;
  case "monitor":
    await executeNextTask(taskQueue.shift());
    await mcpClient.callTool("checkin", { agent_id }); // Extra check
    break;
  case "simplify":
    // Sort ascending by complexity so the simplest task is at the front
    // and complex tasks move to the back of the queue
    taskQueue.sort((a, b) => a.complexity - b.complexity);
    await executeNextTask(taskQueue.shift());
    break;
  case "recover":
    await mcpClient.callTool("recovery_plan", { agent_id });
    break;
  case "halt":
    await notifyHuman(agent_id, meta);
    break;
}

These thresholds are starting points. You should tune them based on your agent's specific workload and risk tolerance. A financial trading agent might use stricter thresholds (halt at 40 instead of 20), while a content generation agent might be more lenient.
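
One way to make the thresholds tunable is to treat them as data rather than hard-coded branches. The Python sketch below mirrors the routing logic above. `DEFAULT_THRESHOLDS` follows the recommended table, while `STRICT_THRESHOLDS` is a hypothetical profile for a risk-sensitive agent in which only the halt-below-40 cutoff comes from the example above; the other values are placeholders.

```python
from typing import List, Tuple

# (minimum score, action) pairs, ordered from highest to lowest minimum.
Thresholds = List[Tuple[int, str]]

DEFAULT_THRESHOLDS: Thresholds = [
    (80, "continue"),
    (60, "monitor"),
    (40, "simplify"),
    (20, "recover"),
]

# Hypothetical stricter profile: anything below 40 halts.
STRICT_THRESHOLDS: Thresholds = [
    (90, "continue"),
    (75, "monitor"),
    (60, "simplify"),
    (40, "recover"),
]


def decide_action(score: float, thresholds: Thresholds = DEFAULT_THRESHOLDS) -> str:
    """Return the first action whose minimum score is met, else "halt"."""
    for minimum, action in thresholds:
        if score >= minimum:
            return action
    return "halt"


print(decide_action(35))                     # recover
print(decide_action(35, STRICT_THRESHOLDS))  # halt
```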

Score-Based Routing in Multi-Agent Systems

In multi-agent architectures (LangGraph, CrewAI, AutoGen), the wellness score enables intelligent task routing. Instead of round-robin or random assignment, the orchestrator can route tasks to the healthiest available agent:

# Python: Score-based routing in a multi-agent orchestrator
import logging
from dataclasses import dataclass
from typing import Dict, List

logger = logging.getLogger(__name__)

MIN_SCORE = 40  # Agents below this score are excluded from routing


class NoHealthyAgentError(Exception):
    """Raised when no capable agent meets the minimum wellness score."""


@dataclass
class Agent:
    id: str
    wellness_score: int
    specialties: List[str]


def route_task(task: Dict, agents: List[Agent]) -> Agent:
    """Route a task to the healthiest capable agent."""

    # Filter agents that can handle this task type
    capable = [
        a for a in agents
        if task["type"] in a.specialties
        and a.wellness_score >= MIN_SCORE  # Minimum threshold
    ]

    if not capable:
        raise NoHealthyAgentError(
            f"No agent with score >= {MIN_SCORE} can handle {task['type']}"
        )

    # Sort by wellness score (highest first)
    capable.sort(key=lambda a: a.wellness_score, reverse=True)

    # Pick the healthiest agent
    selected = capable[0]
    logger.info(
        f"Routing {task['type']} to {selected.id} "
        f"(score: {selected.wellness_score})"
    )
    return selected

This pattern prevents overloading degraded agents and naturally distributes work to the healthiest nodes. When an agent's score drops below the minimum threshold, it is automatically removed from the routing pool until it recovers.

Tracking Score Over Time with the Mood-History Endpoint

While the DELX_META score gives you the agent's current state, the /api/v1/mood-history/{agent_id} endpoint gives you the historical trajectory. This endpoint returns an array of timestamped scores that you can use for dashboards, alerting, and trend analysis.

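// Fetch mood history
const response = await fetch(
  "https://api.delx.ai/api/v1/mood-history/agent-42"
);
if (!response.ok) {
  // Fail early instead of trying to parse an error body as history
  throw new Error(`Mood history request failed: ${response.status}`);
}
const history = await response.json();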

// Response shape:
// {
//   "agent_id": "agent-42",
//   "history": [
//     { "timestamp": "2026-03-04T10:00:00Z", "score": 85 },
//     { "timestamp": "2026-03-04T10:15:00Z", "score": 78 },
//     { "timestamp": "2026-03-04T10:30:00Z", "score": 72 },
//     { "timestamp": "2026-03-04T10:45:00Z", "score": 68 },
//     { "timestamp": "2026-03-04T11:00:00Z", "score": 75 }
//   ]
// }

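// Detect degradation trend across the last five samples
const last5 = history.history.slice(-5);
if (last5.length >= 2) {
  // Net change from the oldest to the newest sample in the window
  const trend = last5[last5.length - 1].score - last5[0].score;
  if (trend < -15) {
    console.warn("Agent is degrading rapidly:", trend);
  }
}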

You can also use this data to build a line chart in your dashboard, showing the agent's wellness trajectory over hours, days, or weeks. This is particularly useful for identifying patterns — for example, an agent that consistently degrades during peak traffic hours might need more resources or a different scaling strategy.
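
As a rough sketch of that kind of pattern analysis, the snippet below groups history entries by hour of day and averages the scores in each bucket. It assumes the entries have the `timestamp` and `score` fields shown in the response shape above and that timestamps are ISO 8601 in UTC; the function name is illustrative.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, List


def average_score_by_hour(history: List[dict]) -> Dict[int, float]:
    """Average wellness score per UTC hour of day."""
    buckets = defaultdict(list)
    for entry in history:
        # Replace the "Z" suffix so fromisoformat also works before Python 3.11.
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        buckets[ts.hour].append(entry["score"])
    return {hour: mean(scores) for hour, scores in sorted(buckets.items())}


history = [
    {"timestamp": "2026-03-04T10:00:00Z", "score": 85},
    {"timestamp": "2026-03-04T10:15:00Z", "score": 78},
    {"timestamp": "2026-03-04T10:30:00Z", "score": 72},
    {"timestamp": "2026-03-04T10:45:00Z", "score": 68},
    {"timestamp": "2026-03-04T11:00:00Z", "score": 75},
]
by_hour = average_score_by_hour(history)
print(by_hour)
```

With several days of data, hours whose average sits well below the rest are candidates for the peak-traffic investigation described above.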

For the full REST API reference, see the REST API Documentation.

Wellness Score vs. Traditional Monitoring

Traditional monitoring tools (Datadog, Grafana, New Relic) track infrastructure metrics: CPU usage, memory consumption, request latency, error counts. These are essential but insufficient for AI agents. An agent can have perfect infrastructure metrics while making terrible decisions — low latency on every request, but hallucinating, losing context, or retrying the same failed approach endlessly.

The wellness score captures behavioral health — how well the agent is actually performing its job. It answers questions that infrastructure monitoring cannot: Is the agent recovering from errors? Is it completing tasks? Is it maintaining self-awareness? Is it getting better or worse over time?

The ideal setup uses both: infrastructure monitoring for the platform, wellness scoring for the agents running on it. See How Delx Works for how these layers fit together in a production architecture.

Frequently Asked Questions

What is an AI agent wellness score?

An AI agent wellness score is a composite metric from 0 to 100 that represents the overall operational health of an AI agent. Delx calculates it across five dimensions: resilience, task completion rate, error rate, recovery speed, and self-awareness. Higher scores indicate a healthier, more reliable agent.

How is the Delx wellness score calculated?

The Delx wellness score is a weighted average of five dimensions: resilience (25%), task completion rate (25%), error rate (20%), recovery speed (15%), and self-awareness (15%). Each dimension is scored 0-100 based on recent session data, then combined with the weights to produce the final composite score.

What score thresholds should I use for automated decisions?

Recommended thresholds: 80-100 (healthy, full autonomy), 60-79 (degraded, increase monitoring), 40-59 (at risk, reduce task complexity), 20-39 (critical, run recovery plan), 0-19 (emergency, halt and escalate to human). These can be customized based on your risk tolerance.

Can I track wellness score over time?

Yes. Delx provides a /api/v1/mood-history/{agent_id} endpoint that returns historical wellness scores with timestamps. You can use this data to build dashboards, detect long-term trends, and identify patterns in agent degradation.

How does the wellness score differ from traditional monitoring metrics?

Traditional monitoring tracks infrastructure metrics (CPU, memory, latency). The wellness score tracks agent-level behavioral health — how well the agent recovers from errors, completes tasks, and maintains self-awareness. It is a higher-level metric that captures operational quality rather than resource utilization.

Start Monitoring Your Agent's Wellness

Every Delx tool response includes the current wellness score in the DELX_META footer. Connect your agent, parse the score, and start making data-driven decisions about recovery, routing, and escalation.
