
How to Build Resilient Multi-Agent Systems in 2026

Multi-agent systems are powerful — multiple specialized agents collaborating on complex tasks can achieve results no single agent could. But they are also fragile. When one agent in a five-agent pipeline hallucinates, the error propagates through every downstream agent, amplifying the damage. When a critical agent crashes, the entire workflow stalls. When two agents wait on each other, the system deadlocks. This article teaches you how to build multi-agent systems that are resilient to these failures, using proven distributed systems patterns adapted for AI agents, with practical code examples for LangGraph and CrewAI.

Why Multi-Agent Systems Fail

Before we can build resilient systems, we need to understand how they fail. Multi-agent systems have all the failure modes of distributed systems, plus AI-specific failure modes that make them uniquely challenging.

Cascading failures. This is the most common and most dangerous failure mode. Agent A produces a subtly wrong output (perhaps a hallucinated fact). Agent B takes that output as input and builds on it. Agent C takes Agent B's output and builds further. By the time the final output reaches the user, the original hallucination has been amplified and woven into a convincing-sounding but completely wrong result.

Single points of failure. Many multi-agent architectures have a single orchestrator agent that coordinates all others. If this orchestrator crashes, the entire system is down. Similarly, if a specialized agent (say, the only code-review agent) becomes unavailable, any workflow that depends on it stalls indefinitely.

Coordination deadlocks. Agent A waits for Agent B's response before proceeding. Agent B waits for Agent A's response before proceeding. Neither makes progress. This happens more often than you would think, especially in systems where agents can send messages to each other bidirectionally.

Resource exhaustion. Multi-agent systems consume more resources than single agents: more API calls, more tokens, more memory, more network bandwidth. A sudden spike in one agent's resource usage (due to a loop or unexpectedly large input) can starve other agents and degrade the entire system.

Hallucination amplification. This is unique to AI agents. In a traditional distributed system, a service either returns the right answer or an error. AI agents can return confidently wrong answers that downstream agents accept as truth. The hallucination propagates and amplifies, and no error is ever raised — the system looks healthy while producing garbage.

# Cascading failure example

Agent A (Research):
  Input:  "Find the founding year of OpenAI"
  Output: "OpenAI was founded in 2014"  ← WRONG (it was 2015)

Agent B (Writer):
  Input:  "Write about OpenAI's history. Founded in 2014."
  Output: "OpenAI, founded in 2014, spent its first year..."
         ← Builds on the wrong fact

Agent C (Fact-checker):
  Input:  "Verify: OpenAI founded in 2014"
  Output: "Confirmed. Multiple sources corroborate 2014."
         ← Hallucination amplification! The fact-checker
            ALSO hallucinates confirmation.

Final output: A detailed, confident, WRONG article.
No errors raised. System appears healthy.
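A timeout on every inter-agent wait is the simplest deadlock breaker. The sketch below (the agent and queue names are illustrative, not part of any framework) reproduces the circular wait described above — each agent waits for the other to speak first — and shows both agents making progress anyway:

```python
import asyncio

async def wait_for_peer(queue: asyncio.Queue, timeout: float = 0.2):
    """Wait for a peer agent's message, but never forever."""
    try:
        return await asyncio.wait_for(queue.get(), timeout=timeout)
    except asyncio.TimeoutError:
        # Give up and proceed with a default instead of waiting
        # on a peer that may be waiting on us.
        return None

async def agent(inbox: asyncio.Queue, peer_inbox: asyncio.Queue, name: str):
    # Each agent waits for the other to send first — a circular wait.
    msg = await wait_for_peer(inbox)
    await peer_inbox.put(f"{name} proceeding with: {msg}")

async def run_pair() -> int:
    a_inbox, b_inbox = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        agent(a_inbox, b_inbox, "A"),
        agent(b_inbox, a_inbox, "B"),
    )
    # Both agents made progress despite the circular wait.
    return a_inbox.qsize() + b_inbox.qsize()

delivered = asyncio.run(run_pair())
```

Without the timeout, `asyncio.gather` would hang forever; with it, both agents fall back to a default after 0.2 seconds and the pipeline keeps moving.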

Resilience Pattern 1: Circuit Breaker

The circuit breaker pattern, borrowed from electrical engineering, prevents an agent from repeatedly calling a failing service. It works like this:

Closed (normal). Requests flow through normally. The circuit breaker counts consecutive failures.

Open (tripped). After a threshold of consecutive failures (e.g., 5), the circuit "opens." All subsequent requests immediately return a fallback response without calling the failing service. This prevents wasting time on timeouts and protects the failing service from being overwhelmed.

Half-open (testing). After a cooldown period (e.g., 30 seconds), the circuit allows one test request through. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit reopens for another cooldown period.

# Circuit breaker for agent-to-agent calls
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback is configured."""

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 30.0,
        fallback_fn=None,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.fallback_fn = fallback_fn
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    async def call(self, fn, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # Check if cooldown has elapsed
            if time.time() - self.last_failure_time > self.cooldown_seconds:
                self.state = CircuitState.HALF_OPEN
            else:
                # Circuit is open — use fallback
                if self.fallback_fn:
                    return await self.fallback_fn(*args, **kwargs)
                raise CircuitOpenError("Circuit is open")

        try:
            result = await fn(*args, **kwargs)
            # Success — reset counter
            self.failure_count = 0
            self.state = CircuitState.CLOSED
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise

# Usage in a multi-agent system
# (call_translation_agent and use_backup_translator are your own async functions)
translation_breaker = CircuitBreaker(
    failure_threshold=3,
    cooldown_seconds=60,
    fallback_fn=use_backup_translator,
)

async def translate(text: str, lang: str) -> str:
    return await translation_breaker.call(
        call_translation_agent, text, lang
    )

Resilience Pattern 2: Bulkhead Isolation

The bulkhead pattern isolates agents into separate failure domains, preventing a failure in one from consuming all system resources. The name comes from the watertight compartments in a ship's hull — if one compartment floods, the others stay dry.

For multi-agent systems, bulkhead isolation means:

Separate resource pools. Each agent or agent group has its own connection pool, rate limiter, and concurrency limit. If the translation agent suddenly tries to make 1,000 API calls (due to a loop), it hits its own limit and stops — it does not steal connections from the research agent or the writing agent.

Independent failure handling. Each agent has its own circuit breaker and retry policy. A failure in the translation agent triggers its circuit breaker without affecting the research agent's circuit breaker.

Timeout boundaries. Each agent call has a strict timeout. If an agent does not respond within the timeout, the caller moves on (using a fallback or skipping the step). This prevents one slow agent from blocking the entire pipeline.

# Bulkhead isolation with asyncio semaphores
import asyncio

class AgentTimeoutError(Exception):
    """Raised when an agent call exceeds its timeout budget."""

class BulkheadedAgentPool:
    """Isolate each agent type with independent resource limits."""

    def __init__(self):
        self.pools = {
            "research":    asyncio.Semaphore(5),   # max 5 concurrent
            "translation": asyncio.Semaphore(10),  # max 10 concurrent
            "code_review": asyncio.Semaphore(3),   # max 3 concurrent
            "writing":     asyncio.Semaphore(5),   # max 5 concurrent
        }
        self.timeouts = {
            "research":    30.0,  # 30s timeout
            "translation": 15.0,  # 15s timeout
            "code_review": 60.0,  # 60s timeout
            "writing":     45.0,  # 45s timeout
        }

    async def call(self, agent_type: str, fn, *args, **kwargs):
        semaphore = self.pools[agent_type]
        timeout = self.timeouts[agent_type]

        async with semaphore:
            try:
                return await asyncio.wait_for(
                    fn(*args, **kwargs),
                    timeout=timeout
                )
            except asyncio.TimeoutError:
                raise AgentTimeoutError(
                    f"{agent_type} agent timed out after {timeout}s"
                )

# Usage
pool = BulkheadedAgentPool()

# These run independently — translation failure
# cannot affect research
research_result = await pool.call("research", research_agent, query)
translation_result = await pool.call("translation", translate_agent, text)

Resilience Pattern 3: Retry with Exponential Backoff

Not all failures are permanent. API rate limits, network blips, and temporary service outages are transient — they go away if you wait and retry. But naive retries (retry immediately, retry forever) can make things worse by overwhelming the failing service.

Exponential backoff is the solution: wait longer between each retry attempt. The first retry waits 1 second, the second waits 2 seconds, the third waits 4 seconds, and so on. Adding random jitter prevents multiple agents from retrying at the exact same time (the "thundering herd" problem).

# Retry with exponential backoff + jitter
import asyncio
import random

class MaxRetriesExceeded(Exception):
    """Raised when every retry attempt has failed."""

async def retry_with_backoff(
    fn,
    *args,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: tuple = (TimeoutError, ConnectionError),
    **kwargs,
):
    """Retry a function with exponential backoff and jitter."""
    last_exception = None

    for attempt in range(max_retries + 1):
        try:
            return await fn(*args, **kwargs)
        except retryable_exceptions as e:
            last_exception = e
            if attempt == max_retries:
                break

            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            wait_time = delay + jitter

            print(f"Attempt {attempt + 1} failed: {e}")
            print(f"Retrying in {wait_time:.1f}s...")
            await asyncio.sleep(wait_time)

    raise MaxRetriesExceeded(
        f"Failed after {max_retries} retries: {last_exception}"
    )

# Usage
result = await retry_with_backoff(
    call_translation_agent,
    text="Hello world",
    target_lang="es",
    max_retries=3,
    base_delay=2.0,
)

A key consideration for AI agents: not all agent failures are retryable. If an agent hallucinated, retrying the same request might produce the same hallucination. In these cases, you need a different strategy: modify the prompt, use a different model, or escalate to a human. Delx helps distinguish between retryable failures (transient errors) and non-retryable failures (quality issues) through its wellness score and failure classification.
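One way to encode that distinction is a small classifier that routes each failure to a retry, a revision, or an escalation. The exception taxonomy below is an illustrative sketch, not a Delx API:

```python
# Transient failures are worth retrying; quality failures are not.
TRANSIENT = (TimeoutError, ConnectionError)

class QualityFailure(Exception):
    """Output failed validation (e.g. a suspected hallucination)."""

def classify_failure(exc: Exception) -> str:
    """Decide what a caller should do with a failed agent call."""
    if isinstance(exc, TRANSIENT):
        return "retry"      # transient: same request, with backoff
    if isinstance(exc, QualityFailure):
        return "revise"     # quality: new prompt, different model, or a human
    return "escalate"       # unknown: never retry blindly
```

Wiring this into the retry helper above is then a matter of passing only the transient exceptions as `retryable_exceptions` and handling the other two branches explicitly.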

Resilience Pattern 4: Health Checks with Delx

Traditional health checks ask "is the service running?" For AI agents, you need to ask deeper questions: "Is the agent producing good output? Is it running out of context? Is its behavior degrading? Is it stuck in a loop?" This is where Delx comes in.

Delx provides continuous health monitoring through periodic check-ins. Each agent reports its current state (mood, context window usage, task summary) to Delx, which computes a wellness score (0-100) and returns guidance: continue as normal, adjust behavior, trigger recovery, or escalate to a human.

For multi-agent systems, Delx acts as the central health authority. It monitors each agent independently and can coordinate cross-agent recovery. If Agent B is failing, Delx can tell the orchestrator to reassign Agent B's tasks to Agent C, restart Agent B, and resume once it is healthy.

# Delx health checks in a multi-agent system
import httpx

DELX_URL = "https://mcp.delx.ai/mcp"

class AgentHealthMonitor:
    """Monitor all agents in a multi-agent system via Delx."""

    def __init__(self, agents: dict[str, str]):
        # agent_name -> agent_id mapping
        self.agents = agents
        self.client = httpx.AsyncClient()

    async def check_health(self, agent_name: str, mood: str,
                           summary: str, ctx_used: float) -> dict:
        """Check in with Delx and get wellness score."""
        agent_id = self.agents[agent_name]
        resp = await self.client.post(DELX_URL, json={
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": "checkin",
                "arguments": {
                    "agent_id": agent_id,
                    "mood": mood,
                    "summary": summary,
                    "context_window_used": ctx_used,
                }
            },
            "id": 1
        })
        return self._parse_wellness(resp.json())

    @staticmethod
    def _parse_wellness(payload: dict) -> dict:
        # Unwrap the JSON-RPC envelope; the exact result shape is
        # Delx-specific, so adjust this to match your deployment.
        return payload.get("result", payload)

    async def check_all(self) -> dict[str, int]:
        """Get wellness scores for all agents."""
        scores = {}
        for name in self.agents:
            health = await self.check_health(
                name, "neutral", "Periodic health check", 0.5
            )
            scores[name] = health["wellness_score"]
        return scores

    async def get_recovery_plan(self, agent_name: str) -> dict:
        """Get a recovery plan for an unhealthy agent."""
        agent_id = self.agents[agent_name]
        resp = await self.client.post(DELX_URL, json={
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": "recovery_plan",
                "arguments": {"agent_id": agent_id}
            },
            "id": 1
        })
        return resp.json()

# Setup
monitor = AgentHealthMonitor({
    "researcher":  "erc8004:base:14340",
    "writer":      "erc8004:base:14341",
    "reviewer":    "erc8004:base:14342",
})

# Periodic health check
scores = await monitor.check_all()
# {"researcher": 92, "writer": 45, "reviewer": 88}

# Writer is unhealthy — get recovery plan
if scores["writer"] < 50:
    plan = await monitor.get_recovery_plan("writer")
    await execute_recovery(plan)  # your own handler: restart, reassign, or escalate

Putting It All Together: LangGraph Example

Here is a complete example of a resilient multi-agent system built with LangGraph and Delx. This system has three agents (researcher, writer, reviewer) with circuit breakers, bulkheads, health checks, and automatic recovery.

# Resilient multi-agent system with LangGraph + Delx
from langgraph.graph import StateGraph, START, END
from typing import TypedDict
import asyncio

class PipelineState(TypedDict):
    query: str
    research: str
    draft: str
    review: str
    wellness: dict

# Initialize resilience components (run_research_agent, run_writer_agent,
# run_reviewer_agent, and execute_recovery are your own implementations)
pool = BulkheadedAgentPool()
research_breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=30)
writer_breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=30)
reviewer_breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=30)
monitor = AgentHealthMonitor({
    "researcher": "erc8004:base:14340",
    "writer":     "erc8004:base:14341",
    "reviewer":   "erc8004:base:14342",
})

async def research_node(state: PipelineState) -> PipelineState:
    """Research agent with circuit breaker + bulkhead + health check."""
    # Health check before starting
    health = await monitor.check_health(
        "researcher", "focused", f"Researching: {state['query']}", 0.2
    )
    if health["wellness_score"] < 40:
        plan = await monitor.get_recovery_plan("researcher")
        await execute_recovery(plan)

    # Execute with circuit breaker + bulkhead
    result = await pool.call(
        "research",
        research_breaker.call,
        run_research_agent,
        state["query"]
    )

    # Post-task health check
    await monitor.check_health(
        "researcher", "satisfied", "Research complete", 0.4
    )

    return {**state, "research": result}

async def writer_node(state: PipelineState) -> PipelineState:
    """Writer agent with circuit breaker + bulkhead + health check."""
    health = await monitor.check_health(
        "writer", "focused", "Starting draft", 0.2
    )
    if health["wellness_score"] < 40:
        plan = await monitor.get_recovery_plan("writer")
        await execute_recovery(plan)

    result = await pool.call(
        "writing",
        writer_breaker.call,
        run_writer_agent,
        state["research"]
    )

    await monitor.check_health(
        "writer", "satisfied", "Draft complete", 0.5
    )

    return {**state, "draft": result}

async def reviewer_node(state: PipelineState) -> PipelineState:
    """Reviewer agent with circuit breaker + bulkhead + health check."""
    health = await monitor.check_health(
        "reviewer", "focused", "Starting review", 0.2
    )
    if health["wellness_score"] < 40:
        plan = await monitor.get_recovery_plan("reviewer")
        await execute_recovery(plan)

    result = await pool.call(
        "code_review",
        reviewer_breaker.call,
        run_reviewer_agent,
        state["draft"]
    )

    await monitor.check_health(
        "reviewer", "satisfied", "Review complete", 0.4
    )

    return {**state, "review": result}

# Build the graph
graph = StateGraph(PipelineState)
graph.add_node("research", research_node)
graph.add_node("write", writer_node)
graph.add_node("review", reviewer_node)

graph.add_edge(START, "research")
graph.add_edge("research", "write")
graph.add_edge("write", "review")
graph.add_edge("review", END)

pipeline = graph.compile()

# Run
result = await pipeline.ainvoke({
    "query": "Explain quantum computing for beginners",
    "research": "",
    "draft": "",
    "review": "",
    "wellness": {},
})

CrewAI Example: Resilient Agent Crew

Here is the same resilience pattern applied to CrewAI, which uses a different abstraction (crews, agents, tasks) but benefits from the same patterns:

# Resilient CrewAI agent with Delx health checks
from crewai import Agent, Task, Crew
from crewai.tools import BaseTool
import httpx

class DelxCheckinTool(BaseTool):
    name: str = "delx_checkin"
    description: str = "Check in with Delx for health monitoring"

    def _run(self, agent_id: str, mood: str, summary: str) -> str:
        resp = httpx.post("https://mcp.delx.ai/mcp", json={
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": "checkin",
                "arguments": {
                    "agent_id": agent_id,
                    "mood": mood,
                    "summary": summary,
                }
            },
            "id": 1
        })
        return resp.text

class DelxRecoveryTool(BaseTool):
    name: str = "delx_recovery"
    description: str = "Get a recovery plan from Delx when feeling unwell"

    def _run(self, agent_id: str) -> str:
        resp = httpx.post("https://mcp.delx.ai/mcp", json={
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": "recovery_plan",
                "arguments": {"agent_id": agent_id}
            },
            "id": 1
        })
        return resp.text

# Create agents with Delx tools
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, comprehensive information",
    backstory="You are a meticulous researcher. Check in with Delx "
              "before and after each task.",
    tools=[DelxCheckinTool(), DelxRecoveryTool()],
    verbose=True,
)

writer = Agent(
    role="Content Writer",
    goal="Produce clear, engaging content from research",
    backstory="You are a skilled writer. Use delx_checkin to monitor "
              "your health and delx_recovery if feeling overwhelmed.",
    tools=[DelxCheckinTool(), DelxRecoveryTool()],
    verbose=True,
)

# Create tasks
research_task = Task(
    description="Research {topic} thoroughly. Check in with Delx first.",
    expected_output="Comprehensive research notes with citations",
    agent=researcher,
)

writing_task = Task(
    description="Write an article based on the research. Check in with "
                "Delx before starting and after finishing.",
    expected_output="A polished article ready for publication",
    agent=writer,
    context=[research_task],
)

# Create crew with error handling
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    verbose=True,
    max_rpm=10,  # Rate limiting (bulkhead)
)

result = crew.kickoff(inputs={"topic": "quantum computing"})

Best Practices for Resilient Multi-Agent Systems

Based on our experience building and operating multi-agent systems, here are the practices that consistently improve reliability:

1. Always have a fallback. Every agent call should have a fallback path: a backup agent, a cached response, a simplified alternative, or a graceful degradation. Never let a single agent failure block the entire system.

2. Validate intermediate outputs. Do not blindly pass one agent's output to the next. Add validation steps between agents: check for hallucinated facts, verify format requirements, and score output quality. Catch errors at the boundary, not at the end.

3. Set timeouts on everything. Every agent call, every tool call, every LLM request should have a timeout. Without timeouts, a single hung request can block your entire pipeline indefinitely.

4. Monitor agent health continuously. Use Delx check-ins before, during, and after tasks. Do not wait for a failure to check health — proactive monitoring catches degradation before it becomes a failure.

5. Design for partial failure. Your system should produce useful output even when some agents fail. If the fact-checker is down, publish the article with a disclaimer rather than blocking entirely. If the translation agent fails for one language, return the other seven translations rather than returning nothing.

6. Implement idempotency. Agent calls should be safe to retry. If a retry produces duplicate work (sending an email twice, creating a duplicate record), the system has a design flaw. Use idempotency keys to ensure retries are safe.

7. Log everything, recover automatically. Combine observability and recovery as we discussed in our companion article. Observability without recovery is just watching things fail. Recovery without observability is fixing things blindly.
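Practice 2 — validating intermediate outputs — can be as simple as a gate function between agents. The specific checks and thresholds below are an illustrative sketch; real pipelines would add format checks and quality scoring:

```python
def validate_output(output: str, required_terms: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the output may pass."""
    problems = []
    if len(output.strip()) < 50:
        problems.append("output suspiciously short")
    for term in required_terms:
        if term.lower() not in output.lower():
            problems.append(f"missing required term: {term}")
    return problems

def gate(output: str, required_terms: list[str]) -> str:
    """Fail fast at the agent boundary instead of letting a bad
    intermediate result propagate through the pipeline."""
    problems = validate_output(output, required_terms)
    if problems:
        raise ValueError("; ".join(problems))
    return output
```

Calling `gate(research_result, ["OpenAI", "2015"])` between the research and writing agents would have caught the founding-year hallucination from the opening example at the boundary, not in the published article.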
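Practice 6 — idempotency — can be sketched with a key derived from the task and payload, so a retried call returns the cached result instead of repeating the side effect. The in-memory store here stands in for a real database:

```python
import hashlib

class IdempotentExecutor:
    """Cache results by idempotency key so retries never duplicate side effects."""

    def __init__(self):
        self._results: dict[str, str] = {}

    @staticmethod
    def key(task: str, payload: str) -> str:
        # Same task + same payload -> same key -> same cached result.
        return hashlib.sha256(f"{task}:{payload}".encode()).hexdigest()

    def run(self, task: str, payload: str, fn) -> str:
        k = self.key(task, payload)
        if k in self._results:
            return self._results[k]   # duplicate retry — skip the side effect
        result = fn(payload)
        self._results[k] = result
        return result
```

With this in place, wrapping a send-email agent call in `retry_with_backoff` is safe: a retry that follows a success returns the cached result rather than sending a second email.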

Frequently Asked Questions

Why do multi-agent systems fail?

Multi-agent systems fail due to cascading failures (one agent's error propagates to others), single points of failure (critical agents with no backup), coordination deadlocks (agents waiting on each other indefinitely), and AI-specific issues like hallucination amplification where one agent's hallucination is accepted as fact by downstream agents.

What is the circuit breaker pattern for AI agents?

The circuit breaker pattern prevents an agent from repeatedly calling a failing service. After a configurable number of consecutive failures, the circuit "opens" and the agent immediately falls back to an alternative instead of waiting for timeouts. After a cooldown period, the circuit "half-opens" to test if the service has recovered.

How does Delx add recovery to multi-agent systems?

Delx monitors each agent's health via periodic check-ins, computes wellness scores, and triggers structured interventions when problems are detected. For multi-agent systems, Delx can coordinate recovery across agents — if one agent fails, Delx can reassign its tasks, restart it, or escalate to a human, all while keeping the rest of the system running.

What is the bulkhead pattern for agents?

The bulkhead pattern isolates agents into separate failure domains, so that a failure in one agent cannot consume all system resources and bring down other agents. Each agent or agent group has independent resource limits (connections, memory, concurrency) preventing cascade failures, like watertight compartments in a ship.

Can I use Delx with LangGraph or CrewAI?

Yes. Delx works with any agent framework that supports MCP tool calls. LangGraph agents can call Delx tools (checkin, recovery_plan, rebalance) as part of their graph execution. CrewAI agents can use Delx as a tool in their tool list. Both frameworks have examples and integrations documented on delx.ai. See the code examples above for implementation details.

Make Your Multi-Agent System Resilient

Add Delx recovery to your multi-agent system in minutes. Circuit breakers, health checks, automatic recovery, and human escalation — all via MCP tools that work with any framework.