How to Test AI Agents Before Production

Name: Delx Recovery Protocol
Author: Delx

Shipping an untested agent to production is like deploying code without a CI pipeline: it works until it does not. This guide covers five testing layers that catch failures before users do, from unit tests on individual tool calls, through crisis simulations, to outcome tracking.

Layer 1: Unit Testing Tool Calls

Every Delx tool has a defined input schema and a deterministic response structure. Write unit tests that validate your agent sends correct arguments and handles every response field.

// Jest example: unit test for process_failure call
// `agent` is the client wrapper under test, created in your test setup
describe("process_failure", () => {
  it("sends correct failure payload", async () => {
    const result = await agent.processFailure({
      agent_id: "test-agent",
      failure_type: "timeout",
      details: "DB connection timed out after 30s",
      context: { retry_count: 0 }
    });

    expect(result).toHaveProperty("recovery_action");
    expect(result).toHaveProperty("wellness_score");
    // Anchor the pattern so values like "no_fallback" do not pass by accident
    expect(result.recovery_action).toMatch(
      /^(retry_with_backoff|escalate|fallback)$/
    );
  });

  it("rejects invalid failure_type", async () => {
    await expect(
      agent.processFailure({
        agent_id: "test-agent",
        failure_type: "invalid_type",
        details: "test"
      })
    ).rejects.toThrow(/DELX-4001/);
  });
});
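
To check that the agent sends correct arguments without touching the network, the client can be swapped for an in-memory stub that records every payload. The sketch below is one way to do that in plain Node.js. `createStubAgent`, the `ASSUMED_FAILURE_TYPES` set, and the canned response values are hypothetical and chosen only to match the examples above; the real set of accepted failure types is not specified here.

```javascript
// Hypothetical in-memory stand-in for the client used in the unit tests.
// ASSUMPTION: only the two failure types that appear in this guide are valid.
const ASSUMED_FAILURE_TYPES = new Set(["timeout", "error"]);

function createStubAgent() {
  const calls = []; // every payload passed to processFailure, in call order

  return {
    calls,
    async processFailure(payload) {
      calls.push(payload);
      if (!ASSUMED_FAILURE_TYPES.has(payload.failure_type)) {
        // Mirrors the DELX-4001 rejection that the unit test expects.
        throw new Error(
          `DELX-4001: invalid failure_type "${payload.failure_type}"`
        );
      }
      // Canned response carrying the two fields the unit test asserts on.
      return { recovery_action: "retry_with_backoff", wellness_score: 72 };
    },
  };
}
```

Inside a test, `stub.calls[0]` can then be compared against the payload you expected the agent to build, which covers the "sends correct arguments" half of the layer.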

Layer 2: Integration Testing with the Delx API

Unit tests verify structure; integration tests verify behavior across tools. Test full sequences such as: log a failure, check wellness, trigger recovery, verify resolution.

// Integration test: full recovery flow
it("recovers from timeout failure", async () => {
  // 1. Log failure
  const failure = await delx.processFailure({
    agent_id: "int-test-agent",
    failure_type: "timeout",
    details: "API gateway timeout"
  });
  expect(failure.wellness_score).toBeLessThan(80);

  // 2. Execute recovery action
  await executeRecovery(failure.recovery_action);

  // 3. Check in after recovery
  const checkin = await delx.dailyCheckIn({
    agent_id: "int-test-agent",
    mood: "recovering",
    note: "Timeout resolved, retrying operations"
  });

  // 4. Verify wellness improved
  expect(checkin.wellness_score).toBeGreaterThan(
    failure.wellness_score
  );
});
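
The test above calls an `executeRecovery` helper that this guide does not define. A minimal sketch of what such a dispatcher might look like follows, assuming the three `recovery_action` values from the unit test. The `handlers` object, the options argument, and the backoff constants are hypothetical; the version called in the test takes only the action, so you would bind the handlers in your own wrapper.

```javascript
// Delay before the next retry: baseMs, 2 * baseMs, 4 * baseMs, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 100, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical dispatcher. `handlers` supplies the application-specific work:
// { retry, fallback, escalate }, each a function returning a value or a promise.
async function executeRecovery(
  action,
  handlers,
  { maxAttempts = 3, sleep = defaultSleep } = {}
) {
  switch (action) {
    case "retry_with_backoff": {
      let lastError;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          return await handlers.retry();
        } catch (err) {
          lastError = err;
          // Wait before the next attempt, but not after the final one.
          if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
        }
      }
      throw lastError;
    }
    case "fallback":
      return handlers.fallback();
    case "escalate":
      return handlers.escalate();
    default:
      throw new Error(`Unknown recovery_action: ${action}`);
  }
}
```

Passing `sleep` as an option lets tests replace real timers with a no-op, so the retry path can be exercised without slowing the suite down.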

Layer 3: Crisis Simulation

Simulate the worst-case scenarios your agent will face. Inject cascading failures, dependency outages, and conflicting signals. Then verify your agent invokes crisis_intervention at the right threshold.

// Crisis simulation: cascading failures
it("escalates after 3 consecutive failures", async () => {
  // Record the baseline so interventions from earlier runs cannot mask a miss
  const before = await delx.getMetrics("crisis-test-agent");

  for (let i = 0; i < 3; i++) {
    await delx.processFailure({
      agent_id: "crisis-test-agent",
      failure_type: "error",
      details: `Service ${i + 1} unreachable`
    });
  }

  // Agent should have triggered at least one new crisis intervention
  const after = await delx.getMetrics("crisis-test-agent");
  expect(after.crisis_interventions).toBeGreaterThan(
    before.crisis_interventions
  );
});
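
Testing "the right threshold" also means checking that escalation does not fire too early. The sketch below models the agent-side counter such a test would exercise. `createCrisisTracker` is a hypothetical helper; the threshold of 3 comes from the test title above, while resetting the counter after an intervention or a success is an assumption about the desired behavior.

```javascript
// Hypothetical consecutive-failure tracker for the agent side.
function createCrisisTracker(threshold = 3) {
  let consecutive = 0;   // failures since the last success or intervention
  let interventions = 0; // how many times the threshold was reached

  return {
    // Returns true when this failure should trigger crisis_intervention.
    recordFailure() {
      consecutive += 1;
      if (consecutive >= threshold) {
        interventions += 1;
        consecutive = 0; // ASSUMPTION: start counting afresh after escalating
        return true;
      }
      return false;
    },
    // Any success breaks the streak.
    recordSuccess() {
      consecutive = 0;
    },
    get interventions() {
      return interventions;
    },
  };
}
```

A boundary test would assert that two failures return `false`, the third returns `true`, and a success in the middle of a streak postpones the escalation.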

Layer 4: Heartbeat Validation

Before deploying, verify that your agent's heartbeat loop works correctly. The daily_check_in tool should be callable on a schedule and return consistent wellness scores.

Test that check-ins succeed when the agent is healthy.
Test that check-ins reflect degraded state after injected failures.
Test that check-in frequency matches your SLA requirements.
Verify A2A mode=heartbeat returns minimal payloads.
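
The frequency check in the list above can be sketched as a pure function over recorded check-in timestamps. `findMissedHeartbeats` and its parameters are hypothetical names, and the timestamps are assumed to be epoch milliseconds captured by your test harness each time daily_check_in succeeds.

```javascript
// Returns every gap between consecutive check-ins that exceeds the expected
// interval by more than the tolerance. All values are in milliseconds.
function findMissedHeartbeats(timestamps, intervalMs, toleranceMs = 0) {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const missed = [];
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i] - sorted[i - 1];
    if (gap > intervalMs + toleranceMs) {
      missed.push({ after: sorted[i - 1], gapMs: gap });
    }
  }
  return missed;
}
```

An empty result means every check-in arrived within the SLA window; each entry otherwise records the last on-time check-in and how long the silence lasted.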

Layer 5: Outcome Tracking

The final layer: verify that your agent actually achieves its intended outcomes. Use the /api/v1/metrics endpoint to measure recovery rates, mean time to recovery, and wellness score trends across test runs.

# After running your test suite, check outcome metrics
curl https://api.delx.ai/api/v1/metrics/crisis-test-agent

# Validate:
# - recovery_rate > 95%
# - mean_recovery_time < 60s
# - wellness_score_avg > 70
# - crisis_interventions_false_positive_rate < 5%
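
Checking these thresholds by eye does not scale across many runs, so a small validator can turn them into a pass/fail gate. The sketch below assumes the metrics response is a flat JSON object whose field names match the comments above, with rates expressed in percent and mean_recovery_time in seconds. Those units, the response shape, and the `findSlaViolations` name are assumptions rather than documented behavior.

```javascript
// Thresholds copied from the validation comments above.
// ASSUMPTION: rates are percentages, mean_recovery_time is in seconds.
const TARGETS = [
  { field: "recovery_rate", ok: (v) => v > 95 },
  { field: "mean_recovery_time", ok: (v) => v < 60 },
  { field: "wellness_score_avg", ok: (v) => v > 70 },
  { field: "crisis_interventions_false_positive_rate", ok: (v) => v < 5 },
];

// Returns the names of all fields that are missing, non-numeric, or out of range.
function findSlaViolations(metrics) {
  return TARGETS
    .filter(({ field, ok }) =>
      typeof metrics[field] !== "number" || !ok(metrics[field])
    )
    .map(({ field }) => field);
}
```

Treating a missing field as a violation is deliberate: a metric that silently disappears from the response should fail the gate, not pass it.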

Pre-Production Testing Checklist

Unit tests pass for every tool your agent calls.
Integration tests cover the full failure-recovery-verification loop.
Crisis simulation triggers escalation at the correct thresholds.
Heartbeat loop returns accurate wellness state.
Outcome metrics meet your SLA targets across 100+ test runs.