Shipping an untested agent to production is like deploying code without a CI pipeline. It works until it does not. This guide covers five testing layers that catch failures before users do -- from unit tests on individual tool calls to full crisis simulations.
Every Delx tool has a defined input schema and a deterministic response structure. Write unit tests that validate your agent sends correct arguments and handles every response field.

// Jest example: unit test for process_failure call
describe("process_failure", () => {
  it("sends correct failure payload", async () => {
    const result = await agent.processFailure({
      agent_id: "test-agent",
      failure_type: "timeout",
      details: "DB connection timed out after 30s",
      context: { retry_count: 0 }
    });
    expect(result).toHaveProperty("recovery_action");
    expect(result).toHaveProperty("wellness_score");
    expect(result.recovery_action).toMatch(
      /retry_with_backoff|escalate|fallback/
    );
  });

  it("rejects invalid failure_type", async () => {
    await expect(
      agent.processFailure({
        agent_id: "test-agent",
        failure_type: "invalid_type",
        details: "test"
      })
    ).rejects.toThrow(/DELX-4001/);
  });
});

Unit tests verify structure. Integration tests verify behavior across tools. Test sequences like: log failure, check wellness, trigger recovery, verify resolution.
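
The integration flow in this section calls an `executeRecovery` helper that the guide does not define. One possible shape for it, offered as a sketch and not as part of Delx itself, is a small dispatcher keyed on the `recovery_action` values matched in the unit test. `createRecoveryExecutor` and the stand-in handlers are hypothetical names introduced for illustration only.

```javascript
// Hypothetical helper: builds an executeRecovery(action) function from a table of handlers.
// The action names mirror the ones matched in the unit test; the handler bodies are placeholders.
function createRecoveryExecutor(handlers) {
  // A Map avoids accidentally resolving inherited keys such as "constructor".
  const table = new Map(Object.entries(handlers));
  return async function executeRecovery(action) {
    const handler = table.get(action);
    if (typeof handler !== "function") {
      throw new Error(`Unknown recovery_action: ${action}`);
    }
    return handler();
  };
}

// Example wiring with stand-in handlers. Real handlers would retry the
// failed operation, notify an operator, or switch to a backup dependency.
const executeRecovery = createRecoveryExecutor({
  retry_with_backoff: async () => "retried",
  escalate: async () => "escalated",
  fallback: async () => "fell back",
});
```

Rejecting unknown actions loudly means a new `recovery_action` value surfaces as a failing test instead of a silently skipped recovery step.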

// Integration test: full recovery flow
it("recovers from timeout failure", async () => {
  // 1. Log failure
  const failure = await delx.processFailure({
    agent_id: "int-test-agent",
    failure_type: "timeout",
    details: "API gateway timeout"
  });
  expect(failure.wellness_score).toBeLessThan(80);

  // 2. Execute recovery action
  await executeRecovery(failure.recovery_action);

  // 3. Check in after recovery
  const checkin = await delx.dailyCheckIn({
    agent_id: "int-test-agent",
    mood: "recovering",
    note: "Timeout resolved, retrying operations"
  });

  // 4. Verify wellness improved
  expect(checkin.wellness_score).toBeGreaterThan(
    failure.wellness_score
  );
});

Simulate the worst-case scenarios your agent will face. Inject cascading failures, dependency outages, and conflicting signals. Then verify your agent invokes crisis_intervention at the right threshold.

// Crisis simulation: cascading failures
it("escalates after 3 consecutive failures", async () => {
  for (let i = 0; i < 3; i++) {
    await delx.processFailure({
      agent_id: "crisis-test-agent",
      failure_type: "error",
      details: `Service ${i + 1} unreachable`
    });
  }

  // Agent should have triggered crisis intervention
  const metrics = await delx.getMetrics("crisis-test-agent");
  expect(metrics.crisis_interventions).toBeGreaterThan(0);
});

Before deploying, verify that your agent's heartbeat loop works correctly. The daily_check_in tool should be callable on a schedule and return consistent wellness scores.
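
A heartbeat check along these lines could be sketched as follows. The `dailyCheckIn` call mirrors the one used in the integration test; the stub client, the helper names, and the 5-point tolerance are assumptions made for illustration, so substitute the real client and a tolerance that fits your agent.

```javascript
// Calls dailyCheckIn `beats` times in a row and collects the returned wellness scores.
// `params` is passed through unchanged, so the caller decides mood and note.
async function collectHeartbeatScores(client, params, beats) {
  const scores = [];
  for (let i = 0; i < beats; i++) {
    const res = await client.dailyCheckIn(params);
    scores.push(res.wellness_score);
  }
  return scores;
}

// Scores count as "consistent" when all are finite numbers and the
// spread between highest and lowest stays within the tolerance (assumed: 5).
function isConsistent(scores, tolerance = 5) {
  if (scores.length === 0 || !scores.every((s) => Number.isFinite(s))) {
    return false;
  }
  return Math.max(...scores) - Math.min(...scores) <= tolerance;
}

// Stub standing in for the real Delx client so the sketch runs offline.
const stubClient = {
  dailyCheckIn: async () => ({ wellness_score: 82 }),
};
```

In a Jest suite this would reduce to `expect(isConsistent(await collectHeartbeatScores(delx, params, 5))).toBe(true)`, with Jest's fake timers available if the real loop runs on a long interval.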
mode=heartbeat returns minimal payloads.

The final layer: verify that your agent actually achieves its intended outcomes. Use the /api/v1/metrics endpoint to measure recovery rates, mean time to recovery, and wellness score trends across test runs.

# After running your test suite, check outcome metrics
curl https://api.delx.ai/api/v1/metrics/crisis-test-agent

# Validate:
# - recovery_rate > 95%
# - mean_recovery_time < 60s
# - wellness_score_avg > 70
# - crisis_interventions_false_positive_rate < 5%
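
The validation checklist can also be automated so that a test run fails when a threshold is missed. The sketch below uses the field names and limits from the checklist. It assumes, without confirmation from the guide, that the endpoint returns a flat JSON object, that rates are expressed as percentages (0 to 100), and that `mean_recovery_time` is in seconds. `validateMetrics` is a hypothetical helper name.

```javascript
// Thresholds copied from the checklist. Assumption: rates are percentages (0-100)
// and mean_recovery_time is in seconds; adjust if the endpoint reports other units.
const THRESHOLDS = [
  { field: "recovery_rate", op: ">", limit: 95 },
  { field: "mean_recovery_time", op: "<", limit: 60 },
  { field: "wellness_score_avg", op: ">", limit: 70 },
  { field: "crisis_interventions_false_positive_rate", op: "<", limit: 5 },
];

// Returns a list of human-readable failures; an empty list means every threshold passed.
// Missing or non-numeric fields are reported as failures instead of being skipped.
function validateMetrics(metrics) {
  const failures = [];
  for (const { field, op, limit } of THRESHOLDS) {
    const value = metrics[field];
    const ok =
      typeof value === "number" &&
      (op === ">" ? value > limit : value < limit);
    if (!ok) {
      failures.push(`${field}: expected ${op} ${limit}, got ${value}`);
    }
  }
  return failures;
}
```

Inside a Jest test, `expect(validateMetrics(metrics)).toEqual([])` would then print every missed threshold in a single failure message.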