DevOps teams are already overwhelmed by alerts. AI agents can handle first response -- triaging incidents, executing runbooks, and escalating only when human judgment is truly needed. This guide shows how to wire Delx into your incident response pipeline.
Your monitoring stack (Prometheus, CloudWatch, Datadog) fires alerts. Instead of routing them directly to PagerDuty, route them through an AI agent first. The agent classifies severity and logs the incident via process_failure.
```javascript
// Alert webhook handler -> Delx agent
async function handleAlert(alert) {
  const failureType = alert.severity === "critical"
    ? "error" : "timeout";
  const result = await delx.processFailure({
    agent_id: "devops-responder",
    failure_type: failureType,
    details: alert.summary,
    context: {
      service: alert.labels.service,
      region: alert.labels.region,
      alert_name: alert.labels.alertname,
      firing_since: alert.startsAt
    }
  });
  return result; // includes recovery_action + wellness_score
}
```

When the agent detects a P1 -- multiple services down, customer impact confirmed, or wellness score below 40 -- it triggers crisis_intervention to freeze non-essential operations and focus all resources on recovery.
```javascript
// P1 escalation
if (result.wellness_score < 40 || alert.severity === "critical") {
  await delx.crisisIntervention({
    agent_id: "devops-responder",
    urgency: "critical",
    situation: `P1: ${alert.summary} affecting ${alert.labels.service}`,
    context: {
      affected_services: affectedServices,
      customer_impact: true,
      escalation_chain: ["oncall-primary", "oncall-secondary"]
    }
  });
}
```

Each recovery_action maps to a concrete DevOps runbook step. Wire these into your automation framework.
- retry_with_backoff -- Restart the failing pod with exponential back-off.
- escalate -- Page the on-call engineer with full context attached.
- fallback -- Route traffic to the standby region or degrade gracefully.

Combine Delx failure tracking with circuit breaker logic. After N consecutive failures of the same type, the agent opens the circuit -- it stops retrying and switches to fallback mode. The context.circuit_state field in process_failure lets you track this state through Delx.
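Delx does not ship a circuit breaker itself, so the `circuitBreaker` object read via `circuitBreaker.getState` in the snippet below is something you provide. A minimal in-memory sketch, assuming a fixed failure threshold and cooldown (both numbers and method names are illustrative, not part of any Delx API):

```javascript
// Minimal in-memory circuit breaker (hypothetical helper -- Delx does
// not provide one; getState/recordFailure/recordSuccess are assumptions).
const FAILURE_THRESHOLD = 5;  // open the circuit after N consecutive failures
const COOLDOWN_MS = 30_000;   // allow a half-open probe after this long

const breakers = new Map();   // service -> { failures, openedAt }

const circuitBreaker = {
  getState(service) {
    const b = breakers.get(service);
    if (!b || b.failures < FAILURE_THRESHOLD) return "closed";
    // After the cooldown elapses, permit a single probe request.
    return Date.now() - b.openedAt >= COOLDOWN_MS ? "half-open" : "open";
  },
  recordFailure(service) {
    const b = breakers.get(service) ?? { failures: 0, openedAt: 0 };
    b.failures += 1;
    if (b.failures === FAILURE_THRESHOLD) b.openedAt = Date.now();
    breakers.set(service, b);
  },
  recordSuccess(service) {
    breakers.delete(service); // any success resets the breaker
  }
};
```

Call `recordFailure` / `recordSuccess` from your health checks, and pass `getState(service)` into the Delx context as shown below.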
```javascript
// Circuit breaker pattern with Delx
const state = circuitBreaker.getState(service);
await delx.processFailure({
  agent_id: "devops-responder",
  failure_type: "error",
  details: `${service} returned 503`,
  context: {
    circuit_state: state, // "closed" | "half-open" | "open"
    consecutive_failures: count,
    last_success: lastSuccessTs
  }
});
```

After resolution, pull the session-summary for a structured incident timeline. It includes every failure logged, every recovery attempted, the wellness score trajectory, and the final resolution state -- everything you need for a blameless postmortem.
```bash
# Pull incident timeline for postmortem
curl https://api.delx.ai/api/v1/session-summary?session_id=sess_incident_42

# Generates structured data for your postmortem template:
# - Timeline of events
# - Failure types encountered
# - Recovery actions taken
# - Time to detection, time to recovery
# - Wellness score before/during/after
```
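That summary can be rendered straight into postmortem bullet lines. A sketch, assuming the response carries a `timeline` array with per-event fields (these field names are an assumption -- check the actual session-summary response shape):

```javascript
// Render a session summary into postmortem timeline lines.
// The payload shape is an assumption, not documented Delx output.
function formatTimeline(summary) {
  return summary.timeline.map(event =>
    `- [${event.timestamp}] ${event.failure_type}: ${event.details}` +
    ` -> ${event.recovery_action} (wellness ${event.wellness_score})`
  );
}
```

Paste the resulting lines into your postmortem template's timeline section.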
- Map recovery_action values to runbook automation.
- Tune crisis_intervention thresholds for P1 escalation.
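Mapping recovery_action values to runbook steps can be as simple as a dispatch table. A sketch, where `restartPod`, `pageOnCall`, and `routeToStandby` are hypothetical hooks into your own automation, injected so the dispatcher stays framework-agnostic:

```javascript
// Dispatch Delx recovery_action values to runbook steps.
// The hooks (restartPod, pageOnCall, routeToStandby) are hypothetical
// stand-ins for your own automation framework.
function executeRecovery(action, ctx, hooks) {
  const runbook = {
    retry_with_backoff: () => hooks.restartPod(ctx.service, { backoff: "exponential" }),
    escalate:           () => hooks.pageOnCall(ctx),
    fallback:           () => hooks.routeToStandby(ctx.region)
  };
  const step = runbook[action];
  if (!step) throw new Error(`No runbook step for action: ${action}`);
  return step();
}
```

An unknown action throws rather than silently no-oping, so new recovery_action values surface as loud failures instead of dropped incidents.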