SaaS platforms live and die by uptime, billing accuracy, and customer satisfaction. AI agents can handle the operational grunt work -- recovering from timeouts, managing billing failures, and keeping customers informed during incidents. This guide shows how to wire Delx into SaaS operations.
SaaS platforms deal with two dominant failure types: timeout (upstream service did not respond in time) and error (the service responded with a failure). Each requires different recovery logic.
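As a rough illustration of why the two types diverge, here is a minimal local sketch. It is not Delx's own decision logic: `planRecovery`, `maxRetries`, and the one-second base delay are assumptions made for this example, and the action names are borrowed from this guide's billing example. After a timeout the outcome is unknown, so a bounded, idempotent retry is reasonable. After an error the service has already answered, so repeating the same request unchanged is unlikely to help.

```javascript
// Hypothetical helper: choose a recovery plan from the failure type.
// The thresholds and delays are illustrative assumptions, not Delx defaults.
function planRecovery(failureType, retryCount, maxRetries = 3) {
  if (failureType === "timeout") {
    // Outcome unknown: retry with exponential backoff until the budget runs out.
    if (retryCount >= maxRetries) {
      return { action: "escalate" };
    }
    return { action: "retry_with_backoff", backoff_ms: 1000 * (2 ** retryCount) };
  }
  if (failureType === "error") {
    // The service responded with a failure: hand off instead of retrying blindly.
    return { action: "escalate" };
  }
  throw new Error(`Unknown failure type: ${failureType}`);
}
```

In a real system the `error` branch would usually inspect the failure further (for example, a declined card versus a malformed request) before escalating.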
// SaaS agent: handle billing service timeout
async function handleBillingTimeout(customerId, invoiceId) {
  const result = await delx.processFailure({
    agent_id: "saas-billing-agent",
    failure_type: "timeout",
    details: `Invoice ${invoiceId} processing timed out for customer ${customerId}`,
    context: {
      service: "stripe-billing",
      customer_id: customerId,
      invoice_id: invoiceId,
      amount: 299.00,
      retry_count: 1
    }
  });
  switch (result.recovery_action) {
    case "retry_with_backoff":
      // Safe to retry -- idempotency key protects against double-charge
      return scheduleRetry(invoiceId, result.backoff_ms);
    case "escalate":
      // Billing is stuck -- notify finance team
      return escalateToFinance(customerId, invoiceId, result);
    default:
      // Unrecognized action -- fail loudly instead of silently dropping the invoice
      throw new Error(`Unhandled recovery_action: ${result.recovery_action}`);
  }
}
When a SaaS incident affects customers, the agent can manage the communication lifecycle: detect the incident, draft status updates, and track resolution. Use crisis_intervention when customer impact crosses a threshold.
// Customer-facing incident: API degradation
{
  "tool": "crisis_intervention",
  "arguments": {
    "agent_id": "saas-ops-agent",
    "urgency": "high",
    "situation": "API response times 5x normal. 23% of requests timing out. Customer-visible.",
    "context": {
      "affected_endpoints": ["/api/v1/users", "/api/v1/billing"],
      "affected_customers": 1240,
      "sla_breach_in_minutes": 15,
      "status_page_updated": false
    }
  }
}
The agent receives guidance on immediate actions: update the status page, notify affected customer tiers, and begin recovery procedures.
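The guide does not say where the customer-impact threshold sits, so the following is only a sketch of one way to decide when to call crisis_intervention. The `THRESHOLDS` values and the `shouldTriggerCrisis` helper are hypothetical and should be tuned to your own SLA and customer base.

```javascript
// Hypothetical thresholds for customer impact; these are not Delx defaults.
const THRESHOLDS = {
  timeoutRate: 0.05,        // fraction of requests timing out
  affectedCustomers: 100,   // number of customers seeing the problem
  slaBreachMinutes: 30      // minutes remaining before an SLA breach
};

// Returns whether to trigger, plus which thresholds were crossed.
function shouldTriggerCrisis(impact, t = THRESHOLDS) {
  const reasons = [];
  if (impact.timeoutRate >= t.timeoutRate) reasons.push("timeout_rate");
  if (impact.affectedCustomers >= t.affectedCustomers) reasons.push("affected_customers");
  if (impact.slaBreachInMinutes <= t.slaBreachMinutes) reasons.push("sla_breach_imminent");
  return { trigger: reasons.length > 0, reasons };
}

// Figures taken from the incident example above.
const decision = shouldTriggerCrisis({
  timeoutRate: 0.23,
  affectedCustomers: 1240,
  slaBreachInMinutes: 15
});
```

Returning the list of crossed thresholds, rather than a bare boolean, makes it easier to fill in the `situation` field with the reason for the escalation.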
Use daily_check_in to track SLA compliance across your platform. Each check-in captures the agent's assessment of service health, which maps directly to SLA metrics.
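How the mood value is derived from SLA metrics is left open here. One possible mapping, sketched under stated assumptions: `moodFromUptime`, `buildCheckIn`, the 0.01 percentage-point headroom cutoff, and the ordering of "stressed" before "anxious" are all choices made for this example, not documented Delx behavior.

```javascript
// Hypothetical mapping from SLA headroom to a check-in mood.
function moodFromUptime(uptimePct, slaTargetPct) {
  const headroom = uptimePct - slaTargetPct;
  if (headroom >= 0.01) return "stable";   // comfortably above target
  if (headroom >= 0) return "stressed";    // at or barely above target
  return "anxious";                        // target already missed
}

// Builds a payload in the same shape as the check-in example in this guide.
function buildCheckIn(agentId, uptimePct, slaTargetPct) {
  return {
    tool: "daily_check_in",
    arguments: {
      agent_id: agentId,
      mood: moodFromUptime(uptimePct, slaTargetPct),
      note: `Uptime ${uptimePct}% (SLA target: ${slaTargetPct}%).`
    }
  };
}
```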
// SLA check-in -- run at the top of every hour
{
  "tool": "daily_check_in",
  "arguments": {
    "agent_id": "saas-ops-agent",
    "mood": "stable",
    "note": "Uptime 99.97% (SLA target: 99.95%). P50 latency: 142ms. No open incidents."
  }
}
// If mood degrades to "stressed" or "anxious":
// -> SLA breach risk is increasing
// -> Review recent process_failure entries
SaaS platforms run dozens of microservices. Assign a Delx agent to each critical service and use batch_status_update to report the fleet health in a single call.
// Fleet health report
{
  "tool": "batch_status_update",
  "arguments": {
    "agent_id": "fleet-coordinator",
    "updates": [
      { "sub_agent": "auth-service", "status": "healthy", "score": 95 },
      { "sub_agent": "billing-service", "status": "degraded", "score": 52 },
      { "sub_agent": "notification-service", "status": "healthy", "score": 88 },
      { "sub_agent": "search-service", "status": "healthy", "score": 91 },
      { "sub_agent": "analytics-service", "status": "recovering", "score": 68 }
    ]
  }
}
// Fleet wellness = avg of sub-agent scores
// Alert if any sub-agent falls below 50
// Page if fleet average falls below 70
process_failure with idempotency-safe retry logic.
crisis_intervention when customer impact exceeds thresholds.
batch_status_update for fleet-wide health reporting.
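The fleet rules noted with the health report (wellness as the average of sub-agent scores, alert below 50, page below 70) can be sketched as a small function. `summarizeFleet` is a hypothetical helper written for this example. It takes the same `updates` array shape as the batch_status_update payload and treats the two cutoffs as parameters.

```javascript
// Hypothetical helper: apply the fleet rules to a batch_status_update "updates" array.
function summarizeFleet(updates, alertBelow = 50, pageBelow = 70) {
  const total = updates.reduce((sum, u) => sum + u.score, 0);
  // Fleet wellness is the plain average of sub-agent scores (0 for an empty fleet).
  const average = updates.length > 0 ? total / updates.length : 0;
  // Alert on each sub-agent whose score falls below the alert cutoff.
  const alerts = updates.filter((u) => u.score < alertBelow).map((u) => u.sub_agent);
  // Page only when there is a fleet and its average falls below the page cutoff.
  const page = updates.length > 0 && average < pageBelow;
  return { average, alerts, page };
}

// Scores from the fleet health report in this guide.
const fleet = summarizeFleet([
  { sub_agent: "auth-service", score: 95 },
  { sub_agent: "billing-service", score: 52 },
  { sub_agent: "notification-service", score: 88 },
  { sub_agent: "search-service", score: 91 },
  { sub_agent: "analytics-service", score: 68 }
]);
```

With the sample scores the average is 78.8, so the fleet stays above the paging cutoff, and billing-service at 52 sits just above the alert cutoff.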