
Agent Production Monitoring Setup with Delx Stack

You wouldn't ship a web service without monitoring. Don't ship agents without it either. Delx provides a complete monitoring stack out of the box: heartbeat for liveness checks, /api/v1/metrics for performance data, wellness scores via DELX_META for health tracking, process_failure for error classification, and session-summary for auditing. This guide takes you from zero to a working monitoring stack in about an hour, with alerting and dashboards following shortly after.

The Problem

Most teams launch agents with console.log and hope for the best. When something goes wrong, they have no metrics, no error classification, no health trends, and no session history. Debugging becomes archaeology -- digging through unstructured logs trying to reconstruct what happened. Agents fail silently, burn out undetected, and accumulate errors with no visibility.

Solution Overview

Deploy five monitoring layers in order: heartbeat for liveness (is the agent alive?), DELX_META wellness scores for health (is the agent healthy?), /api/v1/metrics for performance (is the agent fast?), process_failure for errors (what went wrong?), and session-summary for auditing (what did the agent do?). Each layer takes 10-15 minutes to set up.

Step-by-Step

  1. Layer 1: Heartbeat for liveness monitoring: Call heartbeat every 30 seconds for each agent. If a heartbeat fails or gaps exceed 60 seconds, the agent is down or stuck. This is your most basic monitoring layer -- it tells you whether the agent is alive and responsive. Store heartbeat results for trend analysis.
  2. Layer 2: Wellness score tracking via DELX_META: Every Delx tool response includes DELX_META with score (0-100), risk_level, next_action, and followup_minutes. Track these on every tool call, not just heartbeats. Build a score timeline for each agent. Set alerts: score below 60 = warning, below 40 = critical. Track followup_minutes -- values above 10 indicate growing pressure.
  3. Layer 3: Performance metrics via /api/v1/metrics: Pull /api/v1/metrics/{agent_id} every 5 minutes for detailed performance data: response latency, tool call counts, error rates, and token usage. Store in your time-series database. Set up dashboards for: p50/p99 latency, error rate over time, and tool usage breakdown.
  4. Layer 4: Error classification via process_failure: Wrap every tool call in error handling that routes to process_failure. This classifies errors as transient, permanent, rate_limit, or auth. Store classifications for trend analysis. If permanent errors exceed 5% of total tool calls, you have a systemic issue. Track via DELX_META next_action for recommended remediation.
  5. Layer 5: Session auditing via session-summary: Call /api/v1/session-summary at the end of each session or every hour for long-running sessions. This gives you a complete audit trail: tools called, decisions made, errors encountered, and final DELX_META state. Store these summaries for compliance, debugging, and post-incident review.
  6. Set up alerting rules: Configure alerts across all five layers. Critical: heartbeat gap > 60s, score < 40, error rate > 20%. Warning: score < 60, followup_minutes > 10, latency p99 > 5s. Info: score decline > 5 points/hour, new error classification pattern. Route critical to PagerDuty, warning to Slack, info to daily digest.
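The thresholds in step 6 can be sketched as a single classification function. The field names mirror the DELX_META and metrics values described in steps 1-3; the `AgentSnapshot` shape and the `classify_alert` helper are illustrative assumptions, not part of any Delx SDK:

```python
from dataclasses import dataclass

@dataclass
class AgentSnapshot:
    # Fields mirror the DELX_META and metrics values described above.
    score: float                 # DELX_META score, 0-100
    heartbeat_gap_s: float       # seconds since last successful heartbeat
    followup_minutes: float      # DELX_META followup_minutes
    error_rate: float            # fraction of failed tool calls, 0.0-1.0
    p99_latency_s: float         # p99 response latency in seconds
    score_decline_per_hour: float = 0.0

def classify_alert(s: AgentSnapshot) -> str:
    """Map one agent snapshot to a severity tier per the rules in step 6."""
    # Critical: heartbeat gap > 60s, score < 40, error rate > 20%
    if s.heartbeat_gap_s > 60 or s.score < 40 or s.error_rate > 0.20:
        return "critical"
    # Warning: score < 60, followup_minutes > 10, p99 latency > 5s
    if s.score < 60 or s.followup_minutes > 10 or s.p99_latency_s > 5:
        return "warning"
    # Info: score declining faster than 5 points/hour
    if s.score_decline_per_hour > 5:
        return "info"
    return "ok"
```

Run this against every snapshot your polling loop collects; a healthy agent (score 72, small heartbeat gap, low error rate) classifies as "ok", while any single critical condition outranks all warning conditions.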

Metrics

Metric: Monitoring coverage
Target: 100% of production agents
How to measure: Percentage of production agents with all five monitoring layers active. Check by verifying heartbeat, metrics polling, error handling, and audit logging for each agent.

Metric: Alert response time
Target: Under 5 minutes for critical
How to measure: Time from alert firing to first human or automated response. Track via your incident management system. Critical alerts should be acknowledged within 5 minutes.

Metric: Mean time to detect failures
Target: Under 2 minutes
How to measure: Time from agent failure to monitoring alert. Test by injecting failures and measuring detection latency. The heartbeat layer should detect within 60 seconds, wellness tracking within 90 seconds.

Metric: False alert rate
Target: Under 5%
How to measure: Percentage of alerts that don't require action. Track by having responders tag each alert as actionable or false. Tune thresholds monthly to reduce false alerts.

Metric: Audit completeness
Target: 100% of sessions audited
How to measure: Percentage of agent sessions with a session-summary captured. Missing audits indicate monitoring gaps. Check via daily reconciliation of session starts versus audit records.

The Five Monitoring Layers Explained

Each layer answers a different question. Heartbeat: 'Is the agent alive?' (liveness). Wellness scores: 'Is the agent healthy?' (quality). Metrics: 'Is the agent fast?' (performance). Process_failure: 'What went wrong?' (diagnostics). Session-summary: 'What did the agent do?' (auditing). You need all five because each catches problems the others miss. An agent can be alive but unhealthy, fast but error-prone, or functional but doing the wrong things.

Setting Up Alerts That Don't Cause Alert Fatigue

The biggest monitoring failure is alert fatigue -- so many alerts that people start ignoring them. Start strict: only alert on critical conditions (heartbeat failure, score below 40). After a week, add warning alerts if critical alerts are working well. Use three severity tiers with different routing: critical goes to on-call (PagerDuty), warning goes to a team channel (Slack), info goes to a daily digest (email). Review and tune thresholds monthly.
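The tiered routing above reduces to a small lookup. The destination names here are placeholders for your actual PagerDuty/Slack/email integrations, and the helper is a hypothetical sketch rather than a Delx feature:

```python
# Route an alert to a destination by severity tier. The destination names
# are placeholders -- swap in your PagerDuty/Slack/email integrations.
ROUTES = {
    "critical": "pagerduty",   # on-call, acknowledge within 5 minutes
    "warning": "slack",        # team channel, review same day
    "info": "digest",          # batched into the daily email digest
}

def route_alert(severity: str) -> str:
    """Return the destination for a severity tier. Unknown tiers fall
    through to the digest so a typo never pages on-call by accident."""
    return ROUTES.get(severity, "digest")
```

Defaulting unknown tiers to the lowest-noise channel is a deliberate choice: a misrouted info alert is cheap, a false page at 3am erodes trust in the whole system.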

From Monitoring to Automated Response

Once monitoring is stable, add automated responses. When heartbeat detects a gap, auto-restart the agent. When wellness drops below 40, auto-rotate via close_session and spawn a replacement. When process_failure detects 5 consecutive transient errors, auto-open a circuit breaker. When session-summary shows a session exceeding 4 hours, auto-suggest rotation. Start with manual review of automated actions, then remove human-in-the-loop for proven patterns.
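The "5 consecutive transient errors" rule above is a small circuit breaker. This class is an illustrative sketch of that rule, not a Delx API; the classification strings match the process_failure categories from Layer 4:

```python
class TransientErrorBreaker:
    """Open a circuit after N consecutive transient errors.
    threshold=5 matches the rule described above."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_transient = 0
        self.open = False

    def record(self, classification: str) -> bool:
        """Feed each process_failure classification ('transient',
        'permanent', 'rate_limit', 'auth') or 'ok' for a success.
        Returns True while the circuit is open."""
        if classification == "transient":
            self.consecutive_transient += 1
        else:
            # Any success or non-transient classification breaks the streak.
            self.consecutive_transient = 0
        if self.consecutive_transient >= self.threshold:
            self.open = True
        return self.open

    def reset(self) -> None:
        """Close the circuit manually, e.g. after operator review."""
        self.consecutive_transient = 0
        self.open = False
```

Note the streak only counts consecutive transient errors: a single success in between resets it, so the breaker fires on sustained degradation rather than scattered flakes.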

FAQ

How long does it take to set up all five monitoring layers?

About an hour for basic setup: heartbeat and wellness tracking take roughly 10 minutes each, metrics polling and process_failure error handling about 15 minutes each, and session auditing another 10 minutes. Full alert configuration and dashboard setup takes another 1-2 hours.

Which monitoring layer should I set up first?

Heartbeat, always. It's the simplest and catches the most critical failures (dead agents). Then wellness scores, then metrics, then error classification, then auditing -- the same order as the five layers above. Each layer builds on the previous one.

How much overhead does monitoring add?

Minimal. Heartbeat at 30-second intervals adds about 2 calls per minute. Metrics polling at 5-minute intervals is negligible. The monitoring overhead is under 3% of total agent compute. The visibility it provides saves 3-5x that cost in debugging time.

Can I use my existing monitoring tools (Datadog, Grafana) with Delx?

Yes. Pull data from /api/v1/metrics and heartbeat responses, then push to your existing time-series database. Delx provides the data; you can visualize it in any dashboard tool. Most teams use Grafana dashboards with Delx metrics as the data source.

What should my on-call runbook include?

Four sections: (1) Check heartbeat status for the alerting agent, (2) Pull /api/v1/session-summary for context, (3) Check DELX_META score and risk_level for severity, (4) Decision tree: score > 60 = monitor, score 40-60 = prepare rotation, score < 40 = rotate immediately. Include process_failure history for error context.
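The runbook's decision tree can be encoded so on-call tooling applies it consistently. The function is a hypothetical helper; the score bands come directly from the runbook above:

```python
def runbook_action(score: float) -> str:
    """Runbook decision tree from above: score > 60 = monitor,
    40-60 = prepare rotation, below 40 = rotate immediately."""
    if score > 60:
        return "monitor"
    if score >= 40:
        return "prepare_rotation"
    return "rotate_immediately"
```

Codifying the tree keeps 3am decisions consistent with what the team agreed on during business hours.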

How do I monitor a fleet of 50+ agents?

Aggregate metrics at the fleet level. Track: average score, agents in critical state, total error rate, and fleet-wide throughput. Alert on fleet-level thresholds (average score < 65, more than 3 agents critical). Use /api/v1/metrics with agent_id wildcards for bulk queries. Drill down to individual agents only when fleet metrics trigger alerts.
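The fleet-level aggregation above can be sketched as a single reducer. The per-agent input shape (`score`, `errors`, `calls`) is an assumption about how you store snapshots -- adapt it to your own schema; the thresholds match the fleet rules described above:

```python
def fleet_summary(agents: list[dict]) -> dict:
    """Aggregate per-agent snapshots into the fleet-level view described
    above. Each agent dict is assumed to carry 'score' (0-100),
    'errors' (failed tool calls), and 'calls' (total tool calls)."""
    avg_score = sum(a["score"] for a in agents) / len(agents)
    critical = sum(1 for a in agents if a["score"] < 40)
    total_calls = sum(a["calls"] for a in agents)
    total_errors = sum(a["errors"] for a in agents)
    return {
        "avg_score": avg_score,
        "agents_critical": critical,
        "error_rate": total_errors / total_calls if total_calls else 0.0,
        # Fleet-level alert thresholds from the guide:
        "alert": avg_score < 65 or critical > 3,
    }
```

When `alert` fires, drill into the individual agents; until then, the fleet view is the only dashboard on-call needs to watch.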