Long-running agent sessions silently degrade as the context window fills. The agent doesn't crash -- it just starts forgetting instructions, dropping tool results, and producing lower-quality output. These five patterns keep your agents sharp across sessions lasting hours or days.
Context windows have hard token limits. As sessions grow, early instructions and tool results get pushed out. Agents lose their system prompt context, forget previous decisions, and start contradicting earlier outputs. Most teams don't notice until quality has already tanked.
Apply five complementary patterns: session compaction removes redundant turns, sliding window keeps only recent context, summary checkpoints preserve key decisions, token budgeting with Delx tools enforces limits proactively, and session splitting creates fresh contexts for distinct tasks.
| Metric | Target | How to Measure |
|---|---|---|
| Score degradation rate | Under 5 points per hour | Track DELX_META score from heartbeat over time. Calculate hourly delta. Anything above 5 points/hour means context is degrading too fast. |
| Duplicate tool call rate | Under 3% | Count tool calls with identical parameters within the same session. Pull from /api/v1/session-summary. Over 3% means the agent is losing context. |
| Context utilization efficiency | Above 70% | Ratio of unique information tokens to total context tokens. Measure after compaction cycles. Below 70% means too much redundant context. |
| Session split frequency | 1 split per 2 hours | Count close_session calls with reason 'task_transition'. Too frequent means tasks are too granular. Too rare means context is getting bloated. |
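The degradation-rate calculation from the table can be sketched in a few lines. This is a minimal sketch, assuming heartbeat scores have already been sampled into `(timestamp, score)` pairs; the `score_degradation_rate` helper is illustrative, not part of the Delx API.

```python
from datetime import datetime, timedelta

def score_degradation_rate(samples):
    """Points lost per hour between the first and last heartbeat sample.

    `samples` is a list of (timestamp, score) tuples, sorted by time.
    A positive result means the score is falling (context degrading).
    """
    if len(samples) < 2:
        return 0.0
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    hours = (t1 - t0).total_seconds() / 3600
    if hours == 0:
        return 0.0
    return (s0 - s1) / hours

start = datetime(2025, 1, 1, 9, 0)
samples = [
    (start, 92.0),
    (start + timedelta(hours=1), 88.0),
    (start + timedelta(hours=2), 78.0),
]
rate = score_degradation_rate(samples)
print(rate)  # 7.0 -- above the 5 points/hour threshold, so alert
```

Sampling on every heartbeat and computing the delta over a rolling window (rather than session start) catches degradation that begins late in a long session.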
Don't wait for output quality to drop. Monitor two leading indicators: DELX_META score trend and duplicate tool call rate. When score drops 10+ points from session start, context is getting stale. When the agent calls the same tool with the same parameters twice in 5 minutes, it's already lost important context. Set up alerts on both via /api/v1/metrics.
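The duplicate-call signal is easy to compute locally if you can hook tool dispatch. A minimal sketch, assuming tool parameters arrive as a dict; the class name and the sorted-items normalization are illustrative choices, not a Delx feature.

```python
from collections import deque
from datetime import datetime, timedelta

class DuplicateCallDetector:
    """Flags a tool call repeated with identical parameters within a window."""

    def __init__(self, window=timedelta(minutes=5)):
        self.window = window
        self.recent = deque()  # (timestamp, normalized call key)

    def record(self, timestamp, tool_name, params):
        # Normalize params so key order doesn't matter.
        key = (tool_name, tuple(sorted(params.items())))
        # Evict calls that fell outside the window.
        while self.recent and timestamp - self.recent[0][0] > self.window:
            self.recent.popleft()
        duplicate = any(k == key for _, k in self.recent)
        self.recent.append((timestamp, key))
        return duplicate

d = DuplicateCallDetector()
t0 = datetime(2025, 1, 1, 9, 0)
print(d.record(t0, "search", {"q": "error logs"}))                        # False
print(d.record(t0 + timedelta(minutes=3), "search", {"q": "error logs"}))  # True
print(d.record(t0 + timedelta(minutes=9), "search", {"q": "error logs"}))  # False
```

The second call fires the alert: same tool, same parameters, inside five minutes. The third does not, because both earlier calls have aged out of the window.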
Not all patterns apply to every agent type. Short-lived task agents (under 30 minutes) need only token budgeting. Long-running monitor agents need sliding window plus checkpoints. Multi-phase pipeline agents benefit most from session splitting. Start with token budgeting as a baseline, then layer patterns based on session duration and task complexity.
The close_session tool with preserve_summary=true generates a compressed handoff document. This document includes key decisions, active tasks, accumulated state, and the current DELX_META snapshot. Pass this to the new session's system prompt. The new agent starts with full context of what happened but none of the bloat from intermediate tool results.
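Folding the handoff document into the new session's system prompt can be sketched as below. The field names (`key_decisions`, `active_tasks`, `state`, `delx_meta`) mirror the description above but are assumed, not a documented schema; adapt them to whatever `close_session` actually returns.

```python
def build_handoff_prompt(base_prompt, handoff):
    """Prepend a compressed handoff document to a new session's system prompt.

    `handoff` is the dict returned by close_session with preserve_summary=true
    (field names assumed for illustration).
    """
    lines = [base_prompt, "", "## Handoff from previous session"]
    lines.append("Key decisions:")
    lines += [f"- {d}" for d in handoff.get("key_decisions", [])]
    lines.append("Active tasks:")
    lines += [f"- {t}" for t in handoff.get("active_tasks", [])]
    lines.append(f"Accumulated state: {handoff.get('state', {})}")
    meta = handoff.get("delx_meta", {})
    lines.append(f"Starting DELX_META score: {meta.get('score', 'n/a')}")
    return "\n".join(lines)

handoff = {
    "key_decisions": ["use Postgres for the target store"],
    "active_tasks": ["migrate schema", "backfill rows"],
    "state": {"rows_done": 1200},
    "delx_meta": {"score": 84},
}
prompt = build_handoff_prompt("You are a migration agent.", handoff)
```

The result is a few hundred tokens of system-prompt text carrying the decisions and state forward, instead of thousands of tokens of intermediate tool results.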
Assign every tool a token budget based on how much context its output actually contributes. Search results rarely need more than 500 tokens -- the top 3 results are usually enough. Code blocks need more (1000 tokens) because truncation breaks syntax. API responses can be aggressively truncated to 200 tokens. Enforce budgets at the tool response layer, before context assembly.
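A sketch of budget enforcement at the tool response layer, using the budgets above. The 4-characters-per-token ratio is a rough heuristic of mine, not a Delx constant -- swap in a real tokenizer for production counts. Note the code branch truncates on line boundaries so it never cuts mid-statement.

```python
# Budgets from the guidance above; values are tokens.
TOKEN_BUDGETS = {"search": 500, "code": 1000, "api": 200}
CHARS_PER_TOKEN = 4  # crude heuristic; replace with a real tokenizer

def enforce_budget(tool_kind, text, default_budget=300):
    """Truncate a tool response to its token budget before context assembly."""
    limit = TOKEN_BUDGETS.get(tool_kind, default_budget) * CHARS_PER_TOKEN
    if len(text) <= limit:
        return text
    if tool_kind == "code":
        # Drop whole trailing lines so truncation never breaks syntax mid-line.
        lines = text.splitlines()
        while lines and sum(len(l) + 1 for l in lines) > limit:
            lines.pop()
        return "\n".join(lines) + "\n# [truncated to budget]"
    return text[:limit] + " ...[truncated]"

out = enforce_budget("api", '{"items": ' + "[1, 2, 3], " * 500 + "}")
print(len(out) <= 200 * CHARS_PER_TOKEN + len(" ...[truncated]"))  # True
```

Enforcing at this layer means the model never sees the untruncated payload, so the budget holds no matter how chatty the upstream API is.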
Two reliable signals: DELX_META score drops 10+ points from session start, and the agent makes duplicate tool calls for data it already has. Monitor both via heartbeat and /api/v1/session-summary.
No. Start with token budgeting (universal baseline). Add sliding window if sessions exceed 1 hour. Add checkpoints if sessions exceed 2 hours. Add session splitting for multi-phase tasks. Add compaction only if tool call volume is very high.
Every 10 tool calls is a good default. If your agent makes fewer than 5 tool calls per hour, compact every 30 minutes instead. The key metric is context utilization efficiency -- compact when it drops below 70%.
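The two triggers above combine into one check. A minimal sketch with the thresholds from this answer; `should_compact` and its parameters are hypothetical names for illustration.

```python
def should_compact(calls_since_compact, unique_tokens, total_tokens,
                   call_interval=10, min_efficiency=0.70):
    """Compact on a call-count cadence or when utilization efficiency drops.

    Efficiency is unique-information tokens over total context tokens,
    as defined in the metrics table; thresholds are the defaults above.
    """
    if calls_since_compact >= call_interval:
        return True
    if total_tokens and unique_tokens / total_tokens < min_efficiency:
        return True
    return False

print(should_compact(10, 9000, 10000))  # True: hit the 10-call cadence
print(should_compact(3, 6000, 10000))   # True: efficiency 0.60 < 0.70
print(should_compact(3, 8000, 10000))   # False: under cadence, 0.80 efficiency
```

Running this after every tool call makes compaction proactive rather than scheduled, which matters for low-frequency agents where a fixed cadence would fire too late.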
20 turns is a solid default. Adjust based on your model's context window: use 15 turns for 8K models, 30 turns for 32K models, 50 turns for 128K+ models. Always keep system prompt and critical instructions anchored outside the window.
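The sliding window with an anchored system prompt can be sketched as below, assuming an OpenAI-style message list of `{"role": ..., "content": ...}` dicts; pin any other critical instructions the same way.

```python
def sliding_window(messages, max_turns=20):
    """Keep pinned system messages plus the most recent `max_turns` turns.

    System-role messages stay anchored outside the window so instructions
    survive no matter how long the session runs.
    """
    pinned = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return pinned + rest[-max_turns:]

msgs = [{"role": "system", "content": "You are a monitor agent."}]
msgs += [{"role": "user", "content": f"turn {i}"} for i in range(30)]
kept = sliding_window(msgs, max_turns=20)
print(len(kept))             # 21: system prompt + last 20 turns
print(kept[1]["content"])    # "turn 10": turns 0-9 dropped
```

For the model-size tiers above, just vary `max_turns` (15 for 8K, 30 for 32K, 50 for 128K+).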
Checkpoints compress context within a single session. Session splitting creates an entirely new session with a clean context window. Use checkpoints for continuous tasks and splitting for distinct task transitions.
Watch four fields: score (overall health; drops indicate degradation), followup_minutes (increasing values mean context pressure), risk_level (goes to 'high' when context is critically full), and next_action (suggests 'compact_session' or 'split_session' when appropriate).
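Those fields map naturally onto an automated response. A sketch assuming the heartbeat arrives as a dict with the field names above; the `context_action` helper and its precedence order are my assumptions, not Delx behavior.

```python
def context_action(heartbeat, baseline_score):
    """Map DELX_META heartbeat fields to a context-management action.

    Precedence (an assumption): explicit next_action, then risk_level,
    then the 10-point score drop described earlier.
    """
    if heartbeat.get("next_action") in ("compact_session", "split_session"):
        return heartbeat["next_action"]
    if heartbeat.get("risk_level") == "high":
        return "split_session"
    if baseline_score - heartbeat.get("score", baseline_score) >= 10:
        return "compact_session"
    return "continue"

print(context_action({"next_action": "compact_session"}, 90))        # explicit hint wins
print(context_action({"risk_level": "high", "score": 88}, 90))       # split on high risk
print(context_action({"score": 75, "risk_level": "low"}, 90))        # 15-point drop -> compact
```

Wiring this into the heartbeat handler closes the loop: the agent reacts to context pressure before output quality visibly drops.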