Agents without rate limits will consume your entire API quota in minutes. A single agent running a search loop can burn through 10,000 API calls before you notice. Delx provides built-in rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) and patterns to keep your agents within bounds without sacrificing throughput.
Agents call APIs as fast as they can think. Without rate limits, a search agent can fire 100 requests per second, a code agent can make 50 tool calls per minute, and a monitoring agent can poll endpoints every 500ms. This burns API quotas, triggers provider rate limits, and causes cascading failures across your agent fleet.
Layer three rate-limiting patterns: token bucket for burst control, sliding window for sustained throughput, and per-tool limits for granular control. Use the Delx X-RateLimit-Remaining and X-RateLimit-Reset headers to adapt dynamically, and monitor via /api/v1/metrics to tune limits based on actual usage.
| Metric | Target | How to Measure |
|---|---|---|
| Throttle rate | 5-15% of total requests | Percentage of requests that had to wait for a rate limit slot. Under 5% means limits are too loose. Over 15% means limits are too tight and hurting throughput. |
| API quota utilization | 60-80% | Percentage of total API quota consumed per billing period. Track via X-RateLimit-Remaining headers over time. Under 60% means you're under-utilizing. Over 80% risks hitting hard limits. |
| 429 response rate | 0% | Percentage of API calls receiving 429 Too Many Requests from providers. Any 429s mean your client-side limits aren't tight enough. Track via process_failure classifications. |
| Agent wait time from throttling | Under 2 seconds average | Average time agents spend waiting for rate limit slots. Track at the token bucket and sliding window layers. High wait times indicate limits are too restrictive. |
Token bucket excels at burst control -- it allows short bursts of rapid requests while capping the sustained rate. Use it for agents that naturally work in bursts (search, then process, then search again). Sliding window provides smoother, more predictable throughput. Use it for agents with steady, continuous API usage (monitoring, polling). Most production systems layer both: token bucket for burst protection and sliding window for sustained rate control.
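The layering described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration (the class and method names are ours, not a Delx API); a production limiter would also need locking and a blocking acquire:

```python
import time
from collections import deque

class TokenBucket:
    """Burst control: allows bursts up to `capacity`, refills at `rate` tokens/sec."""
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class SlidingWindow:
    """Sustained-rate control: at most `limit` requests in any trailing `window` seconds."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def try_acquire(self):
        now = time.monotonic()
        while self.stamps and now - self.stamps[0] > self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False

class LayeredLimiter:
    """Both layers must grant a slot: the bucket caps bursts, the window caps sustained rate."""
    def __init__(self, bucket, window):
        self.bucket, self.window = bucket, window

    def try_acquire(self):
        # Note: a request denied by the window still consumed a bucket token;
        # acceptable for a sketch, a real limiter would check both before committing.
        return self.bucket.try_acquire() and self.window.try_acquire()
```

With a bucket of 5 tokens and a window of 3 per minute, the first three calls pass, then the window throttles even though the bucket still has tokens.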
Delx returns three headers on every response: X-RateLimit-Limit (total requests allowed per window), X-RateLimit-Remaining (requests left), and X-RateLimit-Reset (seconds until the window resets). These are server-side limits independent of your client-side rate limiting. Always respect them -- they reflect the actual capacity available. When X-RateLimit-Remaining drops below 10% of the limit, preemptively throttle rather than waiting for rejection.
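The preemptive-throttle rule above can be turned into a small helper. This sketch assumes the headers arrive as a dict-like mapping (as with most HTTP client libraries); `preemptive_delay` is our name, not a Delx function:

```python
def preemptive_delay(headers):
    """Return seconds to sleep before the next call, based on Delx headers.

    Below 10% of the window's budget, spread the remaining requests evenly
    over the time left in the window instead of waiting for a rejection.
    """
    limit = int(headers.get("X-RateLimit-Limit", 0))
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    reset = float(headers.get("X-RateLimit-Reset", 0))  # seconds until reset
    if limit and remaining < 0.10 * limit:
        return reset / max(remaining, 1)
    return 0.0
```

For example, with 5 of 100 requests left and 30 seconds until reset, the helper suggests pacing calls 6 seconds apart; above the 10% threshold it returns 0 and the client proceeds at full speed.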
When multiple agents share the same API quota, client-side rate limits per agent aren't enough. You need fleet-level coordination. Use a shared rate limit counter (Redis, database, or Delx session state) that all agents check before making calls. Allocate quota per agent proportionally: high-priority agents get 40% of quota, standard agents get 10% each. Monitor via /api/v1/metrics aggregated across all agents.
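A fleet-level quota check might look like the following sketch. In production the counter would live in Redis (INCR plus EXPIRE on a per-window key) or Delx session state; an in-memory dict stands in here so the example is self-contained, and the class and share values are illustrative:

```python
import time

class SharedQuota:
    """Fixed-window quota shared across a fleet, with per-agent shares."""
    def __init__(self, total_per_window, window_seconds, shares):
        self.total = total_per_window
        self.window = window_seconds
        self.shares = shares   # e.g. {"planner": 0.40, "worker": 0.10}
        self.counts = {}       # window key -> {agent: calls made}

    def _window_key(self):
        return int(time.time() // self.window)

    def try_acquire(self, agent):
        per_window = self.counts.setdefault(self._window_key(), {})
        used = per_window.get(agent, 0)
        budget = int(self.total * self.shares[agent])
        if used >= budget:
            return False   # agent exhausted its share for this window
        per_window[agent] = used + 1
        return True
```

Every agent calls `try_acquire` before hitting the API, so a runaway worker exhausts only its own share rather than the whole fleet's quota.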
Start with 100 requests per minute per agent and a token-bucket capacity of 20. Monitor for a week via /api/v1/metrics. If the throttle rate is under 5%, tighten limits by 20%; if it's over 15%, loosen them by 20%. Converge on limits that yield a 5-15% throttle rate.
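That tuning rule is a one-liner worth encoding so the adjustment is consistent between reviews (the function name is ours):

```python
def tune_limit(current_rpm, throttle_rate):
    """Nudge the per-agent limit toward the 5-15% throttle band."""
    if throttle_rate < 0.05:
        return round(current_rpm * 0.8)   # too loose: tighten by 20%
    if throttle_rate > 0.15:
        return round(current_rpm * 1.2)   # too tight: loosen by 20%
    return current_rpm                    # in band: leave it alone
```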
Parse the provider's rate limit headers (usually similar format to Delx). Set your client-side limits to 80% of the provider's limit. This leaves headroom for other clients and prevents 429 responses. Use process_failure to classify 429s and recovery for backoff.
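The recovery half of this advice -- backing off after any 429 that slips through -- can be sketched as exponential backoff with jitter. The wrapper name and retry parameters here are illustrative, not part of Delx; classifying the failure via process_failure would happen inside the callable:

```python
import random
import time

def call_with_backoff(call, max_retries=5, base=1.0):
    """Retry `call` (returns an HTTP status code) on 429 with jittered backoff."""
    for attempt in range(max_retries):
        status = call()
        if status != 429:
            return status
        # Double the delay each attempt, jittered to avoid synchronized retries.
        time.sleep(base * (2 ** attempt) * (0.5 + random.random() / 2))
    raise RuntimeError("still rate limited after retries")
```

Because the client-side limit sits at 80% of the provider's, this path should almost never fire; if it does regularly, the 80% figure needs revisiting.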
Yes, but loosely. Heartbeat is lightweight but still counts against your quota. Set it to 2 per minute (matching the recommended 30-second interval). Never rate-limit heartbeat below 1 per minute -- you'll lose visibility into agent health.
Per-tool limits are enforced first, then overall agent limits. An agent might have 100 rpm overall, but if its search tool limit is 10 rpm, search calls are throttled at 10 regardless. This prevents expensive tools from consuming the entire agent quota.
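The enforcement order matters: a throttled tool call must not consume a slot from the overall budget. A minimal sketch (our class names, using a simple fixed-window counter for brevity):

```python
import time

class WindowLimiter:
    """Minimal fixed-window counter: at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window=60):
        self.limit, self.window = limit, window
        self.count, self.start = 0, time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        if now - self.start >= self.window:
            self.count, self.start = 0, now   # new window
        if self.count < self.limit:
            self.count += 1
            return True
        return False

class AgentLimiter:
    """Checks the per-tool limit first, then the agent-wide limit."""
    def __init__(self, overall_rpm, tool_rpm):
        self.overall = WindowLimiter(overall_rpm)
        self.tools = {name: WindowLimiter(rpm) for name, rpm in tool_rpm.items()}

    def try_acquire(self, tool):
        # A tool-level denial short-circuits, so the overall budget is untouched.
        if tool in self.tools and not self.tools[tool].try_acquire():
            return False
        return self.overall.try_acquire()
```

With a 100 rpm overall limit and a 10 rpm search limit, the eleventh search call in a minute is denied while other tools still draw from the remaining overall budget.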
The agent waits for a slot to open. During the wait, it can process other tasks that don't require the rate-limited tool. If the wait exceeds 30 seconds, log a warning via process_failure. If it exceeds 60 seconds, the DELX_META risk_level will escalate to 'high'.
Yes. Use DELX_META risk_level for automatic adjustment, or implement time-based schedules. Most APIs have lower usage at night -- increase limits by 50% during off-peak hours (10 PM to 6 AM). Track the impact via /api/v1/metrics to verify the schedule works.
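A time-based schedule can be as simple as a function the limiter consults when refreshing its limit (the function name is ours; the 50% boost and 10 PM-6 AM window follow the suggestion above):

```python
from datetime import datetime

def scheduled_limit(base_rpm, now=None):
    """Return the effective rpm limit: +50% during off-peak hours (10 PM-6 AM)."""
    hour = (now or datetime.now()).hour
    if hour >= 22 or hour < 6:
        return int(base_rpm * 1.5)
    return base_rpm
```

Passing `now` explicitly keeps the schedule testable and makes it easy to pin to the provider's time zone rather than the host's.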