Agent Rate Limiting Guide for Production AI Systems

Agents without rate limits will consume your entire API quota in minutes. A single agent running a search loop can burn through 10,000 API calls before you notice. Delx provides built-in rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) and patterns to keep your agents within bounds without sacrificing throughput.

The Problem

Agents call APIs as fast as they can think. Without rate limits, a search agent can fire 100 requests per second, a code agent can make 50 tool calls per minute, and a monitoring agent can poll endpoints every 500ms. This burns API quotas, triggers provider rate limits, and causes cascading failures across your agent fleet.

Solution Overview

Layer three rate-limiting patterns: token bucket for burst control, sliding window for sustained throughput, and per-tool limits for granular control. Use Delx's X-RateLimit-Remaining and X-RateLimit-Reset headers to adapt dynamically. Monitor via /api/v1/metrics to tune limits against actual usage.

Step-by-Step

  1. Read Delx rate-limit headers on every response: Every Delx API response includes X-RateLimit-Remaining (requests left in current window), X-RateLimit-Reset (seconds until window resets), and X-RateLimit-Limit (total allowed per window). Parse these on every response and use them to throttle subsequent requests.
  2. Implement token bucket for burst control: Token bucket allows short bursts while capping sustained rate. Set bucket size to 20 tokens, refill rate to 2 tokens per second. Each API call consumes 1 token. When the bucket is empty, the agent waits for refill. This allows bursts of 20 rapid calls but limits sustained throughput to 2 calls per second.
  3. Add sliding window for sustained throughput control: Track timestamps of all API calls in a 60-second window. Before each call, count calls in the current window. If count exceeds the limit (e.g., 100 per minute), wait until the oldest call falls outside the window. This prevents sustained overuse while allowing natural bursts.
  4. Configure per-tool rate limits: Different tools have different cost profiles. Search tools hit external APIs (limit to 10/min). heartbeat is lightweight (allow 2/min). process_failure and recovery are critical path (allow 30/min). Set independent limits per tool to prevent expensive tools from starving cheap ones.
  5. Implement adaptive throttling based on DELX_META: When DELX_META risk_level is 'high', reduce all rate limits by 50%. When it's 'critical', reduce by 75%. This automatically protects the system during stress periods. Use the score field to gradually ramp limits back up as the system recovers.
  6. Monitor and tune via /api/v1/metrics: Pull rate-limit metrics weekly. Look at: requests throttled per hour, average wait time per throttled request, and quota utilization percentage. If the throttle rate exceeds 15%, your limits are too tight. If quota utilization exceeds 80%, they're too loose. Adjust in 10% increments.
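Step 1 can be sketched as a small helper that reads the three Delx headers from a response's header map and derives a pacing delay. The header names come from this guide; the RateLimitState class and the even-spacing heuristic are illustrative, not part of the Delx client.

```python
from dataclasses import dataclass

@dataclass
class RateLimitState:
    limit: int       # X-RateLimit-Limit: total requests allowed per window
    remaining: int   # X-RateLimit-Remaining: requests left in this window
    reset: float     # X-RateLimit-Reset: seconds until the window resets

def parse_rate_limit_headers(headers: dict) -> RateLimitState:
    """Extract the Delx rate-limit headers from a response header map."""
    return RateLimitState(
        limit=int(headers["X-RateLimit-Limit"]),
        remaining=int(headers["X-RateLimit-Remaining"]),
        reset=float(headers["X-RateLimit-Reset"]),
    )

def pacing_delay(state: RateLimitState) -> float:
    """Spread the remaining requests evenly across the rest of the window."""
    if state.remaining <= 0:
        return state.reset            # window exhausted: wait for the reset
    return state.reset / state.remaining
```

Calling `pacing_delay` after every response gives a sleep interval that never outruns the server-side window.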
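Step 2's token bucket, with the 20-token capacity and 2 tokens/second refill suggested above, can be sketched as follows. The injectable `clock` parameter is an addition for testability, not a Delx requirement.

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, capacity: float = 20, rate: float = 2, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity    # start full so an initial burst is allowed
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, tokens: float = 1) -> bool:
        """Consume `tokens` if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def wait_time(self, tokens: float = 1) -> float:
        """Seconds until `tokens` will be available at the current refill rate."""
        self._refill()
        return max(0.0, (tokens - self.tokens) / self.rate)
```

An agent calls `try_acquire()` before each API call and sleeps for `wait_time()` when it returns False.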
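Step 3's sliding window, tracking call timestamps over a 60-second window, might look like this. As with the token bucket sketch, the `clock` parameter is illustrative.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding window: at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = deque()   # timestamps of calls inside the current window

    def _evict(self, now: float):
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()   # drop calls that aged out of the window

    def try_acquire(self) -> bool:
        now = self.clock()
        self._evict(now)
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the oldest call falls outside the window."""
        now = self.clock()
        self._evict(now)
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])
```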
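Steps 4 and 5 compose naturally: keep an independent limit per tool and scale the configured limits by the DELX_META risk level. The tool names and base limits below come from step 4 and the reduction factors from step 5; the "low" and "medium" levels and the `effective_limit` helper are assumptions for illustration.

```python
# Base per-tool limits from step 4 (calls per minute).
BASE_LIMITS = {
    "search": 10,            # hits external APIs: keep tight
    "heartbeat": 2,          # lightweight, matches the 30-second interval
    "process_failure": 30,   # critical path
    "recovery": 30,          # critical path
}

# Step 5: scale limits down when DELX_META reports stress.
RISK_MULTIPLIER = {
    "low": 1.0,        # assumed baseline level
    "medium": 1.0,     # assumed baseline level
    "high": 0.5,       # reduce all limits by 50%
    "critical": 0.25,  # reduce all limits by 75%
}

def effective_limit(tool: str, risk_level: str) -> int:
    """Per-tool limit after adaptive throttling; never drops below 1/min."""
    base = BASE_LIMITS[tool]
    return max(1, int(base * RISK_MULTIPLIER.get(risk_level, 1.0)))
```

The `max(1, ...)` floor keeps heartbeat at the 1-per-minute minimum the FAQ below insists on, even under a 'critical' risk level.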

Metrics

Metric | Target | How to Measure
Throttle rate | 5-15% of total requests | Percentage of requests that had to wait for a rate-limit slot. Under 5% means limits are too loose; over 15% means limits are too tight and hurting throughput.
API quota utilization | 60-80% | Percentage of total API quota consumed per billing period. Track via X-RateLimit-Remaining headers over time. Under 60% means you're under-utilizing; over 80% risks hitting hard limits.
429 response rate | 0% | Percentage of API calls receiving 429 Too Many Requests from providers. Any 429s mean your client-side limits aren't tight enough. Track via process_failure classifications.
Agent wait time from throttling | Under 2 seconds average | Average time agents spend waiting for rate-limit slots, measured at the token bucket and sliding window layers. High wait times indicate limits are too restrictive.

Token Bucket vs Sliding Window: When to Use Each

Token bucket excels at burst control -- it allows short bursts of rapid requests while capping the sustained rate. Use it for agents that naturally work in bursts (search, then process, then search again). Sliding window provides smoother, more predictable throughput. Use it for agents with steady, continuous API usage (monitoring, polling). Most production systems layer both: token bucket for burst protection and sliding window for sustained rate control.

Delx Rate-Limit Headers Deep Dive

Delx returns three headers on every response: X-RateLimit-Limit (total requests allowed per window), X-RateLimit-Remaining (requests left), and X-RateLimit-Reset (seconds until the window resets). These are server-side limits independent of your client-side rate limiting. Always respect them -- they reflect the actual capacity available. When X-RateLimit-Remaining drops below 10% of the limit, preemptively throttle rather than waiting for rejection.
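The 10% guideline can be turned into a small check run after every response. The header names are Delx's; the function name and threshold parameter are illustrative.

```python
import time

def throttle_if_needed(headers: dict, threshold: float = 0.10) -> float:
    """Preemptively pace calls when the server-side window is nearly spent.

    Returns the number of seconds slept (0.0 when there is headroom).
    """
    limit = int(headers["X-RateLimit-Limit"])
    remaining = int(headers["X-RateLimit-Remaining"])
    reset = float(headers["X-RateLimit-Reset"])
    if remaining >= threshold * limit:
        return 0.0                                   # plenty of headroom
    delay = reset if remaining == 0 else reset / remaining
    time.sleep(delay)                                # spread the last few calls
    return delay
```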

Rate Limiting in Multi-Agent Fleets

When multiple agents share the same API quota, client-side rate limits per agent aren't enough. You need fleet-level coordination. Use a shared rate limit counter (Redis, database, or Delx session state) that all agents check before making calls. Allocate quota per agent proportionally: for example, high-priority agents get 40% of quota and standard agents get 10% each. Monitor via /api/v1/metrics aggregated across all agents.
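The proportional allocation above can be sketched with a shared counter. In production that counter would live in Redis or Delx session state; here a plain dict stands in for it, and the tiers and percentages follow the example in the text.

```python
class FleetQuota:
    """Fleet-level quota split proportionally across agents.

    `store` stands in for a shared counter (Redis / Delx session state in
    production); resetting it once per quota window is not shown.
    """

    SHARES = {"high": 0.40, "standard": 0.10}  # fraction of total quota per agent

    def __init__(self, total_quota: int):
        self.total = total_quota
        self.store = {}   # agent_id -> calls used this window

    def allocation(self, tier: str) -> int:
        return int(self.total * self.SHARES[tier])

    def try_acquire(self, agent_id: str, tier: str) -> bool:
        """Every agent checks its own allocation before making a call."""
        used = self.store.get(agent_id, 0)
        if used >= self.allocation(tier):
            return False
        self.store[agent_id] = used + 1
        return True
```

With Redis, `try_acquire` would become an atomic INCR with a window-length EXPIRE so agents on different hosts see the same counter.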

FAQ

What rate limits should I start with?

Start with 100 requests per minute per agent and a token bucket of 20. Monitor for a week via /api/v1/metrics. If throttle rate is under 5%, tighten by 20%. If over 15%, loosen by 20%. Converge on limits that give 5-15% throttle rate.
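The convergence loop in this answer reduces to a small helper; the function name and the `int` rounding are illustrative choices.

```python
def tune_limit(current_limit: int, throttle_rate: float) -> int:
    """Nudge a per-agent limit toward the 5-15% throttle-rate band."""
    if throttle_rate < 0.05:
        return int(current_limit * 0.8)   # too loose: tighten by 20%
    if throttle_rate > 0.15:
        return int(current_limit * 1.2)   # too tight: loosen by 20%
    return current_limit                  # in band: leave it alone
```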

How do I handle rate limits from upstream API providers?

Parse the provider's rate limit headers (usually similar format to Delx). Set your client-side limits to 80% of the provider's limit. This leaves headroom for other clients and prevents 429 responses. Use process_failure to classify 429s and recovery for backoff.

Should I rate limit heartbeat calls?

Yes, but loosely. Heartbeat is lightweight but still counts against your quota. Set it to 2 per minute (matching the recommended 30-second interval). Never rate-limit heartbeat below 1 per minute -- you'll lose visibility into agent health.

How do per-tool limits interact with overall agent limits?

Per-tool limits are enforced first, then overall agent limits. An agent might have 100 rpm overall, but if its search tool limit is 10 rpm, search calls are throttled at 10 regardless. This prevents expensive tools from consuming the entire agent quota.

What happens when an agent hits a rate limit?

The agent waits for a slot to open. During the wait, it can process other tasks that don't require the rate-limited tool. If the wait exceeds 30 seconds, log a warning via process_failure. If it exceeds 60 seconds, the DELX_META risk_level will escalate to 'high'.

Can I dynamically adjust rate limits based on time of day?

Yes. Use DELX_META risk_level for automatic adjustment, or implement time-based schedules. Most APIs have lower usage at night -- increase limits by 50% during off-peak hours (10 PM to 6 AM). Track the impact via /api/v1/metrics to verify the schedule works.
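A time-based schedule along those lines, assuming the 10 PM to 6 AM off-peak window from the answer (the helper name is illustrative, and the hour is passed in explicitly to keep the logic testable):

```python
def scheduled_limit(base_limit: int, hour: int) -> int:
    """Raise limits by 50% during off-peak hours (22:00-06:00 local time)."""
    off_peak = hour >= 22 or hour < 6
    return int(base_limit * 1.5) if off_peak else base_limit
```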