The Engine itself imposes one rate limit: one in-flight
`/execute` per `task_id`. Everything else (request rate, token throughput, concurrent requests across tasks) comes from the upstream LLM provider. This page
covers what the limits look like in practice and how to design around
them.
What the Engine enforces
- Per-task concurrency. One `/execute` per `task_id` at a time. A second call returns `409 Conflict` until the first finishes (or its channel state TTL expires); see the sketch after this list.
- Replay is read-only. `GET /execute/replay` doesn’t take a slot.
- `CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY` (default 10) caps how many tools run simultaneously inside one turn. It doesn’t affect the number of in-flight tasks.
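A minimal client-side sketch of the per-task gate. The base URL and request payload are placeholders; only the `/execute` path, `task_id`, and the `409 Conflict` behavior come from this page:

```python
import requests

BASE_URL = "http://localhost:8080"  # placeholder; point at your Engine deployment

def execute(task_id: str, payload: dict) -> dict:
    """POST /execute for one task, surfacing the per-task concurrency gate."""
    resp = requests.post(f"{BASE_URL}/execute", json={"task_id": task_id, **payload})
    if resp.status_code == 409:
        # A call for this task_id is already in flight. Wait for it to finish
        # (or for its channel state TTL to expire) before trying again.
        raise RuntimeError(f"task {task_id} already has an in-flight /execute")
    resp.raise_for_status()
    return resp.json()
```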
What the provider enforces
Each LLM provider has tiered rate limits expressed as:
- Requests per minute (RPM).
- Input tokens per minute (ITPM).
- Output tokens per minute (OTPM).
How rate-limit errors look
When the provider rate-limits the Engine, it surfaces as `LLM_RATE_LIMITED`:
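The exact envelope depends on your deployment. A representative shape, shown as a parsed Python dict; only `code` and `details.retry_after` are documented on this page, the other fields are illustrative:

```python
# Parsed error body from a rate-limited call. Only "code" and
# "details.retry_after" are documented here; other fields are illustrative.
error = {
    "code": "LLM_RATE_LIMITED",
    "message": "upstream provider returned 429",
    "details": {
        "retry_after": 12,  # seconds; present when the provider sent Retry-After
    },
}
```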
Retry strategy
Exponential backoff with jitter
On `LLM_RATE_LIMITED`, back off exponentially between attempts and add random jitter so concurrent tasks don’t retry in lockstep; a combined sketch follows below.
Honor the provider’s Retry-After
Some provider responses include a `Retry-After` header (seconds). When present, wait at least that long before the next attempt. The Engine surfaces the value in `details.retry_after` when available.
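A sketch tying the two rules together: back off exponentially with jitter, but never wait less than `details.retry_after` when it is present. `do_call` and the parsed-body shape are assumptions matching the example above:

```python
import random
import time

def call_with_retries(do_call, max_elapsed: float = 30.0) -> dict:
    """Retry LLM_RATE_LIMITED responses with exponential backoff + jitter,
    waiting at least details.retry_after whenever it is present."""
    start = time.monotonic()
    delay = 1.0
    while True:
        result = do_call()
        if result.get("code") != "LLM_RATE_LIMITED":
            return result
        retry_after = (result.get("details") or {}).get("retry_after") or 0.0
        wait = max(retry_after, delay * random.uniform(0.5, 1.5))
        if time.monotonic() - start + wait > max_elapsed:
            # Past ~30 seconds this is tier exhaustion, not a spike.
            raise RuntimeError("still rate-limited; throttle upstream instead")
        time.sleep(wait)
        delay = min(delay * 2, 16.0)  # cap the backoff step
```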
Don’t retry forever
If you keep retrying past 30 seconds, you’re not waiting out a temporary spike; you’ve exceeded your tier. Either:
- Throttle upstream. Send fewer requests (see the sketch after this list).
- Upgrade your tier.
- Spread load. Spin up additional Engine deployments with their own provider keys.
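One way to throttle upstream is a client-side concurrency cap. The slot count here is illustrative; size it against your tier:

```python
import asyncio

# Client-side cap on concurrent Engine tasks; 5 is illustrative,
# size it against your tier's RPM/ITPM headroom.
ENGINE_SLOTS = asyncio.Semaphore(5)

async def run_throttled(task_id: str, execute_once) -> dict:
    async with ENGINE_SLOTS:
        return await execute_once(task_id)
```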
Designing for the limit
Quote-and-degrade
For latency-sensitive applications, quote the user a budget: “This task usually takes 30 seconds and 10K tokens. I’ll start now.” If the budget overruns, fall back to a degraded mode (cheaper model, shorter response) rather than waiting indefinitely.
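A sketch of the fallback, assuming an async `run_task` callable; the `model` and `max_tokens` parameters are illustrative:

```python
import asyncio

async def answer(prompt: str, run_task, budget_s: float = 30.0) -> str:
    """Quote a budget up front; degrade instead of waiting indefinitely."""
    try:
        return await asyncio.wait_for(run_task(prompt, model="heavy"), timeout=budget_s)
    except asyncio.TimeoutError:
        # Degraded mode: cheaper model, shorter response.
        return await run_task(prompt, model="light", max_tokens=500)
```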
Prefer fewer, bigger calls
ITPM is per-minute. One 50K-token call uses 50K of your token budget for the minute. Five 10K-token calls also use 50K, but with more RPM overhead. Cache hits help: the Engine emits `usage.cache_hit_tokens` so you can see what’s covered by cache.
Spread tasks across keys
If you serve many users, give each a separate provider key (or rotate across a pool, as sketched below). Provider tiers are per-key. Five keys at the same tier give you 5× the headroom of one key.
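A minimal rotation sketch; the key names are placeholders, and how the selected key reaches the Engine (per deployment or per request) depends on your setup:

```python
from itertools import cycle

# Round-robin over a pool of provider keys. Tiers are per-key, so five
# keys at the same tier give 5x the headroom of one.
KEY_POOL = cycle(["key-a", "key-b", "key-c", "key-d", "key-e"])  # placeholders

def next_provider_key() -> str:
    return next(KEY_POOL)
```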
Use the cheapest model that works
A turn that’s 30K tokens of input on Sonnet 4.5 and a turn that’s 30K tokens on a smaller model both eat 30K of your input budget, but the smaller model is faster, so it returns the slot sooner and your effective throughput is higher. For agents where the heavy model only matters on the final synthesis, use:
- `LLM_MODEL=claude-sonnet-4-7` for the main loop.
- `PLANNER_MODEL=...` for the planner sub-agent.
- `LIGHT_MODEL=...` for compaction and summarization.
Concurrent task patterns
The Engine doesn’t impose its own concurrent-task limit, but practical limits apply:
- Provider RPM. With 10 concurrent tasks each making 10 calls per minute, you need 100+ RPM headroom.
- Memory pressure. Each task holds context in process memory. Hundreds of concurrent tasks against one Engine container start to feel it.
- SQLite contention. The brain database serializes writes. Many concurrent tasks all writing memory at once can block each other.
Cache as a multiplier
Prompt caching is the single biggest lever for staying under rate limits. A cached prefix doesn’t count against ITPM the same way. To benefit (see the sketch after this list):
- Keep the system prompt stable.
- Keep tool catalogs stable across the session.
- Don’t include per-call data in the cacheable prefix.
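A sketch of the ordering rule with a generic message list (not a specific provider SDK):

```python
def build_messages(system_prompt: str, tool_catalog: str, user_input: str) -> list:
    # Cacheable prefix: byte-identical across every call in the session.
    prefix = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": tool_catalog},
    ]
    # Per-call data goes last so it never invalidates the cached prefix.
    return prefix + [{"role": "user", "content": user_input}]
```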
Monitoring
Watch:
- `LLM_RATE_LIMITED` rate. If this is non-zero in steady state, you’re undersized.
- Cache hit ratio (from `usage.cache_hit_tokens`; see the helper after this list). Below 50% on long agent loops means something is invalidating cache between turns.
- P99 latency. Rate-limit retries spike P99. If the median is fine but P99 is bad, you’re hitting limits intermittently.
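A helper for the cache hit ratio. `usage.cache_hit_tokens` is documented above; the total-input field name is an assumption about your deployment:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Share of input tokens served from cache for one turn.
    usage.cache_hit_tokens is documented; the total-input field name
    ("input_tokens") is an assumption about your deployment."""
    total = usage.get("input_tokens", 0)  # assumed field name
    return usage.get("cache_hit_tokens", 0) / total if total else 0.0
```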
See also
- Error handling — retry strategy details.
- Cost and latency — the broader picture.
- Cache hit monitoring — the `CACHE_HIT_MONITOR_ENABLED` flag.

