The Engine itself imposes one rate limit: one in-flight /execute per task_id. Everything else — request rate, token throughput, concurrent requests across tasks — comes from the upstream LLM provider. This page covers what the limits look like in practice and how to design around them.

What the Engine enforces

Per-task concurrency:
  • One /execute per task_id at a time. A second call returns 409 Conflict until the first finishes (or its channel state TTL expires).
  • Replay is read-only. GET /execute/replay doesn’t take a slot.
Per-deployment concurrency:
  • CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY (default 10) caps how many tools run simultaneously inside one turn. Doesn’t affect the number of in-flight tasks.
Aside from those, the Engine accepts as many concurrent tasks as your provider quota allows.
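
The per-task rule can be mirrored client-side so your dispatcher never triggers the 409 in the first place. A minimal sketch of the semantics only; the class name and structure are illustrative, not the Engine's implementation:

```python
import threading

class PerTaskGate:
    """Mimics the Engine's rule: one in-flight /execute per task_id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, task_id):
        with self._lock:
            if task_id in self._in_flight:
                return False  # caller should treat this like a 409 Conflict
            self._in_flight.add(task_id)
            return True

    def release(self, task_id):
        with self._lock:
            self._in_flight.discard(task_id)
```

Call `try_acquire` before dispatching and `release` when the call finishes; a `False` return means wait or queue, exactly as you would on a 409.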

What the provider enforces

Each LLM provider has tiered rate limits expressed as:
  • Requests per minute (RPM).
  • Input tokens per minute (ITPM).
  • Output tokens per minute (OTPM).
You’ll hit ITPM first on most agentic workloads — agents have long contexts. RPM matters more for high-fan-out, short-context use cases. OTPM is rarely the binding constraint. Specific numbers depend on your provider tier. Check the provider’s console.

How rate-limit errors look

When the provider rate-limits the Engine, it surfaces as LLM_RATE_LIMITED:
{ "error": "LLM_RATE_LIMITED", "message": "Provider rate limit exceeded. Retry after N seconds." }
In an SSE stream, you’ll see:
event: error
data: { "error": "LLM_RATE_LIMITED", ... }
The stream then closes.
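
Client code can watch the stream for this terminal event. A minimal sketch, assuming the SSE stream has already been decoded into text lines; the function name is illustrative:

```python
import json

def scan_sse(lines):
    """Return the error payload if the stream carries an
    `event: error` / LLM_RATE_LIMITED pair, else None."""
    event = None
    for line in lines:
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:") and event == "error":
            payload = json.loads(line.split(":", 1)[1])
            if payload.get("error") == "LLM_RATE_LIMITED":
                return payload
    return None
```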

Retry strategy

Exponential backoff with jitter

import random
import time

def call_with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:  # your client's rate-limit exception
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface to the caller
            base = 2 ** attempt            # 1s, 2s, 4s, 8s, ...
            jitter = random.uniform(0, 1)  # desynchronize concurrent retries
            time.sleep(base + jitter)
Start at 1 second, double each attempt, and add jitter so concurrent clients don't retry in lockstep. Five attempts means four waits, roughly 15–20 seconds in total. After that, surface the error to the user.

Honor the provider’s Retry-After

Some provider responses include a Retry-After header (seconds). When present, wait at least that long before the next attempt. The Engine surfaces the value in details.retry_after when available.
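
Combining the header with the backoff schedule, one way to pick the delay (a sketch; `retry_after` would be parsed from the `Retry-After` header or `details.retry_after` when present):

```python
def backoff_delay(attempt, retry_after=None, jitter=0.5):
    """Exponential backoff, but never shorter than the
    provider's requested Retry-After when one was given."""
    base = 2 ** attempt
    if retry_after is not None:
        return max(retry_after, base) + jitter
    return base + jitter
```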

Don’t retry forever

If you keep retrying past 30 seconds, you’re not waiting out a temporary spike — you’ve exceeded your tier. Either:
  • Throttle upstream. Send fewer requests.
  • Upgrade your tier.
  • Spread load. Spin up additional Engine deployments with their own provider keys.

Designing for the limit

Quote-and-degrade

For latency-sensitive applications, quote the user a budget:
“This task usually takes 30 seconds and 10K tokens. I’ll start now.”
If the budget overruns, fall back to a degraded mode (cheaper model, shorter response) rather than waiting indefinitely.
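
A sketch of the pattern, where `primary` and `fallback` are hypothetical callables for the full-quality and degraded paths:

```python
import time

def run_with_budget(primary, fallback, budget_seconds):
    """Try the full-quality call; if the time budget is blown
    (e.g. by rate-limit retries), degrade instead of waiting."""
    start = time.monotonic()
    try:
        return primary(deadline=start + budget_seconds)
    except TimeoutError:
        return fallback()  # cheaper model / shorter response
```

The key design choice is that the deadline is decided up front, when you quote the user, not after the retries have already piled up.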

Prefer fewer, bigger calls

ITPM is per-minute. One 50K-token call uses your 50K-token budget for the minute. Five 10K-token calls also use 50K, but with more RPM overhead. Cache hits help — the Engine emits usage.cache_hit_tokens so you can see what’s covered by cache.
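
The arithmetic, as a sketch. Treating cached prefix tokens as free is a simplification; check how your provider actually counts cached input against ITPM:

```python
def minute_budget_use(calls, tokens_per_call, cache_hit_tokens=0):
    """Tokens of ITPM consumed in one minute, with cached
    prefix tokens (simplistically) not counted."""
    return calls * (tokens_per_call - cache_hit_tokens)

# One 50K call and five 10K calls both consume 50K ITPM,
# but the five calls also burn 5x the RPM.
```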

Spread tasks across keys

If you serve many users, give each a separate provider key (or rotate across a pool). Provider tiers are per-key. Five keys at the same tier give you 5× the headroom of one key.
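
A minimal rotation sketch using a round-robin pool; key management and per-key error handling are omitted:

```python
import itertools

def key_pool(keys):
    """Round-robin over a pool of provider keys. Tiers are
    per-key, so N keys give roughly N times the headroom."""
    return itertools.cycle(keys)
```

A smarter pool would also park a key for the duration of its Retry-After after a rate-limit error, but round-robin alone already spreads steady-state load evenly.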

Use the cheapest model that works

A turn that’s 30K tokens of input on Sonnet 4.5 and a turn that’s 30K tokens on a smaller model both eat 30K of your input budget — but the smaller model is faster, so it returns the slot sooner, and your effective throughput is higher. For agents where the heavy model only matters on the final synthesis, use:
  • LLM_MODEL=claude-sonnet-4-5 for the main loop.
  • PLANNER_MODEL=... for the planner sub-agent.
  • LIGHT_MODEL=... for compaction and summarization.

Concurrent task patterns

The Engine doesn’t impose its own concurrent-task limit, but practical limits apply:
  • Provider RPM. With 10 concurrent tasks each making 10 calls per minute, you need 100+ RPM headroom.
  • Memory pressure. Each task holds context in process memory. Hundreds of concurrent tasks against one Engine container start to feel it.
  • SQLite contention. The brain database serializes writes. Many concurrent tasks all writing memory at once can block each other.
For high-fan-out workloads, scale by running more Engines (each with their own brain), not by stuffing more tasks into one.
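
Bounding fan-out client-side keeps a burst from blowing straight through provider RPM. A sketch using an asyncio semaphore; the names are illustrative:

```python
import asyncio

async def run_bounded(tasks, limit=10):
    """Run task coroutines with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def one(task):
        async with sem:
            return await task()

    return await asyncio.gather(*(one(t) for t in tasks))
```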

Cache as a multiplier

Prompt caching is the single biggest lever for staying under rate limits. A cached prefix doesn’t count against ITPM the same way. To benefit:
  • Keep the system prompt stable.
  • Keep tool catalogs stable across the session.
  • Don’t include per-call data in the cacheable prefix.
A well-cached agent can run 5–10× more turns per ITPM budget than a poorly-cached one.
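
The three rules above amount to one thing: assemble the prompt so the prefix is byte-stable across turns. A sketch; the message shape is illustrative, not a specific provider's API:

```python
def build_prompt(system_prompt, tool_catalog, user_message):
    """Stable, cacheable prefix first; per-call data only in the suffix."""
    prefix = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": tool_catalog},
    ]
    suffix = [{"role": "user", "content": user_message}]
    return prefix + suffix
```

Anything that varies per call (timestamps, user data, retrieved context) belongs in the suffix; a single changed byte in the prefix invalidates the cache for the whole session.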

Monitoring

Watch:
  • LLM_RATE_LIMITED rate. If this is non-zero in steady state, you’re undersized.
  • Cache hit ratio (from usage.cache_hit_tokens). Below 50% on long agent loops means something is invalidating cache between turns.
  • P99 latency. Rate-limit retries spike P99. If the median is fine but P99 is bad, you’re hitting limits intermittently.
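
The first two metrics are easy to derive from usage data. A sketch using the nearest-rank percentile method; sampling-window management is omitted:

```python
import math

def cache_hit_ratio(cache_hit_tokens, input_tokens):
    """Fraction of input tokens served from cache."""
    return cache_hit_tokens / input_tokens if input_tokens else 0.0

def p99(latencies_ms):
    """Nearest-rank P99 over a window of request latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 0.99)
    return ordered[rank - 1]
```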
For instrumentation, see Observability.

See also