Every production integration with the Engine needs a story for failures. The model returns an error. The provider rate-limits. The MCP server’s OAuth token expired. Your client lost the SSE stream. This page covers how the Engine surfaces failures and what to do about each kind.

Where errors come from

There are five sources of failure:
  1. Engine itself. Misconfiguration (bad env var, missing migration), internal bug, resource exhaustion. Returns 4xx/5xx HTTP.
  2. LLM provider. Rate limits, transient failures, content policy refusals, region issues. Surfaces as LLM_* errors.
  3. Tools. Sandbox-rejected commands, MCP server errors, expired credentials. Surfaces as tool_result.error events.
  4. HITL timeouts. The agent paused for an answer that never came. The stream stays open; the loop doesn’t progress.
  5. Network and client. Lost SSE connections, slow consumers, broken reverse proxies.
Each has its own retry strategy.

Error envelope

For non-2xx HTTP responses, the body is:
{
  "error": "<stable_code>",
  "message": "<human description>",
  "details": { /* optional */ }
}
For in-stream errors, the SSE event is:
event: error
data: { "error": "<stable_code>", "message": "...", "details": {} }
Use error for routing logic; message is human-readable but not stable.
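Routing on the stable code can be as simple as a lookup. A minimal sketch, assuming the envelope shape above; the `RETRYABLE` set is illustrative, drawn from the table in this page rather than the full catalog:

```python
import json

# Codes worth retrying automatically (assumption: based on the table on this
# page; extend from the full catalog on the Errors page).
RETRYABLE = {"KEY_ROTATION_PENDING", "LLM_RATE_LIMITED",
             "LLM_PROVIDER_ERROR", "MCP_CONNECTION_FAILED"}

def classify(body: str) -> tuple[str, bool]:
    """Parse an error envelope and decide whether to retry.

    Routes on the stable `error` code only; `message` is for humans.
    """
    payload = json.loads(body)
    code = payload["error"]
    return code, code in RETRYABLE

code, retry = classify('{"error": "LLM_RATE_LIMITED", "message": "slow down"}')
# code is "LLM_RATE_LIMITED"; retry is True
```

The same function works for both non-2xx bodies and in-stream `error` event payloads, since they share the envelope.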

Common errors and what to do

| Code | Status | Retry? | Strategy |
| --- | --- | --- | --- |
| INVALID_KEY | 401 | No | Fix the key. |
| KEY_ROTATION_PENDING | 401 | Yes | Retry with the new key. |
| TASK_NOT_FOUND | 404 | No | Don’t reference dead task IDs. |
| HITL_REQUIRED | 409 | No | Respond to the HITL request first. |
| LLM_RATE_LIMITED | 429 | Yes | Exponential backoff, starting at 1s. |
| LLM_PROVIDER_ERROR | 502 | Sometimes | Retry once. If still failing, surface to user. |
| CONTEXT_OVERFLOW | 422 | No | Start a new task or trim. |
| PERMISSION_DENIED | 403 | No | Surface to user; don’t auto-retry. |
| MCP_CONNECTION_FAILED | 502 | Sometimes | Retry once; otherwise surface. |
| MCP_TOKEN_EXPIRED | 401 | No | Reconnect via /assets/connect. |
| MIGRATION_REQUIRED | 503 | No | Engine needs ops attention. |
For the full code catalog see Errors.

Retry patterns

Exponential backoff for rate limits

import random
import time

def with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate to the caller
            # 1s, 2s, 4s, ... plus up to 1s of jitter
            time.sleep((2 ** attempt) + random.random())
Start at ~1s, double each retry, add jitter. After 5 attempts (roughly 15-20 seconds of waiting in total), give up.

Retry-once for transient provider errors

LLM_PROVIDER_ERROR covers everything from “the provider had a network hiccup” to “the model refused to respond.” A single retry catches the transient case; a refusal is permanent, and more retries won’t change it.
try:
    response = execute(message, task_id)
except ProviderError:
    response = execute(message, task_id)  # one retry
If both fail, the model is genuinely refusing or the provider is genuinely down. Surface to the user; don’t keep hammering.

Reconnect for stream drops

If your SSE stream drops mid-turn, reconnect with replay:
last_ts = 0  # timestamp of the last event we processed

while True:
    try:
        for event in stream_execute(task_id, after=last_ts):
            handle(event)
            last_ts = event["timestamp"]
    except ConnectionError:
        time.sleep(1)  # brief pause, then reconnect and replay from last_ts
        continue
    else:
        break  # stream ended cleanly
The Engine kept running while you were disconnected. Replay catches you up.

Don’t retry on permission errors

PERMISSION_DENIED means the user said no (or the static policy did). Retrying produces another denial. Surface the denial to the user; let them decide what to do.

Tool errors

Tool errors arrive as tool_result.error, not as stream-ending errors. The agent loop continues — the model gets the error in context and decides how to react.
event: tool_result
data: {
  "tool": "read_file",
  "tool_call_id": "tu_...",
  "output": "",
  "error": "File not found: /data/missing.md"
}
Three patterns the model uses to handle tool errors:
  1. Re-read with corrected input. “Oh, the file is at a different path — let me try again.”
  2. Switch tools. “I can’t read it directly; let me grep for it.”
  3. Ask the user. “I can’t find this file. Where does it live?”
You don’t usually need to do anything about tool errors at the application level — they’re routine. The exceptions are MCP errors:

MCP connection failures

When an MCP server is down, all calls to it fail. The circuit breaker in connection_manager.py will eventually disconnect the server to stop hammering. From your application, watch for MCP_CONNECTION_FAILED and prompt the user to reconnect through /assets/connect.
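A sketch of that watch, assuming the in-stream error envelope described above; `notify_reconnect` is a hypothetical application callback you supply, not an Engine API:

```python
# Illustrative helper: spot MCP connection failures in the event stream and
# tell the UI to offer a reconnect. Event shape follows the envelopes on
# this page; `notify_reconnect` is an assumed application callback.
def check_mcp_failure(event, notify_reconnect):
    if event.get("event") != "error":
        return False
    if event["data"].get("error") == "MCP_CONNECTION_FAILED":
        notify_reconnect("/assets/connect")  # prompt the user to reconnect
        return True
    return False
```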

MCP credential expiry

MCP_TOKEN_EXPIRED means the OAuth token expired and refresh failed. The user has to re-authorize. Reconnecting through /assets/connect starts a fresh OAuth flow.

HITL timeouts

When the agent emits an hitl_request, the stream stays open and the loop doesn’t progress until you respond. There’s no built-in timeout — the channel state TTL (CHANNEL_STATE_TTL_HITL, default 72 hours) governs how long the loop survives. After the TTL expires, the next interaction with that task starts fresh. The HITL request is gone. For application UX:
  • Surface HITL requests prominently. They’re blocking.
  • If the user closes the app and reopens, replay events from the last timestamp — the HITL request comes back.
  • Provide a “cancel” path that resolves the HITL with a default answer (usually “no”) so the agent can finish or abort cleanly.
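For the replay case, one sketch: scan replayed events for HITL requests that were never resolved. The `hitl_response` event name and `request_id` field are assumptions here; adapt them to the actual event schema:

```python
def pending_hitl(events):
    """Return HITL requests from a replayed event list that were never
    resolved, so the UI can re-surface them after a reconnect.

    Illustrative: assumes `hitl_request` / `hitl_response` event names
    sharing a `request_id` field.
    """
    open_requests = {}
    for ev in events:
        if ev["event"] == "hitl_request":
            open_requests[ev["data"]["request_id"]] = ev["data"]
        elif ev["event"] == "hitl_response":
            open_requests.pop(ev["data"]["request_id"], None)
    return list(open_requests.values())
```

Anything this returns should go straight to the foreground of your UI, since the loop is blocked on it.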

Graceful degradation

When something is broken, your application has options for what to do:

Fall back to a cheaper model

For non-critical tasks, switch to a cheaper model on rate limits:
try:
    return execute_against("primary-engine", message, task_id)
except RateLimited:
    return execute_against("light-engine", message, task_id)
Run two Engines, one with LLM_MODEL=claude-sonnet-4-5 and one with a smaller model.

Fall back to a previous answer

For chat-style applications, if the model is degraded, surface the previous turn’s content with an “I’m having trouble, try again shortly” notice. Better than a spinner that never resolves.
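A minimal sketch of that fallback cache; the class and its API are illustrative, not part of the Engine:

```python
class LastGoodCache:
    """Keep the last successful reply per task so the UI can degrade
    gracefully instead of showing an unresolved spinner. Illustrative
    application-side helper, not an Engine API."""

    def __init__(self):
        self._cache = {}

    def store(self, task_id, content):
        # Call after each successful turn.
        self._cache[task_id] = content

    def fallback(self, task_id, notice="I'm having trouble, try again shortly."):
        # Call when the current turn fails after retries are exhausted.
        content = self._cache.get(task_id)
        if content is None:
            return notice
        return f"{content}\n\n({notice})"
```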

Queue and retry asynchronously

If the user’s task isn’t time-critical, queue the request when the Engine is unhealthy and retry in the background. Many integration flows don’t need an immediate response.
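A minimal background-queue sketch, assuming you supply `send` (your function that calls the Engine); failed jobs go back on the queue after a delay:

```python
import queue
import threading
import time

def start_retry_worker(send, jobs: "queue.Queue", delay=5.0):
    """Drain `jobs` in the background, re-queueing failures.

    Illustrative: `send` is your Engine-calling function; a `None` job is
    the shutdown sentinel. Real code would cap retries per job.
    """
    def worker():
        while True:
            job = jobs.get()
            if job is None:          # sentinel: shut down
                return
            try:
                send(job)
            except Exception:
                time.sleep(delay)    # Engine still unhealthy; try later
                jobs.put(job)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```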

Degrade tool features

Some agents use 20+ tools. If MCP servers are flaky, the agent can still function with the platform tools alone. Detect MCP unavailability upstream and remove those tools from the catalog before issuing the request.
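A sketch of that filter, assuming each catalog entry records where the tool comes from; the `source` and `server` field names are illustrative:

```python
# Illustrative: strip tools backed by unhealthy MCP servers before issuing
# the request, keeping platform tools intact. Field names are assumptions.
def usable_tools(catalog, healthy_servers):
    return [
        t for t in catalog
        if t.get("source") != "mcp" or t.get("server") in healthy_servers
    ]
```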

Logging errors

Capture, in your application logs:
  • The error code.
  • The task_id.
  • The HTTP status (if applicable).
  • Timing — when the error fired relative to the request.
  • Retry count.
Don’t log the request body verbatim; it may contain user data. Log a hash or a redacted version.
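The fields above can be packed into one structured record; a sketch with illustrative field names, hashing the body rather than logging it:

```python
import hashlib
import time

def error_record(*, code, task_id, status=None, started_at, retry_count, body=""):
    """Build a structured, redacted error log record.

    Illustrative field names; the request body is hashed, never logged
    verbatim, so user data stays out of the logs.
    """
    return {
        "error": code,
        "task_id": task_id,
        "status": status,
        "elapsed_s": round(time.time() - started_at, 3),
        "retry_count": retry_count,
        "body_sha256": hashlib.sha256(body.encode()).hexdigest(),
    }
```

Emit the record through whatever structured logger your application already uses.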

See also