Every production integration with the Engine needs a story for failures. The model returns an error. The provider rate-limits. The MCP server’s OAuth token expired. Your client lost the SSE stream. This page covers how the Engine surfaces failures and what to do about each kind.
## Where errors come from

There are five sources of failure:

- **Engine itself.** Misconfiguration (bad env var, missing migration), internal bug, resource exhaustion. Returns 4xx/5xx HTTP.
- **LLM provider.** Rate limits, transient failures, content policy refusals, region issues. Surfaces as `LLM_*` errors.
- **Tools.** Sandbox-rejected commands, MCP server errors, expired credentials. Surfaces as `tool_result.error` events.
- **HITL timeouts.** The agent paused for an answer that never came. The stream stays open; the loop doesn’t progress.
- **Network and client.** Lost SSE connections, slow consumers, broken reverse proxies.
## Error envelope

For non-2xx HTTP responses, the body is a JSON envelope. Use `error` for routing logic; `message` is human-readable but not stable.
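As a sketch, assuming only the two fields described above, a 429 response body might look like this (the exact shape may carry more fields; check the error catalog):

```json
{
  "error": "LLM_RATE_LIMITED",
  "message": "Provider rate limit hit; retry after backoff."
}
```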
## Common errors and what to do

| Code | Status | Retry? | Strategy |
|---|---|---|---|
| `INVALID_KEY` | 401 | No | Fix the key. |
| `KEY_ROTATION_PENDING` | 401 | Yes | Retry with the new key. |
| `TASK_NOT_FOUND` | 404 | No | Don’t reference dead task IDs. |
| `HITL_REQUIRED` | 409 | No | Respond to the HITL request first. |
| `LLM_RATE_LIMITED` | 429 | Yes | Exponential backoff, starting at 1s. |
| `LLM_PROVIDER_ERROR` | 502 | Sometimes | Retry once. If still failing, surface to the user. |
| `CONTEXT_OVERFLOW` | 422 | No | Start a new task or trim context. |
| `PERMISSION_DENIED` | 403 | No | Surface to the user; don’t auto-retry. |
| `MCP_CONNECTION_FAILED` | 502 | Sometimes | Retry once; otherwise surface. |
| `MCP_TOKEN_EXPIRED` | 401 | No | Reconnect via `/assets/connect`. |
| `MIGRATION_REQUIRED` | 503 | No | Engine needs ops attention. |
## Retry patterns

### Exponential backoff for rate limits
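A minimal client-side sketch. The `RateLimited` exception and the injectable `sleep` parameter are illustrative assumptions, not Engine SDK names:

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical exception your client raises on LLM_RATE_LIMITED (429)."""

def call_with_backoff(fn, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Retry `fn` on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller surface it
            delay = min(cap, base * 2 ** attempt)    # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay))  # jitter avoids thundering herds
```

The jitter term matters when many clients hit the same limit at once; without it they all retry in lockstep.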
### Retry-once for transient provider errors
LLM_PROVIDER_ERROR covers everything from “the provider had a network
hiccup” to “the model refused to respond.” A single retry catches the
first; the second is permanent.
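A sketch of that policy, with `ProviderError` as an assumed client-side exception for `LLM_PROVIDER_ERROR`:

```python
class ProviderError(Exception):
    """Hypothetical exception for LLM_PROVIDER_ERROR (502)."""

def retry_once(fn):
    """Call `fn`; on a provider error, retry exactly once.

    The first failure may be a transient network hiccup; if the
    second call also fails, treat it as permanent and re-raise so
    the application can surface it to the user.
    """
    try:
        return fn()
    except ProviderError:
        return fn()  # a second ProviderError propagates to the caller
```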
### Reconnect for stream drops
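A reconnect loop with replay might look like the sketch below. `connect(since)` stands in for however your client opens the task’s event stream with a replay cursor, and the event shape (`ts`, `type`) is an assumption, not the Engine’s actual schema:

```python
def stream_with_replay(connect, last_ts=None):
    """Yield events from an SSE stream, reconnecting with replay on drops.

    `connect(since)` is a placeholder for opening the task's event
    stream replayed from timestamp `since`; check the Engine's
    streaming docs for the real replay parameter.
    """
    while True:
        try:
            for event in connect(last_ts):
                last_ts = event["ts"]  # remember how far we got
                yield event
                if event["type"] == "done":
                    return
        except ConnectionError:
            continue  # reconnect and replay from last_ts
```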
If your SSE stream drops mid-turn, reconnect and replay events from your last received timestamp so you don’t lose mid-turn output.

### Don’t retry on permission errors

`PERMISSION_DENIED` means the user said no (or the static policy did). Retrying just produces another denial. Surface the denial to the user and let them decide what to do.
## Tool errors

Tool errors arrive as `tool_result.error` events, not as stream-ending errors.
The agent loop continues — the model gets the error in context and
decides how to react.
- Re-read with corrected input. “Oh, the file is at a different path — let me try again.”
- Switch tools. “I can’t read it directly; let me grep for it.”
- Ask the user. “I can’t find this file. Where does it live?”
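Because the loop keeps running, tool errors are something to observe rather than retry from the client. A sketch that collects them from a turn’s events for logging or UI (field names here are assumptions, not the Engine’s actual event schema):

```python
def collect_tool_errors(events):
    """Pull tool_result.error events out of a turn for logging/UI.

    These are informational: the agent already saw the error in
    context and decided how to react.
    """
    return [
        (ev.get("tool"), ev["error"])
        for ev in events
        if ev.get("type") == "tool_result" and ev.get("error")
    ]
```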
## MCP connection failures

When an MCP server is down, all calls to it fail. The circuit breaker in `connection_manager.py` will eventually disconnect the server to stop hammering it. From your application, watch for `MCP_CONNECTION_FAILED` and prompt the user to reconnect through `/assets/connect`.
## MCP credential expiry

`MCP_TOKEN_EXPIRED` means the OAuth token expired and refresh failed. The user has to re-authorize. Reconnecting through `/assets/connect` starts a fresh OAuth flow.
## HITL timeouts

When the agent emits an `hitl_request`, the stream stays open and the loop doesn’t progress until you respond. There’s no built-in timeout; the channel state TTL (`CHANNEL_STATE_TTL_HITL`, default 72 hours) governs how long the loop survives.
After the TTL expires, the next interaction with that task starts fresh.
The HITL request is gone.
For application UX:
- Surface HITL requests prominently. They’re blocking.
- If the user closes the app and reopens, replay events from the last timestamp — the HITL request comes back.
- Provide a “cancel” path that resolves the HITL with a default answer (usually “no”) so the agent can finish or abort cleanly.
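The cancel path from the last bullet can be as simple as resolving every outstanding request with the default answer. Here `respond` is a stand-in for whatever call your Engine exposes for answering an `hitl_request`; it is not a documented endpoint:

```python
def cancel_pending_hitl(pending_requests, respond, default_answer="no"):
    """Resolve outstanding HITL requests with a safe default.

    Answering "no" unblocks the loop so the agent can finish or
    abort cleanly instead of waiting out the 72-hour TTL.
    """
    for req in pending_requests:
        respond(req["id"], default_answer)
    return len(pending_requests)
```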
## Graceful degradation

When something is broken, your application has options for what to do.

### Fall back to a cheaper model
For non-critical tasks, switch to a cheaper model on rate limits: run one deployment with `LLM_MODEL=claude-sonnet-4-5` and one with a smaller model, and route to the fallback when the primary is rate-limited.
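One way to wire this up, assuming two deployments and a client-side `RateLimited` exception (both illustrative):

```python
class RateLimited(Exception):
    """Hypothetical exception raised on LLM_RATE_LIMITED (429)."""

def with_model_fallback(primary, fallback):
    """Try the primary deployment; fall back to the cheaper one on rate limits."""
    try:
        return primary()
    except RateLimited:
        return fallback()  # e.g. a deployment running a smaller model
```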
### Fall back to a previous answer

For chat-style applications, if the model is degraded, surface the previous turn’s content with an “I’m having trouble — try again shortly” message. That’s better than a spinner that never resolves.

### Queue and retry asynchronously

If the user’s task isn’t time-critical, queue the request while the Engine is unhealthy and retry in the background. Many integration flows don’t need an immediate response.

### Degrade tool features

Some agents use 20+ tools. If MCP servers are flaky, the agent can still function with the platform tools alone. Detect MCP unavailability upstream and remove those tools from the catalog before issuing the request.

## Logging errors
Capture, in your application logs:

- The error code.
- The `task_id`.
- The HTTP status (if applicable).
- Timing: when the error fired relative to the request.
- The retry count.
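A structured log line capturing those fields might look like this sketch (field names are suggestions, not an Engine convention):

```python
import json
import time

def log_engine_error(emit, *, code, task_id, status=None,
                     started_at=None, retry_count=0):
    """Emit one JSON log line per Engine error via the `emit` callable."""
    elapsed_ms = None
    if started_at is not None:
        elapsed_ms = int((time.monotonic() - started_at) * 1000)
    emit(json.dumps({
        "error_code": code,      # e.g. "LLM_RATE_LIMITED"
        "task_id": task_id,
        "http_status": status,   # None for tool_result errors
        "elapsed_ms": elapsed_ms,
        "retry_count": retry_count,
    }))
```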
## See also
- Errors — the complete error code catalog.
- Streaming — handling stream drops.
- Rate limits — provider-side rate-limiting in detail.
- Build a coding agent — error handling in a complete example.

