Every production integration with the Engine needs a story for failures. The model returns an error. The provider rate-limits. The MCP server’s OAuth token expired. Your client lost the SSE stream. This page covers how the Engine surfaces failures and what to do about each kind.

Where errors come from

There are five sources of failure:
  1. Engine itself. Misconfiguration (bad env var, missing migration), internal bug, resource exhaustion. Returns 4xx/5xx HTTP.
  2. LLM provider. Rate limits, transient failures, content policy refusals, region issues. Surfaces as LLM_* errors.
  3. Tools. Sandbox-rejected commands, MCP server errors, expired credentials. Surfaces as tool_result.error events.
  4. HITL timeouts. The agent paused for an answer that never came. The stream stays open; the loop doesn’t progress.
  5. Network and client. Lost SSE connections, slow consumers, broken reverse proxies.
Each has its own retry strategy.

Error envelope

For non-2xx HTTP responses, the body is:
{
  "error": "<stable_code>",
  "message": "<human description>",
  "details": { /* optional */ }
}
For in-stream errors, the SSE event is:
event: error
data: { "error": "<stable_code>", "message": "...", "details": {} }
Use error for routing logic; message is human-readable but not stable.
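Routing on the stable code can be as simple as a lookup. A minimal sketch, assuming the envelope shape above; the `RETRYABLE` set is illustrative, drawn from the table in this page rather than the full catalog:

```python
import json

# Codes worth retrying automatically (assumption: based on the table on this
# page; extend from the full catalog on the Errors page).
RETRYABLE = {"KEY_ROTATION_PENDING", "LLM_RATE_LIMITED",
             "LLM_PROVIDER_ERROR", "MCP_CONNECTION_FAILED"}

def classify(body: str) -> tuple[str, bool]:
    """Parse an error envelope and decide whether to retry.

    Routes on the stable `error` code only; `message` is for humans.
    """
    payload = json.loads(body)
    code = payload["error"]
    return code, code in RETRYABLE

code, retry = classify('{"error": "LLM_RATE_LIMITED", "message": "slow down"}')
# code is "LLM_RATE_LIMITED"; retry is True
```

The same function works for both non-2xx bodies and in-stream `error` event payloads, since they share the envelope.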

Common errors and what to do

| Code | Status | Retry? | Strategy |
| --- | --- | --- | --- |
| INVALID_KEY | 401 | No | Fix the key. |
| KEY_ROTATION_PENDING | 401 | Yes | Retry with the new key. |
| TASK_NOT_FOUND | 404 | No | Don’t reference dead task IDs. |
| HITL_REQUIRED | 409 | No | Respond to the HITL request first. |
| LLM_RATE_LIMITED | 429 | Yes | Exponential backoff, starting at 1s. |
| LLM_PROVIDER_ERROR | 502 | Sometimes | Retry once. If still failing, surface to user. |
| CONTEXT_OVERFLOW | 422 | No | Start a new task or trim. |
| PERMISSION_DENIED | 403 | No | Surface to user; don’t auto-retry. |
| MCP_CONNECTION_FAILED | 502 | Sometimes | Retry once; otherwise surface. |
| MCP_TOKEN_EXPIRED | 401 | No | Reconnect via /assets/connect. |
| MIGRATION_REQUIRED | 503 | No | Engine needs ops attention. |
For the full code catalog see Errors.

Retry patterns

Exponential backoff for rate limits

import random
import time

def with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate to the caller
            # 1s, 2s, 4s, ... plus up to 1s of jitter
            time.sleep((2 ** attempt) + random.random())
Start at ~1s, double each retry, add jitter. After 5 attempts (roughly 15-20 seconds of waiting in total), give up.

Retry-once for transient provider errors

LLM_PROVIDER_ERROR covers everything from “the provider had a network hiccup” to “the model refused to respond.” A single retry catches the transient case; a refusal is permanent, and more retries won’t change it.
try:
    response = execute(message, task_id)
except ProviderError:
    response = execute(message, task_id)  # one retry
If both fail, the model is genuinely refusing or the provider is genuinely down. Surface to the user; don’t keep hammering.

Reconnect for stream drops

If your SSE stream drops mid-turn, reconnect with replay:
last_ts = 0  # timestamp of the last event we processed

while True:
    try:
        for event in stream_execute(task_id, after=last_ts):
            handle(event)
            last_ts = event["timestamp"]
    except ConnectionError:
        time.sleep(1)  # brief pause, then reconnect and replay from last_ts
        continue
    else:
        break  # stream ended cleanly
The Engine kept running while you were disconnected. Replay catches you up.

Don’t retry on permission errors

PERMISSION_DENIED means the user said no (or the static policy did). Retrying produces another denial. Surface the denial to the user; let them decide what to do.

Tool errors

Tool errors arrive as tool_result.error, not as stream-ending errors. The agent loop continues — the model gets the error in context and decides how to react.
event: tool_result
data: {
  "tool": "read_file",
  "tool_call_id": "tu_...",
  "output": "",
  "error": "File not found: /data/missing.md"
}
Three patterns the model uses to handle tool errors:
  1. Re-read with corrected input. “Oh, the file is at a different path — let me try again.”
  2. Switch tools. “I can’t read it directly; let me grep for it.”
  3. Ask the user. “I can’t find this file. Where does it live?”
You don’t usually need to do anything about tool errors at the application level — they’re routine. The exceptions are MCP errors:

MCP connection failures

When an MCP server is down, all calls to it fail. The circuit breaker in connection_manager.py will eventually disconnect the server to stop hammering. From your application, watch for MCP_CONNECTION_FAILED and prompt the user to reconnect through /assets/connect.
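A sketch of that watch, assuming the in-stream error envelope described above; `notify_reconnect` is a hypothetical application callback you supply, not an Engine API:

```python
# Illustrative helper: spot MCP connection failures in the event stream and
# tell the UI to offer a reconnect. Event shape follows the envelopes on
# this page; `notify_reconnect` is an assumed application callback.
def check_mcp_failure(event, notify_reconnect):
    if event.get("event") != "error":
        return False
    if event["data"].get("error") == "MCP_CONNECTION_FAILED":
        notify_reconnect("/assets/connect")  # prompt the user to reconnect
        return True
    return False
```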

MCP credential expiry

MCP_TOKEN_EXPIRED means the OAuth token expired and refresh failed. The user has to re-authorize. Reconnecting through /assets/connect starts a fresh OAuth flow.

HITL timeouts

When the agent emits an hitl_request, the stream stays open and the loop doesn’t progress until you respond. There’s no built-in timeout — the channel state TTL (CHANNEL_STATE_TTL_HITL, default 72 hours) governs how long the loop survives. After the TTL expires, the next interaction with that task starts fresh. The HITL request is gone. For application UX:
  • Surface HITL requests prominently. They’re blocking.
  • If the user closes the app and reopens, replay events from the last timestamp — the HITL request comes back.
  • Provide a “cancel” path that resolves the HITL with a default answer (usually “no”) so the agent can finish or abort cleanly.
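For the replay case, one sketch: scan replayed events for HITL requests that were never resolved. The `hitl_response` event name and `request_id` field are assumptions here; adapt them to the actual event schema:

```python
def pending_hitl(events):
    """Return HITL requests from a replayed event list that were never
    resolved, so the UI can re-surface them after a reconnect.

    Illustrative: assumes `hitl_request` / `hitl_response` event names
    sharing a `request_id` field.
    """
    open_requests = {}
    for ev in events:
        if ev["event"] == "hitl_request":
            open_requests[ev["data"]["request_id"]] = ev["data"]
        elif ev["event"] == "hitl_response":
            open_requests.pop(ev["data"]["request_id"], None)
    return list(open_requests.values())
```

Anything this returns should go straight to the foreground of your UI, since the loop is blocked on it.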

Graceful degradation

When something is broken, your application has options for what to do:

Fall back to a cheaper model

For non-critical tasks, switch to a cheaper model on rate limits:
try:
    return execute_against("primary-engine", message, task_id)
except RateLimited:
    return execute_against("light-engine", message, task_id)
Run two Engines, one with LLM_MODEL=claude-sonnet-4-5 and one with a smaller model.

Fall back to a previous answer

For chat-style applications, if the model is degraded, surface the previous turn’s content with an “I’m having trouble, try again shortly” notice. Better than a spinner that never resolves.
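A minimal sketch of that fallback cache; the class and its API are illustrative, not part of the Engine:

```python
class LastGoodCache:
    """Keep the last successful reply per task so the UI can degrade
    gracefully instead of showing an unresolved spinner. Illustrative
    application-side helper, not an Engine API."""

    def __init__(self):
        self._cache = {}

    def store(self, task_id, content):
        # Call after each successful turn.
        self._cache[task_id] = content

    def fallback(self, task_id, notice="I'm having trouble, try again shortly."):
        # Call when the current turn fails after retries are exhausted.
        content = self._cache.get(task_id)
        if content is None:
            return notice
        return f"{content}\n\n({notice})"
```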

Queue and retry asynchronously

If the user’s task isn’t time-critical, queue the request when the Engine is unhealthy and retry in the background. Many integration flows don’t need an immediate response.
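A minimal background-queue sketch, assuming you supply `send` (your function that calls the Engine); failed jobs go back on the queue after a delay:

```python
import queue
import threading
import time

def start_retry_worker(send, jobs: "queue.Queue", delay=5.0):
    """Drain `jobs` in the background, re-queueing failures.

    Illustrative: `send` is your Engine-calling function; a `None` job is
    the shutdown sentinel. Real code would cap retries per job.
    """
    def worker():
        while True:
            job = jobs.get()
            if job is None:          # sentinel: shut down
                return
            try:
                send(job)
            except Exception:
                time.sleep(delay)    # Engine still unhealthy; try later
                jobs.put(job)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```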

Degrade tool features

Some agents use 20+ tools. If MCP servers are flaky, the agent can still function with the platform tools alone. Detect MCP unavailability upstream and remove those tools from the catalog before issuing the request.
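A sketch of that filter, assuming each catalog entry records where the tool comes from; the `source` and `server` field names are illustrative:

```python
# Illustrative: strip tools backed by unhealthy MCP servers before issuing
# the request, keeping platform tools intact. Field names are assumptions.
def usable_tools(catalog, healthy_servers):
    return [
        t for t in catalog
        if t.get("source") != "mcp" or t.get("server") in healthy_servers
    ]
```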

Logging errors

Capture, in your application logs:
  • The error code.
  • The task_id.
  • The HTTP status (if applicable).
  • Timing — when the error fired relative to the request.
  • Retry count.
Don’t log the request body verbatim; it may contain user data. Log a hash or a redacted version.
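The fields above can be packed into one structured record; a sketch with illustrative field names, hashing the body rather than logging it:

```python
import hashlib
import time

def error_record(*, code, task_id, status=None, started_at, retry_count, body=""):
    """Build a structured, redacted error log record.

    Illustrative field names; the request body is hashed, never logged
    verbatim, so user data stays out of the logs.
    """
    return {
        "error": code,
        "task_id": task_id,
        "status": status,
        "elapsed_s": round(time.time() - started_at, 3),
        "retry_count": retry_count,
        "body_sha256": hashlib.sha256(body.encode()).hexdigest(),
    }
```

Emit the record through whatever structured logger your application already uses.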

See also