Durability and resume

Tasks live longer than HTTP connections. A /execute call can run for minutes; the user’s laptop can sleep; the SSE stream can drop. The Engine is designed so none of those break the agent’s progress. This page covers the durability primitives and how to use them from your client.

What’s durable

In the brain database:

Task state. Every conversation thread, by task_id. Persists indefinitely.
Channel state. Mid-execution checkpoints. TTL-bound (default 3 h active, 72 h HITL).
Working memory. Within-task transient context. TTL-bound.
Long-term memory. Episodes, knowledge, social graph. Indefinite.

What’s NOT durable:

The HTTP connection. It can drop at any time.
Sandbox state across tasks. Sandbox workspaces reset between tasks.
Background processes. The watchdog kills orphans when the task ends.

Three failure scenarios

1. Client disconnects mid-stream

The client’s network drops, or the laptop sleeps, or the user closes the tab. The Engine is still running the turn. What to do: reconnect with replay.

GET /execute/replay?after=<last_seen_timestamp>
X-Engine-Key: <key>

The Engine replays every event emitted on this task after the timestamp. Pass 0 to replay everything available. The events come from the channel state, which persists for CHANNEL_STATE_TTL_ACTIVE (default 3 hours). After that, replay returns nothing — the task continues but its event buffer is gone.
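As a minimal sketch, the replay request could be assembled like this. The helper name `build_replay_request` and the example base URL are hypothetical; only the path, the `after` parameter, and the `X-Engine-Key` header come from the request shown above. The value passed as `after` is assumed to be whatever the `timestamp` field of the last received event contained.

```python
from urllib.parse import urlencode, urljoin


def build_replay_request(base_url, key, after=0):
    """Return (url, headers) for a replay call.

    `after` is the last timestamp the client saw; 0 asks for every
    event still held in channel state.
    """
    query = urlencode({"after": after})
    url = urljoin(base_url.rstrip("/") + "/", "execute/replay") + "?" + query
    headers = {"X-Engine-Key": key}
    return url, headers


# Hypothetical host and key, for illustration only.
url, headers = build_replay_request(
    "https://engine.example.com", "secret", after=1700000000
)
```

The returned pair can be handed to any HTTP client that supports streaming responses.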

2. Engine restarts mid-turn

The Engine container gets killed (deploy, OOM, host failure) while a turn is in flight. What to do depends on how it stopped. On a graceful shutdown, nothing special: the Engine drains in-flight turns before exiting. On an abrupt shutdown (kill -9), the turn dies mid-stream. The task and all prior memory are intact; only the in-flight turn is lost. The client should detect this (the SSE stream closes without a final thread_lifecycle: completed) and re-issue the call with the same user message. The new turn starts from the task state as it was before the half-completed turn.
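One way to detect a lost turn and re-issue it is sketched below. The event shape is an assumption: the text only names the marker `thread_lifecycle: completed`, so the `type` and `status` keys are illustrative. The `send` callable is hypothetical and stands in for whatever function streams a `/execute` call.

```python
from typing import Callable, Iterable, List


def turn_completed(events: Iterable[dict]) -> bool:
    # Assumed event shape: {"type": "thread_lifecycle", "status": "completed"}.
    return any(
        e.get("type") == "thread_lifecycle" and e.get("status") == "completed"
        for e in events
    )


def execute_until_complete(
    send: Callable[[str, str], Iterable[dict]],
    message: str,
    task_id: str,
    max_attempts: int = 3,
) -> List[dict]:
    """Re-issue the same message until a turn ends with a completion marker."""
    for _ in range(max_attempts):
        events = list(send(message, task_id))
        if turn_completed(events):
            return events
    raise RuntimeError(f"turn did not complete after {max_attempts} attempts")
```

Capping the attempts keeps a client from looping forever if the Engine stays down.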

3. HITL request, no immediate answer

The agent emits a hitl_request and waits. The user is at lunch. What to do: nothing — the channel state holds the paused state for CHANNEL_STATE_TTL_HITL (default 72 hours). When the user comes back and your client posts to /hitl/respond, the loop resumes. If the user doesn’t come back within the TTL, the task continues to exist but the in-flight turn is lost. The next interaction with that task starts a fresh turn.

Channel state, in depth

The Engine snapshots execution state to channel_state_snapshots at key boundaries:

After every model call.
After every tool result.
When emitting a HITL request.
On graceful shutdown.

Each snapshot is keyed by task_id and timestamped. Replay reads events emitted after a given timestamp. Two TTLs apply:

TTL	Default	Used for
`CHANNEL_STATE_TTL_ACTIVE`	10800 s (3 h)	Active turns — the agent is running.
`CHANNEL_STATE_TTL_HITL`	259200 s (72 h)	Paused turns waiting on HITL.

The longer HITL TTL exists because user response time is unpredictable. Tasks blocked on permission prompts shouldn’t expire after lunch.
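The interaction between snapshots, replay, and the two TTLs can be modelled with a toy in-memory class. This is a sketch for reasoning about the behaviour, not the Engine's storage code. The `ChannelState` class and its methods are hypothetical, and it assumes the TTL is measured from the most recent snapshot.

```python
from dataclasses import dataclass, field
from typing import List

CHANNEL_STATE_TTL_ACTIVE = 10800   # seconds (3 h), default from the table
CHANNEL_STATE_TTL_HITL = 259200    # seconds (72 h), default from the table


@dataclass
class ChannelState:
    """Toy stand-in for one task's channel state."""
    waiting_on_hitl: bool = False
    events: List[dict] = field(default_factory=list)

    def record(self, event: dict) -> None:
        self.events.append(event)

    def ttl(self) -> int:
        # Paused turns get the longer HITL TTL.
        if self.waiting_on_hitl:
            return CHANNEL_STATE_TTL_HITL
        return CHANNEL_STATE_TTL_ACTIVE

    def replay(self, after: float, now: float) -> List[dict]:
        # Assumption: expiry is counted from the latest snapshot.
        if not self.events or now - self.events[-1]["timestamp"] > self.ttl():
            return []
        return [e for e in self.events if e["timestamp"] > after]
```

With this model, an active turn whose last snapshot is older than 3 hours replays nothing, while the same buffer marked as waiting on HITL stays replayable for 72 hours.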

Implementing resilient streaming

A robust client looks like this:

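# Assumption: HTTP goes through the `requests` library; swap in your
# client's equivalent exceptions if you use something else.
from requests.exceptions import ConnectionError, ReadTimeout


class EngineClient:
    def __init__(self, url, key):
        self.url = url
        self.key = key
        self.last_ts = {}  # task_id -> timestamp of the last event seen

    def stream(self, message, task_id):
        """Yield events for a turn, falling back to replay if the connection drops."""
        try:
            for event in self._fresh_stream(message, task_id):
                self.last_ts[task_id] = event["timestamp"]
                yield event
        except (ConnectionError, ReadTimeout):
            # The Engine is still running the turn, so pick up
            # from the last event we saw.
            yield from self._replay(task_id)

    def _replay(self, task_id):
        """Replay events after the last seen timestamp (0 = everything available)."""
        last = self.last_ts.get(task_id, 0)
        for event in self._raw_replay_stream(task_id, after=last):
            self.last_ts[task_id] = event["timestamp"]
            yield event

    def _fresh_stream(self, message, task_id):
        # Transport-specific: call /execute and yield parsed events.
        raise NotImplementedError

    def _raw_replay_stream(self, task_id, after):
        # Transport-specific: call /execute/replay?after=<after> and yield parsed events.
        raise NotImplementedError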

The pattern: track last_ts per task; on disconnect, replay from there.
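The client above does not show how `_fresh_stream` and `_raw_replay_stream` turn a response body into events. One plausible building block is a small Server-Sent Events parser. The sketch below assumes the Engine emits standard SSE framing and that each `data:` payload is a JSON object carrying a `timestamp` field; both are assumptions, and `parse_sse` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded event per blank-line-delimited SSE message."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event:, id:, retry:)
        # are ignored in this sketch.
```

An event that is still buffered when the stream ends is dropped, which matches SSE dispatch rules and means a half-delivered event never advances `last_ts`.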

Resuming after the user closes the app

For mobile or tab-style apps where the user might close and reopen hours later:

Persist task_id and last_ts to the user’s device.
When the user reopens, call /execute/replay?after=<last_ts>.
If the response is empty (TTL expired), the in-flight turn (if any) is gone. Surface “your last task expired” and let them start fresh.
If a HITL request comes back through replay, surface the prompt to the user.
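The steps above can be sketched as follows. Everything here is illustrative: the file-based storage, the function names, the returned status strings, and the `type` key used to spot a `hitl_request` are assumptions, and `replay` is a hypothetical callable wrapping the `/execute/replay` call.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple


def save_cursor(path: Path, task_id: str, last_ts: float) -> None:
    # Write to a temp file first so a crash cannot leave a half-written cursor.
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps({"task_id": task_id, "last_ts": last_ts}))
    tmp.replace(path)


def load_cursor(path: Path) -> Optional[Tuple[str, float]]:
    try:
        data = json.loads(path.read_text())
        return data["task_id"], data["last_ts"]
    except (FileNotFoundError, json.JSONDecodeError, KeyError):
        return None


def on_reopen(
    path: Path, replay: Callable[[str, float], Iterable[dict]]
) -> Tuple[str, List[dict]]:
    """Decide what the app should show when the user comes back."""
    cursor = load_cursor(path)
    if cursor is None:
        return "no_task", []
    task_id, last_ts = cursor
    events = list(replay(task_id, last_ts))
    if not events:
        return "expired", []          # surface "your last task expired"
    if any(e.get("type") == "hitl_request" for e in events):
        return "needs_input", events  # surface the prompt to the user
    return "resumed", events
```

Calling `save_cursor` after each processed event keeps the stored `last_ts` close to what the user actually saw.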

Active concurrency

Per task_id, only one /execute runs at a time. A second concurrent call returns 409 Conflict. If you legitimately need to run two things at once for the same user, use two task_ids. They share the brain (so they share long-term memory) but their channel states are independent.
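If a client would rather wait for the running turn than open a second task, it can treat 409 as "try again later". The sketch below shows one such policy with exponential backoff. The `call` argument is a hypothetical function that issues `/execute` and returns a `(status_code, body)` pair; the retry counts and delays are arbitrary.

```python
import time
from typing import Callable, Tuple


def execute_when_free(
    call: Callable[[], Tuple[int, object]],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Retry a call while the Engine answers 409 Conflict."""
    for attempt in range(attempts):
        status, body = call()
        if status != 409:
            return status, body
        if attempt < attempts - 1:
            # Back off: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** attempt)
    return status, body  # still 409 after the last attempt
```

The `sleep` parameter is injectable so the policy can be exercised in tests without real waiting.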

Cleanup

Tasks accumulate in the brain. Until the Engine offers it as a feature, periodic cleanup is your responsibility:

Old channel states expire automatically (TTL).
Old trajectories are processed by the Learning Centre but not deleted by default. Cap retention with a periodic cleanup job.
Conversations persist until you delete them.

A typical retention policy: keep raw trajectories for 30 days, archive to cold storage for 1 year, delete after.
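A cleanup job built around that policy might classify each trajectory by age, as in the sketch below. The function name and the action labels are hypothetical, and it assumes the one-year archive period starts when the 30-day raw window ends.

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=30)       # keep raw trajectories
ARCHIVE_RETENTION = timedelta(days=365)  # then hold in cold storage


def retention_action(created_at: datetime, now: datetime) -> str:
    """Return "keep", "archive", or "delete" for a trajectory."""
    age = now - created_at
    if age <= RAW_RETENTION:
        return "keep"
    # Assumption: the archive year is counted after the raw window.
    if age <= RAW_RETENTION + ARCHIVE_RETENTION:
        return "archive"
    return "delete"


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
action = retention_action(now - timedelta(days=45), now)
```

Passing `now` explicitly keeps the function deterministic, so one batch is classified against a single reference time.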