Engine lifecycle

Every engine the orchestrator manages moves through a small, well-defined state machine. Each transition does specific work (a Docker call, a registry update) and writes an audit row. This page is the reference for the states and the transitions.

The states

State definitions:

State	Meaning	Container	Brain volume	API key
`provisioning`	Boot in progress	created, may not be running	created	generated, encrypted
`running`	Healthy, serving requests	running	mounted	active
`sleeping`	Idle; container running but no recent traffic	running	mounted	active
`stopped`	Manually stopped	stopped	preserved	preserved
`failed`	Health probe failed past threshold	varies	preserved	preserved
`destroying`	Tear-down in progress	being removed	being removed	being revoked

sleeping is informational: the container is still up, but the orchestrator’s idle detector has marked the engine as not actively serving. It has no separate Docker state today.
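The states and the status changes this page describes can be written down as a small lookup table. The following is a minimal sketch, not the orchestrator's actual code: the names `EngineState`, `ALLOWED`, and `can_transition` are hypothetical, and the rule that an engine already in `destroying` cannot be destroyed again is an assumption.

```python
from enum import Enum


class EngineState(str, Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


# Status changes named on this page. Self-loops (rotate-key, an exhausted
# auto-restart) do not change the status and are left out. Destroy is allowed
# from any state, so it is handled separately below.
ALLOWED = {
    EngineState.PROVISIONING: {EngineState.RUNNING, EngineState.FAILED},
    EngineState.RUNNING: {EngineState.SLEEPING, EngineState.STOPPED, EngineState.FAILED},
    EngineState.SLEEPING: {EngineState.RUNNING},
    EngineState.STOPPED: {EngineState.RUNNING},
    EngineState.FAILED: {EngineState.RUNNING},
    EngineState.DESTROYING: set(),
}


def can_transition(current: EngineState, target: EngineState) -> bool:
    """Return True if the page documents a status change current -> target."""
    if target is EngineState.DESTROYING:
        # Assumption: a destroy already in progress is not restarted.
        return current is not EngineState.DESTROYING
    return target in ALLOWED[current]
```

For example, `can_transition(EngineState.STOPPED, EngineState.SLEEPING)` is false, because only a running engine can be marked as sleeping.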

The transitions

Provision (`POST /engines/provision`)

[empty] → provisioning → running (or failed)

Steps inside the orchestrator:

Generate raw engine API key (sk-sept-<random>).
Hash the key (SHA-256) and Fernet-encrypt it with ORCH_MASTER_KEY.
Create the engine row in engines with status='provisioning'.
Allocate a port from [ORCH_PORT_MIN, ORCH_PORT_MAX] using a PG advisory lock.
Call the backend’s create() — Docker container or subprocess.
Poll GET /health on the engine until it returns 200 or ORCH_BOOT_TIMEOUT_S (default 60s) elapses.
On success: set status='running', record boot_duration_ms, audit provision.
On timeout/error: set status='failed', audit failure.
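The health-poll step and its two outcomes can be sketched as a loop with a deadline. This is an illustration under assumptions, not the orchestrator's implementation: `wait_until_healthy`, the `probe` callable (standing in for GET /health and returning an HTTP status code), and the polling interval are all hypothetical. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float = 60.0,  # ORCH_BOOT_TIMEOUT_S default
    interval_s: float = 0.5,  # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Poll until probe() returns 200 or the timeout elapses.

    Returns (status, boot_duration_ms), where status is 'running' or 'failed'.
    """
    started = clock()
    while True:
        try:
            ok = probe() == 200
        except OSError:
            ok = False  # e.g. connection refused while the engine is booting
        elapsed = clock() - started
        if ok:
            return "running", int(elapsed * 1000)
        if elapsed >= timeout_s:
            return "failed", int(elapsed * 1000)
        sleep(interval_s)
```

The caller would then write the returned status to the engine row and record the matching audit entry.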

The plaintext key is returned in the response and is never stored in plaintext; the registry keeps only the hash and the encrypted form.
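A minimal sketch of the key generation and hashing steps, using only the standard library. The length and encoding of the random part are assumptions, and the function names are hypothetical. The Fernet encryption with ORCH_MASTER_KEY is left out because it needs the third-party `cryptography` package.

```python
import hashlib
import secrets


def generate_engine_key() -> tuple[str, str]:
    """Return (plaintext_key, sha256_hex). Only the digest would be persisted."""
    # Assumption: 32 random bytes, URL-safe base64 encoded, after the prefix.
    plaintext = "sk-sept-" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_engine_key(presented: str, stored_digest: str) -> bool:
    """Check a presented key against the stored SHA-256 digest."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return secrets.compare_digest(candidate, stored_digest)
```

Hashing lets a presented key be checked without keeping the plaintext anywhere.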

Admit (`POST /engines/{user_id}/admit`)

(any) → (status_unchanged or auto_provision/auto_wake)

This is the product’s main entry point. It does not change state unless auto_provision or auto_wake is true:

If no engine exists and auto_provision=true: provisions (transition above).
If engine is sleeping and auto_wake=true: wakes (transitions to running).
Otherwise: returns the current state.
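The three rules above reduce to a small decision function. This is a sketch only: `admit_action` and its return values are hypothetical names. Treating "no engine and auto_provision=false" as a plain report is an assumption, since the page does not say what is returned in that case.

```python
from typing import Optional


def admit_action(status: Optional[str], auto_provision: bool, auto_wake: bool) -> str:
    """Decide what /admit does. status is None when no engine exists.

    Returns 'provision', 'wake', or 'report' (return the current state as-is).
    """
    if status is None:
        # Assumption: without auto_provision the absence is simply reported.
        return "provision" if auto_provision else "report"
    if status == "sleeping" and auto_wake:
        return "wake"
    return "report"
```

Note that a stopped or failed engine is only reported, never started, by this path: auto_wake applies to sleeping engines alone.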

Start (`POST /engines/{user_id}/start`)

sleeping → running
stopped → running

Calls the backend’s start(). For Docker, this is container.start(). The subprocess backend does not support start(); re-create the engine to launch a fresh subprocess instead.

Stop (`POST /engines/{user_id}/stop`)

running → stopped

Calls the backend’s stop() with a 30s grace period (SIGTERM → wait → SIGKILL). The engine’s graceful shutdown should complete in-flight turns and persist working memory. The brain volume and registry row are preserved.
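For the subprocess backend, the SIGTERM, wait, SIGKILL sequence could look roughly like the sketch below. `stop_process` is a hypothetical helper, not the backend's real stop() method, and the signal names in the comments apply to POSIX systems.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 30.0) -> int:
    """Ask the process to exit, then force-kill it if the grace period runs out."""
    proc.terminate()  # SIGTERM on POSIX: the engine can finish in-flight work
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for an engine: a child process that would otherwise sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child, grace_s=5.0))
```

The grace period matters because the engine is expected to complete in-flight turns and persist working memory before exiting.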

Sleep (idle detector)

running → sleeping

The orchestrator marks engines as sleeping if last_health_at is older than ORCH_IDLE_SLEEP_THRESHOLD_S (default 1 hour). The container stays up; the state change is mostly informational. Future versions may aggressively stop sleeping containers to free RAM.
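The idle check itself is a timestamp comparison. A minimal sketch, assuming the detector only considers engines that are currently running. `should_sleep` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

IDLE_SLEEP_THRESHOLD_S = 3600  # ORCH_IDLE_SLEEP_THRESHOLD_S default (1 hour)


def should_sleep(
    status: str,
    last_health_at: datetime,
    now: datetime,
    threshold_s: int = IDLE_SLEEP_THRESHOLD_S,
) -> bool:
    """Return True if a running engine has been idle longer than the threshold."""
    if status != "running":
        return False  # assumption: only running engines can be marked sleeping
    return now - last_health_at > timedelta(seconds=threshold_s)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_sleep("running", now - timedelta(minutes=90), now))  # True
print(should_sleep("running", now - timedelta(minutes=10), now))  # False
```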

Wake

sleeping → running

Triggered by a fresh /admit with auto_wake=true or an explicit /start. Resets the activity timestamp.

Failure

running → failed

The health monitor marks an engine failed after ORCH_HEALTH_MAX_FAILURES consecutive missed probes. See Health for the loop.
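"Consecutive" means that a single successful probe resets the count. A small sketch of that bookkeeping follows. `ProbeTracker` is a hypothetical name, and the threshold of 3 in the usage line is an illustrative value, not the documented default of ORCH_HEALTH_MAX_FAILURES.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    max_failures: int  # would come from ORCH_HEALTH_MAX_FAILURES
    consecutive_failures: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True when the engine should be marked failed."""
        if probe_ok:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures


tracker = ProbeTracker(max_failures=3)  # 3 is illustrative only
print([tracker.record(ok) for ok in [False, False, True, False, False, False]])
# [False, False, False, False, False, True]
```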

Auto-restart

failed → running (success)
failed → failed (retry exceeded)

The orchestrator schedules restart attempts with exponential backoff (ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, capped at ORCH_RESTART_BACKOFF_MAX_S). After ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts, the engine stays failed and the orchestrator stops trying. Ops has to intervene.
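The backoff formula can be worked through directly. In the sketch below, the base of 5 seconds and the cap of 300 seconds are illustrative values only, since this page does not state the defaults for ORCH_RESTART_BACKOFF_BASE_S and ORCH_RESTART_BACKOFF_MAX_S. Counting attempts from 0 is also an assumption.

```python
def restart_delay_s(attempt: int, base_s: float, max_s: float) -> float:
    """Delay before restart number `attempt` (0-based): base * 2^attempt, capped."""
    return min(base_s * (2 ** attempt), max_s)


def restart_schedule(base_s: float, max_s: float, max_attempts: int = 8) -> list[float]:
    """Delays for every attempt up to ORCH_RESTART_MAX_ATTEMPTS (default 8)."""
    return [restart_delay_s(a, base_s, max_s) for a in range(max_attempts)]


# Illustrative values: base 5s, cap 300s.
print(restart_schedule(base_s=5, max_s=300))
# [5, 10, 20, 40, 80, 160, 300, 300]
```

With these example values the cap takes effect from the seventh attempt onward, so the delay stops growing well before the attempts run out.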

Rotate key (`POST /engines/{user_id}/rotate-key`)

running → running

Generates a new key, swaps the hash on the engine container (via POST /rpc or an env reload), and updates the registry. The new plaintext key is returned to the caller. The old key becomes invalid immediately.

Destroy (`DELETE /engines/{user_id}`)

(any) → destroying → [removed]

Set status='destroying'.
Stop the container if running.
Remove the container.
Remove the brain volume (engine-data-{engine_id}).
Free the port (port_allocations row deleted via FK cascade).
Delete the engine row.
Audit destroy.

Once a destroy completes, the user’s data is unrecoverable. Take a backup first if you need it.

Concurrency

Multiple lifecycle operations against the same user_id serialize on the database row. Concurrent calls either return 409 (already in the requested state) or queue behind whatever’s in flight. There’s no direct background-task queue; FastAPI’s request handling does the serialization. Across users, lifecycle operations run independently.

Audit

Every transition writes a row to audit_log with the action name (provision, start, stop, destroy, wake, auto_restart, rotate_key, health_check), the actor (product slug or system), the duration in milliseconds, and metadata. Use the audit table for incident investigation:

SELECT timestamp, action, actor, metadata
FROM audit_log
WHERE engine_id = '<id>'
ORDER BY timestamp DESC
LIMIT 50;

What lives across transitions

The brain SQLite file in the engine’s volume — preserved through stop, failed, sleep. Only destroy removes it.
The engine API key (encrypted) — stable through every transition except rotate-key and destroy.
The port — held until destroy. Sleeping engines keep their port.
