The states
State definitions:

| State | Meaning | Container | Brain volume | API key |
|---|---|---|---|---|
| provisioning | Boot in progress | created, may not be running | created | generated, encrypted |
| running | Healthy, serving requests | running | mounted | active |
| sleeping | Idle; container running but no recent traffic | running | mounted | active |
| stopped | Manually stopped | stopped | preserved | preserved |
| failed | Health probe failed past threshold | varies | preserved | preserved |
| destroying | Tear-down in progress | being removed | being removed | being revoked |

sleeping is informational: the container itself is still up, but
the orchestrator’s idle detector has marked the engine as
not actively serving. There’s no separate Docker state today.
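The table can be captured as a small enum, which makes the point about sleeping concrete: it shares a container state with running. This is a minimal sketch; the names `EngineState` and `container_should_be_running` are illustrative and not taken from the orchestrator's code.

```python
from enum import Enum


class EngineState(str, Enum):
    """Lifecycle states from the table above (class name is hypothetical)."""

    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


# States for which the table lists the container as "running".
# sleeping is included because it has no separate Docker state.
CONTAINER_RUNNING_STATES = frozenset({EngineState.RUNNING, EngineState.SLEEPING})


def container_should_be_running(state: EngineState) -> bool:
    """Return True when the table says the container is up in this state."""
    return state in CONTAINER_RUNNING_STATES
```

States such as provisioning and failed are left out of the set on purpose, since the table describes their container status as uncertain ("may not be running", "varies").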
The transitions
Provision (POST /engines/provision)
- Generate raw engine API key (sk-sept-<random>).
- Hash (SHA-256) and Fernet-encrypt with ORCH_MASTER_KEY.
- Create the engine row in engines with status='provisioning'.
- Allocate a port from [ORCH_PORT_MIN, ORCH_PORT_MAX] using a PG advisory lock.
- Call the backend’s create() (Docker container or subprocess).
- Poll GET /health on the engine until 200 (or ORCH_BOOT_TIMEOUT_S, default 60s).
- On success: set status='running', record boot_duration_ms, audit provision.
- On timeout/error: set status='failed', audit failure.
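Two of the steps above, key generation with hashing and the health poll with a timeout, can be sketched with the standard library alone. This is an illustration under stated assumptions: the function names, the random token length, and the polling interval are hypothetical, the `probe` callable stands in for a GET /health request that returned 200, and the Fernet encryption step is omitted because it needs a third-party library.

```python
import hashlib
import secrets
import time
from typing import Callable


def generate_engine_key() -> tuple[str, str]:
    """Return (raw_key, sha256_hex). The token length is an assumption."""
    raw = "sk-sept-" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return raw, digest


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,  # mirrors the ORCH_BOOT_TIMEOUT_S default
    interval_s: float = 0.5,  # polling interval is an assumption
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would then set the status to running when `wait_until_healthy` returns True and to failed otherwise, matching the last two steps of the list.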
Admit (POST /engines/{user_id}/admit)
Behavior depends on whether auto_provision or auto_wake is true:
- If no engine exists and auto_provision=true: provisions (transition above).
- If engine is sleeping and auto_wake=true: wakes (transitions to running).
- Otherwise: returns the current state.
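The three admit branches reduce to a small decision function. The sketch below is illustrative only: `admit_action` and its return labels are hypothetical names, and `state` is `None` when no engine exists for the user.

```python
from typing import Optional


def admit_action(state: Optional[str], auto_provision: bool, auto_wake: bool) -> str:
    """Decide what /admit does; state is None when no engine exists."""
    if state is None and auto_provision:
        return "provision"
    if state == "sleeping" and auto_wake:
        return "wake"
    # Every other combination just reports the current state.
    return "return_current_state"
```

Note that a sleeping engine with `auto_wake=false` stays sleeping, and a missing engine with `auto_provision=false` is simply reported as absent.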
Start (POST /engines/{user_id}/start)
Calls the backend’s start(). For Docker, this is container.start().
For subprocess, start() is not supported; start a fresh subprocess via
re-create instead.
Stop (POST /engines/{user_id}/stop)
Calls the backend’s stop() with a 30s grace period
(SIGTERM → wait → SIGKILL). The engine’s graceful shutdown should
complete in-flight turns and persist working memory.
The brain volume and registry row are preserved.
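For a subprocess, the SIGTERM → wait → SIGKILL sequence can be sketched with the standard library. This is not the orchestrator's implementation; `stop_with_grace` is a hypothetical helper that shows the pattern on a `subprocess.Popen` handle.

```python
import subprocess
import sys


def stop_with_grace(proc: subprocess.Popen, grace_s: float = 30.0) -> int:
    """Ask the process to exit, wait up to grace_s, then force-kill it.

    Returns the process exit code.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

For example, `stop_with_grace(subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"]), grace_s=5.0)` returns as soon as the child exits. An engine that handles SIGTERM has the full grace period to finish in-flight turns before the kill.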
Sleep (idle detector)
The idle detector marks an engine as sleeping if last_health_at is
older than ORCH_IDLE_SLEEP_THRESHOLD_S (default 1 hour). The
container stays up; the state change is mostly informational. Future
versions may aggressively stop sleeping containers to free RAM.
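The idle check itself is a timestamp comparison. In this minimal sketch the function name is hypothetical, and restricting the check to engines currently in the running state is an assumption rather than something the text above states.

```python
from datetime import datetime, timedelta

IDLE_SLEEP_THRESHOLD_S = 3600  # ORCH_IDLE_SLEEP_THRESHOLD_S default (1 hour)


def should_mark_sleeping(
    state: str,
    last_health_at: datetime,
    now: datetime,
    threshold_s: int = IDLE_SLEEP_THRESHOLD_S,
) -> bool:
    """True when a running engine's last_health_at is older than the threshold."""
    if state != "running":  # assumption: only running engines can go to sleep
        return False
    return (now - last_health_at) > timedelta(seconds=threshold_s)
```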
Wake
Triggered by /admit with auto_wake=true or an explicit
/start. Resets the activity timestamp.
Failure
The engine is marked failed after
ORCH_HEALTH_MAX_FAILURES consecutive missed probes. See
Health for the loop.
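The word "consecutive" matters: a single successful probe resets the count. A minimal sketch of that bookkeeping follows; `ProbeTracker` is a hypothetical helper, and the threshold passed to it stands in for ORCH_HEALTH_MAX_FAILURES, whose default is not given here.

```python
class ProbeTracker:
    """Counts consecutive missed probes for one engine (hypothetical helper)."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures  # stands in for ORCH_HEALTH_MAX_FAILURES
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if the engine should be marked failed."""
        if probe_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures
```

With a threshold of 3, the sequence miss, miss, ok, miss, miss, miss only trips on the final probe, because the success in the middle resets the counter.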
Auto-restart
The orchestrator restarts failed engines with exponential backoff
(ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, capped at
ORCH_RESTART_BACKOFF_MAX_S). After
ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts,
the engine stays failed and the orchestrator stops trying. Ops has
to intervene.
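The backoff formula can be worked through directly. In the sketch below the function names are hypothetical, counting attempts from 0 is an assumption, and the base and cap used in the usage note are example values, not documented defaults.

```python
def restart_delay_s(attempt: int, base_s: float, max_s: float) -> float:
    """Delay before restart number `attempt`: base * 2^attempt, capped at max_s."""
    return min(base_s * (2 ** attempt), max_s)


def restart_schedule(base_s: float, max_s: float, max_attempts: int = 8) -> list[float]:
    """Delays for every attempt up to ORCH_RESTART_MAX_ATTEMPTS (default 8)."""
    return [restart_delay_s(a, base_s, max_s) for a in range(max_attempts)]
```

With an example base of 2 seconds and a cap of 60 seconds, `restart_schedule(2.0, 60.0)` yields 2, 4, 8, 16, 32, 60, 60, 60: the cap takes over from the sixth attempt onward.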
Rotate key (POST /engines/{user_id}/rotate-key)
Generates a new key, delivers it to the engine (via
POST /rpc or env reload), and updates the registry. The plaintext is
returned to the caller. The old key is invalid immediately.
Destroy (DELETE /engines/{user_id})
- Set status='destroying'.
- Stop the container if running.
- Remove the container.
- Remove the brain volume (engine-data-{engine_id}).
- Free the port (port_allocations row deleted via FK cascade).
- Delete the engine row.
- Audit destroy.
Concurrency
Multiple lifecycle operations against the same user_id serialize on
the database row. Concurrent calls either return 409 (already in
the requested state) or queue behind whatever’s in flight. There’s no
direct background-task queue; FastAPI’s request handling does the
serialization.
Across users, lifecycle operations run independently.
Audit
Every transition writes a row to audit_log with the action name
(provision, start, stop, destroy, wake, auto_restart,
rotate_key, health_check), the actor (product slug or system),
the duration in milliseconds, and metadata.
Use the audit table for incident investigation.
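As an illustration of such an investigation, the sketch below lists one user's most recent transitions. Everything beyond the table name and the action, actor, duration, and metadata fields is an assumption: the `user_id` and `created_at` columns and the exact column names are guesses, the sample rows are invented for the demo, and an in-memory SQLite database stands in for the real one so the example is self-contained.

```python
import sqlite3

# In-memory stand-in for the orchestrator database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit_log (
           id INTEGER PRIMARY KEY,
           user_id TEXT, action TEXT, actor TEXT,
           duration_ms INTEGER, metadata TEXT, created_at TEXT
       )"""
)
# Invented sample rows, for demonstration only.
conn.executemany(
    "INSERT INTO audit_log (user_id, action, actor, duration_ms, metadata, created_at)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("u1", "provision", "product-a", 4200, "{}", "2024-01-01T10:00:00Z"),
        ("u1", "health_check", "system", 15, "{}", "2024-01-01T11:00:00Z"),
        ("u1", "auto_restart", "system", 3100, "{}", "2024-01-01T11:05:00Z"),
        ("u2", "stop", "product-a", 800, "{}", "2024-01-01T09:00:00Z"),
    ],
)


def recent_transitions(conn: sqlite3.Connection, user_id: str, limit: int = 20) -> list:
    """Return (action, actor, duration_ms, created_at) rows, newest first."""
    return conn.execute(
        "SELECT action, actor, duration_ms, created_at FROM audit_log"
        " WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
```

Reading the result newest first shows the sequence that led to the current state, for example a health_check followed by an auto_restart.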
What lives across transitions
- The brain SQLite file in the engine’s volume: preserved through stop, failed, sleep. Only destroy removes it.
- The engine API key (encrypted): stable through every transition except rotate-key and destroy.
- The port: held until destroy. Sleeping engines keep their port.
See also
- Architecture — where this lives in code.
- Health — the auto-restart loop.
- Backends — what each transition does at the Docker/subprocess level.
- API reference: lifecycle endpoints.

