Every engine the orchestrator manages moves through a small, well-defined state machine. Each transition does specific work (Docker call, registry update, audit entry) and emits an audit row. This page is the reference for the states and the transitions.
The states
State definitions:

| State | Meaning | Container | Brain volume | API key |
|---|---|---|---|---|
| provisioning | Boot in progress | created, may not be running | created | generated, encrypted |
| running | Healthy, serving requests | running | mounted | active |
| sleeping | Idle; container running but no recent traffic | running | mounted | active |
| stopped | Manually stopped | stopped | preserved | preserved |
| failed | Health probe failed past threshold | varies | preserved | preserved |
| destroying | Tear-down in progress | being removed | being removed | being revoked |

sleeping is informational: the container itself is still up, but
the orchestrator’s idle detector has marked the engine as
not actively serving. There’s no separate Docker state for it today.
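
As a rough illustration, the states in the table can be modelled as an enum with a transition map. The map below is an assumption inferred from this page, not a copy of the orchestrator's code, and the names `EngineState`, `ALLOWED`, and `can_transition` are hypothetical.

```python
from enum import Enum


class EngineState(str, Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


S = EngineState

# Assumed transition map, inferred from this page; the real orchestrator
# may allow more or fewer moves.
ALLOWED = {
    S.PROVISIONING: {S.RUNNING, S.FAILED, S.DESTROYING},
    S.RUNNING: {S.SLEEPING, S.STOPPED, S.FAILED, S.DESTROYING},
    S.SLEEPING: {S.RUNNING, S.STOPPED, S.FAILED, S.DESTROYING},
    S.STOPPED: {S.RUNNING, S.DESTROYING},
    S.FAILED: {S.RUNNING, S.DESTROYING},
    S.DESTROYING: set(),
}


def can_transition(current: EngineState, target: EngineState) -> bool:
    """Return True if the assumed map allows moving from current to target."""
    return target in ALLOWED[current]
```

Under this map, `can_transition(EngineState.SLEEPING, EngineState.RUNNING)` holds (a wake), while nothing leaves `destroying`.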
The transitions
Provision (POST /engines/provision)
- Generate raw engine API key (sk-sept-<random>).
- Hash (SHA-256) and Fernet-encrypt with ORCH_MASTER_KEY.
- Create the engine row in engines with status='provisioning'.
- Allocate a port from [ORCH_PORT_MIN, ORCH_PORT_MAX] using a PG advisory lock.
- Call the backend’s create() (Docker container or subprocess).
- Poll GET /health on the engine until 200 (or ORCH_BOOT_TIMEOUT_S, default 60s).
- On success: set status='running', record boot_duration_ms, audit provision.
- On timeout/error: set status='failed', audit failure.
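
The boot-polling step can be sketched as a loop with a deadline. This is a minimal sketch, not the orchestrator's implementation: the function name `wait_until_healthy`, the `probe` callable (standing in for `GET /health` and returning an HTTP status code), and the polling interval are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float = 60.0,   # mirrors the documented ORCH_BOOT_TIMEOUT_S default
    interval_s: float = 0.5,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Poll probe until it returns 200 or the timeout elapses.

    Returns (status, boot_duration_ms), where status is 'running' or 'failed'.
    """
    started = clock()
    while True:
        try:
            healthy = probe() == 200
        except OSError:
            healthy = False  # e.g. connection refused while the engine boots
        elapsed = clock() - started
        if healthy:
            return "running", int(elapsed * 1000)
        if elapsed >= timeout_s:
            return "failed", int(elapsed * 1000)
        sleep(interval_s)
```

Injecting `probe`, `clock`, and `sleep` keeps the loop testable without a real engine; a caller would pass a function that issues the HTTP request.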
Admit (POST /engines/{user_id}/admit)
The behavior depends on whether auto_provision or auto_wake is true:
- If no engine exists and auto_provision=true: provisions (transition above).
- If engine is sleeping and auto_wake=true: wakes (transitions to running).
- Otherwise: returns the current state.
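
The admit rules reduce to a small decision function. The sketch below is illustrative only; the name `admit_action` and the returned labels are hypothetical, and `None` is assumed to mean that no engine exists for the user.

```python
from typing import Optional


def admit_action(state: Optional[str], auto_provision: bool, auto_wake: bool) -> str:
    """Return the action /admit would take: 'provision', 'wake', or 'report'."""
    if state is None and auto_provision:
        return "provision"  # no engine yet: run the provision transition
    if state == "sleeping" and auto_wake:
        return "wake"       # sleeping engine: transition to running
    return "report"         # otherwise just return the current state
```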
Start (POST /engines/{user_id}/start)
Calls the backend’s start(). For Docker, this is container.start().
For subprocess, it’s not supported; start a fresh subprocess via
re-create instead.
Stop (POST /engines/{user_id}/stop)
Calls the backend’s stop() with a 30s grace period
(SIGTERM → wait → SIGKILL). The engine’s graceful shutdown should
complete in-flight turns and persist working memory.
The brain volume and registry row are preserved.
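
For the subprocess case, the SIGTERM → wait → SIGKILL sequence might look like the sketch below, using only the standard library. The helper name `stop_process` is hypothetical, and the signal mapping described in the comments applies on POSIX systems.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 30.0) -> int:
    """Ask the process to exit, wait up to grace_s, then force-kill it."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for an engine: a child process that would otherwise sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child, grace_s=5.0))
```

A process that handles SIGTERM gets the full grace period to finish in-flight work; one that ignores it is killed once the period expires.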
Sleep (idle detector)
The idle detector marks an engine as sleeping if last_health_at is
older than ORCH_IDLE_SLEEP_THRESHOLD_S (default 1 hour). The
container stays up; the state change is mostly informational. Future
versions may aggressively stop sleeping containers to free RAM.
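
The idle check amounts to a timestamp comparison. A minimal sketch, assuming the detector only considers engines that are currently running; the function name `next_state` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

IDLE_SLEEP_THRESHOLD_S = 3600  # mirrors the documented default of 1 hour


def next_state(
    state: str,
    last_health_at: datetime,
    now: datetime,
    threshold_s: int = IDLE_SLEEP_THRESHOLD_S,
) -> str:
    """Mark a running engine as sleeping once last_health_at is older than the threshold."""
    if state == "running" and now - last_health_at > timedelta(seconds=threshold_s):
        return "sleeping"
    return state


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
idle_for_two_hours = now - timedelta(hours=2)
idle_for_ten_minutes = now - timedelta(minutes=10)
```

With these sample timestamps, `next_state("running", idle_for_two_hours, now)` yields `"sleeping"`, while the ten-minute case stays `"running"`.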
Wake
Triggered by /admit with auto_wake=true or an explicit
/start. Resets the activity timestamp.
Failure
An engine is marked failed after
ORCH_HEALTH_MAX_FAILURES consecutive missed probes. See
Health for the loop.
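
The "consecutive missed probes" rule can be sketched as a counter that resets on any success. The function name `apply_probe` is hypothetical, and the threshold of 3 mentioned below is an illustrative value, not the documented default of ORCH_HEALTH_MAX_FAILURES.

```python
def apply_probe(state: str, failures: int, probe_ok: bool, max_failures: int) -> tuple[str, int]:
    """Return the (state, consecutive_failures) pair after one health probe."""
    if probe_ok:
        return state, 0  # any successful probe resets the streak
    failures += 1
    if failures >= max_failures:
        return "failed", failures
    return state, failures
```

With `max_failures=3`, two misses followed by a success leave the engine running with a zero count; only three misses in a row flip it to failed.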
Auto-restart
The orchestrator restarts a failed engine with exponential backoff
(ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, capped at
ORCH_RESTART_BACKOFF_MAX_S). After
ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts,
the engine stays failed and the orchestrator stops trying. Ops has
to intervene.
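
The backoff formula can be worked through in a few lines. In this sketch the function names are hypothetical, attempts are assumed to be counted from 0, and the base of 2 seconds and cap of 60 seconds are illustrative values only; this page does not state the defaults for ORCH_RESTART_BACKOFF_BASE_S or ORCH_RESTART_BACKOFF_MAX_S.

```python
def restart_delay_s(attempt: int, base_s: float, max_s: float) -> float:
    """Delay before restart number `attempt` (0-based): base * 2^attempt, capped."""
    return min(base_s * (2 ** attempt), max_s)


def restart_schedule(base_s: float, max_s: float, max_attempts: int = 8) -> list[float]:
    """Delays for every attempt up to max_attempts (documented default: 8)."""
    return [restart_delay_s(a, base_s, max_s) for a in range(max_attempts)]


# Illustrative values: base 2s, cap 60s.
print(restart_schedule(2, 60))  # [2, 4, 8, 16, 32, 60, 60, 60]
```

The cap matters quickly: with these values the sixth attempt would otherwise wait 64 seconds, so it and every later attempt are held at 60.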
Rotate key (POST /engines/{user_id}/rotate-key)
Generates a new engine API key, delivers it to the engine (via
POST /rpc or env reload), and updates the registry. The plaintext is
returned to the caller. The old key is invalid immediately.
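
The "old key invalid immediately" property follows from overwriting the stored hash. The sketch below uses only the standard library and makes several assumptions: the random part of the key is generated with `secrets.token_urlsafe`, the registry is a plain dict standing in for the database, and the Fernet encryption step and delivery to the engine are left out. All function names are hypothetical.

```python
import hashlib
import secrets


def new_engine_key() -> tuple[str, str]:
    """Return (plaintext_key, sha256_hex_digest)."""
    plaintext = "sk-sept-" + secrets.token_urlsafe(32)  # assumed random format
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def rotate(registry: dict[str, str], user_id: str) -> str:
    """Replace the stored hash for user_id and return the new plaintext once."""
    plaintext, digest = new_engine_key()
    registry[user_id] = digest  # old hash is overwritten, so the old key stops verifying
    return plaintext


def verify(registry: dict[str, str], user_id: str, presented: str) -> bool:
    """Check a presented key against the stored hash in constant time."""
    expected = registry.get(user_id)
    if expected is None:
        return False
    actual = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return secrets.compare_digest(actual, expected)
```

Because only the hash is stored in this sketch, the plaintext returned by `rotate` is the caller's single chance to capture the key.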
Destroy (DELETE /engines/{user_id})
- Set status='destroying'.
- Stop the container if running.
- Remove the container.
- Remove the brain volume (engine-data-{engine_id}).
- Free the port (port_allocations row deleted via FK cascade).
- Delete the engine row.
- Audit destroy.
Concurrency
Multiple lifecycle operations against the same user_id serialize on
the database row. Concurrent calls either return 409 (already in
the requested state) or queue behind whatever’s in flight. There’s no
direct background-task queue; FastAPI’s request handling does the
serialization.
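
The per-user behavior can be illustrated with an in-process stand-in. This sketch swaps the database row for a `threading.Lock` per user_id and a dict of states, so it shows the shape of the behavior (same-user calls queue, a repeat request gets 409) rather than the orchestrator's actual mechanism. All names are hypothetical.

```python
import threading

_registry_guard = threading.Lock()
_locks: dict[str, threading.Lock] = {}
_states: dict[str, str] = {}


def _lock_for(user_id: str) -> threading.Lock:
    """Return the single lock for this user, creating it on first use."""
    with _registry_guard:
        return _locks.setdefault(user_id, threading.Lock())


def request_state(user_id: str, target: str) -> int:
    """Apply a lifecycle change for one user; return an HTTP-style status code."""
    with _lock_for(user_id):  # calls for the same user queue here
        if _states.get(user_id) == target:
            return 409  # already in the requested state
        _states[user_id] = target
        return 200
```

Calls for different users take different locks, so they never wait on each other.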
Across users, lifecycle operations run independently.
Audit
Every transition writes a row to audit_log with the action name
(provision, start, stop, destroy, wake, auto_restart,
rotate_key, health_check), the actor (product slug or system),
the duration in milliseconds, and metadata.
Use the audit table for incident investigation:
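
A minimal sketch of such a query, using an in-memory SQLite table as a stand-in for the real audit_log. The column names (`user_id`, `action`, `actor`, `duration_ms`, `metadata`, `created_at`) are assumptions based on the fields listed above, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit_log (
           id INTEGER PRIMARY KEY,
           user_id TEXT, action TEXT, actor TEXT,
           duration_ms INTEGER, metadata TEXT, created_at TEXT)"""
)
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO audit_log (user_id, action, actor, duration_ms, metadata, created_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("u1", "provision", "demo-product", 4200, "{}", "2024-01-01T10:00:00Z"),
        ("u1", "auto_restart", "system", 900, "{}", "2024-01-01T11:00:00Z"),
        ("u2", "stop", "demo-product", 150, "{}", "2024-01-01T11:30:00Z"),
    ],
)


def recent_actions(conn: sqlite3.Connection, user_id: str, limit: int = 20) -> list:
    """Return the most recent audit rows for one user, newest first."""
    rows = conn.execute(
        "SELECT created_at, action, actor, duration_ms FROM audit_log "
        "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    )
    return rows.fetchall()


for row in recent_actions(conn, "u1"):
    print(row)
```

Filtering on the action column (for example `auto_restart`) is a natural next step when tracing a flapping engine.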
What lives across transitions
- The brain SQLite file in the engine’s volume: preserved through stop, failed, sleep. Only destroy removes it.
- The engine API key (encrypted): stable through every transition except rotate-key and destroy.
- The port: held until destroy. Sleeping engines keep their port.
See also
- Architecture — where this lives in code.
- Health — the auto-restart loop.
- Backends — what each transition does at the Docker/subprocess level.
- API reference: lifecycle endpoints.

