Every engine the orchestrator manages moves through a small, well-defined state machine. Each transition does specific work (a Docker call, a registry update) and writes an audit row. This page is the reference for the states and the transitions.

The states

State definitions:

| State | Meaning | Container | Brain volume | API key |
| --- | --- | --- | --- | --- |
| provisioning | Boot in progress | created, may not be running | created | generated, encrypted |
| running | Healthy, serving requests | running | mounted | active |
| sleeping | Idle; container running but no recent traffic | running | mounted | active |
| stopped | Manually stopped | stopped | preserved | preserved |
| failed | Health probe failed past threshold | varies | preserved | preserved |
| destroying | Tear-down in progress | being removed | being removed | being revoked |

sleeping is informational: the container itself is still up, but the orchestrator’s idle detector has marked the engine as not actively serving. There’s no separate Docker state today.
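
As a compact summary, the states above and the transitions documented on this page can be written as a lookup table. This is a minimal sketch, not the orchestrator's code: EngineState, ALLOWED, and can_transition are hypothetical names, self-transitions (rotate-key, a failed restart attempt) are left out, and treating destroying as terminal is an assumption.

```python
from enum import Enum


class EngineState(str, Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


# Transitions named on this page, keyed by source state.
# Destroy is reachable from any state, so it is handled in can_transition.
ALLOWED = {
    EngineState.PROVISIONING: {EngineState.RUNNING, EngineState.FAILED},
    EngineState.RUNNING: {
        EngineState.SLEEPING,
        EngineState.STOPPED,
        EngineState.FAILED,
    },
    EngineState.SLEEPING: {EngineState.RUNNING},
    EngineState.STOPPED: {EngineState.RUNNING},
    EngineState.FAILED: {EngineState.RUNNING},
    EngineState.DESTROYING: set(),
}


def can_transition(src: EngineState, dst: EngineState) -> bool:
    """Return True if the page documents a src -> dst transition."""
    if dst is EngineState.DESTROYING:
        # Assumption: a destroy already in flight is not restarted.
        return src is not EngineState.DESTROYING
    return dst in ALLOWED[src]
```

A table like this makes it easy to reject an unexpected request early, for example a stop issued against an engine that is already being destroyed.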

The transitions

Provision (POST /engines/provision)

[empty] → provisioning → running (or failed)
Steps inside the orchestrator:
  1. Generate raw engine API key (sk-sept-<random>).
  2. Hash the key (SHA-256) and Fernet-encrypt it with ORCH_MASTER_KEY.
  3. Create the engine row in engines with status='provisioning'.
  4. Allocate a port from [ORCH_PORT_MIN, ORCH_PORT_MAX] using a PG advisory lock.
  5. Call the backend’s create() — Docker container or subprocess.
  6. Poll GET /health on the engine until 200 (or ORCH_BOOT_TIMEOUT_S, default 60s).
  7. On success: set status='running', record boot_duration_ms, audit provision.
  8. On timeout/error: set status='failed', audit failure.
The plaintext key is returned in the response and never stored.
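
Steps 1, 2, and 6 to 8 can be sketched with the standard library alone. This is an illustration under assumptions, not the orchestrator's implementation: generate_engine_key and wait_until_healthy are hypothetical names, the shape of the random suffix is a guess, the Fernet encryption step is omitted, and probe stands in for whatever issues GET /health and returns the HTTP status code.

```python
import hashlib
import secrets
import time


def generate_engine_key():
    """Return (plaintext, sha256_hex) for a new engine API key."""
    # Assumption: the random part is a URL-safe token; the page only
    # specifies the sk-sept- prefix.
    plaintext = "sk-sept-" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def wait_until_healthy(probe, timeout_s=60.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns 200 or timeout_s elapses.

    Returns (status, boot_duration_ms), where status is 'running'
    on success and 'failed' on timeout.
    """
    started = clock()
    while True:
        try:
            healthy = probe() == 200
        except OSError:
            # Connection refused while the engine is still booting.
            healthy = False
        elapsed = clock() - started
        if healthy:
            return "running", int(elapsed * 1000)
        if elapsed >= timeout_s:
            return "failed", int(elapsed * 1000)
        sleep(interval_s)
```

Passing clock and sleep as parameters keeps the polling loop testable without waiting out a real boot timeout.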

Admit (POST /engines/{user_id}/admit)

(any) → (status_unchanged or auto_provision/auto_wake)
This is the product’s main entry point. It changes state only when auto_provision or auto_wake is true:
  • If no engine exists and auto_provision=true: provisions (transition above).
  • If engine is sleeping and auto_wake=true: wakes (transitions to running).
  • Otherwise: returns the current state.
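
The three cases above reduce to a small decision function. The sketch below is hypothetical (admit_action and its return labels are illustrative names, not the real handler) and only decides what to do; it performs no provisioning or waking itself.

```python
def admit_action(state, auto_provision=False, auto_wake=False):
    """Decide what /admit should do.

    state is the engine's current status string, or None when the
    user has no engine yet.
    """
    if state is None and auto_provision:
        return "provision"
    if state == "sleeping" and auto_wake:
        return "wake"
    # Every other combination just reports the current state.
    return "status_unchanged"
```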

Start (POST /engines/{user_id}/start)

sleeping → running
stopped → running
Calls the backend’s start(). For Docker, this is container.start(). The subprocess backend does not support start(); re-create the engine to launch a fresh subprocess instead.

Stop (POST /engines/{user_id}/stop)

running → stopped
Calls the backend’s stop() with a 30s grace period (SIGTERM → wait → SIGKILL). The engine’s graceful shutdown should complete in-flight turns and persist working memory. The brain volume and registry row are preserved.
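
For the subprocess backend, the SIGTERM, wait, SIGKILL sequence might look like the following. It is a sketch using Python's subprocess module; stop_with_grace is a hypothetical helper, and the real backend's stop() may differ.

```python
import subprocess


def stop_with_grace(proc: subprocess.Popen, grace_s: float = 30.0):
    """Ask proc to exit, then force it if the grace period runs out."""
    proc.terminate()  # SIGTERM on POSIX: lets the engine shut down cleanly
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: the engine did not exit in time
        proc.wait()  # reap the process so it does not linger as a zombie
    return proc.returncode
```

The grace period is what gives the engine room to finish in-flight turns and persist working memory before it is killed.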

Sleep (idle detector)

running → sleeping
The orchestrator marks engines as sleeping if last_health_at is older than ORCH_IDLE_SLEEP_THRESHOLD_S (default 1 hour). The container stays up; the state change is mostly informational. Future versions may aggressively stop sleeping containers to free RAM.
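
The idle check amounts to a timestamp comparison. A minimal sketch, assuming the threshold is applied only to engines that are currently running; should_sleep is a hypothetical name and threshold_s mirrors ORCH_IDLE_SLEEP_THRESHOLD_S:

```python
from datetime import datetime, timedelta


def should_sleep(status: str, last_health_at: datetime, now: datetime,
                 threshold_s: int = 3600) -> bool:
    """Return True if a running engine has been idle past the threshold."""
    if status != "running":
        # Only running engines move to sleeping.
        return False
    return (now - last_health_at) > timedelta(seconds=threshold_s)
```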

Wake

sleeping → running
Triggered by a fresh /admit with auto_wake=true or an explicit /start. Resets the activity timestamp.

Failure

running → failed
The health monitor marks an engine failed after ORCH_HEALTH_MAX_FAILURES consecutive missed probes. See Health for the loop.
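
The consecutive-failure rule can be illustrated with a small counter. FailureTracker is a hypothetical class, not the health monitor itself; max_failures stands in for ORCH_HEALTH_MAX_FAILURES, whose default is not stated here.

```python
class FailureTracker:
    """Count consecutive missed health probes for one engine."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.consecutive = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if the engine should be marked failed."""
        if probe_ok:
            # Any successful probe resets the streak.
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive >= self.max_failures
```

Because a successful probe resets the counter, an engine that misses probes intermittently never reaches the threshold.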

Auto-restart

failed → running (success)
failed → failed (retry exceeded)
The orchestrator schedules restart attempts with exponential backoff (ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, capped at ORCH_RESTART_BACKOFF_MAX_S). After ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts, the engine stays failed and the orchestrator stops trying. Ops has to intervene.
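
The backoff formula is easy to tabulate. In this sketch the function names are hypothetical, attempts are assumed to be numbered from 0, and the base and cap values in the usage note are made-up examples rather than documented defaults.

```python
def restart_delay_s(attempt: int, base_s: float, max_s: float) -> float:
    """Delay before restart number `attempt` (0-based): base * 2^attempt, capped."""
    return min(base_s * 2 ** attempt, max_s)


def restart_schedule(base_s: float, max_s: float, max_attempts: int = 8):
    """Delays for every attempt before the orchestrator gives up."""
    return [restart_delay_s(a, base_s, max_s) for a in range(max_attempts)]
```

With an illustrative base of 5 seconds and a cap of 300 seconds, restart_schedule(5, 300) gives [5, 10, 20, 40, 80, 160, 300, 300]: the delay doubles until it reaches the cap and then stays there.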

Rotate key (POST /engines/{user_id}/rotate-key)

running → running
Generates a new key, swaps the hash on the engine container (via POST /rpc or env reload), and updates the registry. The plaintext is returned to the caller. The old key becomes invalid immediately.

Destroy (DELETE /engines/{user_id})

(any) → destroying → [removed]
  1. Set status='destroying'.
  2. Stop the container if running.
  3. Remove the container.
  4. Remove the brain volume (engine-data-{engine_id}).
  5. Free the port (port_allocations row deleted via FK cascade).
  6. Delete the engine row.
  7. Audit destroy.
Once a destroy completes, the user’s data is unrecoverable. Take a backup first if you need it.

Concurrency

Multiple lifecycle operations against the same user_id serialize on the database row. A concurrent call either returns 409 (the engine is already in the requested state) or queues behind whatever is in flight. There’s no dedicated background-task queue; FastAPI’s request handling does the serialization. Across users, lifecycle operations run independently.

Audit

Every transition writes a row to audit_log with the action name (provision, start, stop, destroy, wake, auto_restart, rotate_key, health_check), the actor (product slug or system), the duration in milliseconds, and metadata. Use the audit table for incident investigation:
SELECT timestamp, action, actor, metadata
FROM audit_log
WHERE engine_id = '<id>'
ORDER BY timestamp DESC
LIMIT 50;

What lives across transitions

  • The brain SQLite file in the engine’s volume — preserved through stop, failed, sleep. Only destroy removes it.
  • The engine API key (encrypted) — stable through every transition except rotate-key and destroy.
  • The port — held until destroy. Sleeping engines keep their port.
