The picture
The components
HTTP API (server.py)
A FastAPI app with five route groups:
- Engine lifecycle: POST /engines/provision, DELETE /engines/{user_id}, POST /engines/{user_id}/{start,stop,rotate-key}.
- Discovery + admission: GET /engines/{user_id}, POST /engines/{user_id}/admit.
- Fleet: GET /engines, GET /status, GET /metrics.
- Admin: POST /products/register, PUT /products/{product_id}/policy. Requires X-Admin-Key.
- Health: GET /health (no auth).
Product-authenticated requests go through _get_product(), which hashes
the X-Platform-Key header, looks up the matching product, and rejects
unknown keys.
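The hash-then-lookup step can be sketched as follows. This is a minimal illustration, not the code in server.py: the SHA-256 choice, the Product fields, the in-memory product list, and the UnknownPlatformKey exception are all assumptions made for the example.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    slug: str
    platform_key_hash: str  # hex digest only; the plaintext key is not kept


def hash_key(raw_key: str) -> str:
    # Assumption: SHA-256. The orchestrator's real hash function may differ.
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


class UnknownPlatformKey(Exception):
    """Hypothetical error raised when no product matches the presented key."""


def get_product(platform_key: Optional[str], products: list[Product]) -> Product:
    if not platform_key:
        raise UnknownPlatformKey("missing X-Platform-Key")
    presented = hash_key(platform_key)
    for product in products:
        # compare_digest runs in constant time for equal-length inputs
        if hmac.compare_digest(presented, product.platform_key_hash):
            return product
    raise UnknownPlatformKey("unknown X-Platform-Key")


products = [Product(slug="demo", platform_key_hash=hash_key("pk-demo-secret"))]
found = get_product("pk-demo-secret", products)
print(found.slug)
```

Because only the digest is stored, a leaked products table does not reveal usable platform keys.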
Registry (registry.py)
The CRUD layer over Postgres. Two main types:
- Product: a tenant on the orchestrator. Has a slug, a hashed platform API key, and a JSON policy.
- Engine: one user’s Engine instance. Has a status, a port, a hashed engine API key, and a Fernet-encrypted copy of the plaintext key.
Lifecycle (lifecycle.py)
The state machine for an Engine. Valid states:
Discovery (discovery.py)
The product’s main entry point goes through here:
POST /engines/{user_id}/admit calls discovery.admit(), which:
- Asks policy.check_admit(): rate limit + quota.
- Looks up the engine in the registry.
- If none, auto_provision=true, and policy allows, provisions.
- If sleeping and auto_wake=true, wakes.
- Returns {admitted: true, engine: {url, api_key, ...}} or {admitted: false, ...}.
Direct calls to POST /engines/provision are explicit and skip the auto-policy.
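The admit sequence can be sketched in a few lines. This is an illustration under stated assumptions rather than the contents of discovery.py: the Engine and Policy dataclasses, the dict standing in for the registry, the injected check_admit/provision/wake callables, and the reason field on refusals are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    url: str
    api_key: str
    status: str  # "running" or "sleeping" in this sketch


@dataclass
class Policy:
    auto_provision: bool = True
    auto_wake: bool = True


def admit(user_id, policy, engines, check_admit, provision, wake) -> dict:
    # 1. Rate-limit + quota check comes first.
    if not check_admit():
        return {"admitted": False, "reason": "policy"}
    # 2. Registry lookup (a plain dict here).
    engine = engines.get(user_id)
    if engine is None:
        # 3. No engine yet: provision only if the policy opts in.
        if not policy.auto_provision:
            return {"admitted": False, "reason": "no_engine"}
        engine = provision(user_id)
        engines[user_id] = engine
    elif engine.status == "sleeping":
        # 4. Sleeping engine: wake only if the policy opts in.
        if not policy.auto_wake:
            return {"admitted": False, "reason": "sleeping"}
        wake(engine)
    # 5. Hand the caller what it needs to talk to the engine.
    return {"admitted": True, "engine": {"url": engine.url, "api_key": engine.api_key}}


def provision(user_id):
    return Engine(url=f"http://engine-{user_id}:8000", api_key="ek-example", status="running")


def wake(engine):
    engine.status = "running"


engines = {}
first = admit("u-123", Policy(), engines, lambda: True, provision, wake)
engines["u-123"].status = "sleeping"
second = admit("u-123", Policy(), engines, lambda: True, provision, wake)
denied = admit("u-123", Policy(), engines, lambda: False, provision, wake)
```

The first call provisions, the second wakes the sleeping engine, and the third is refused by the policy check before the registry is consulted.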
Policy (policy.py)
Two checks:
- Quota. A product’s policy carries max_engines: int. Provision fails if the count would exceed it.
- Rate limit. A product’s policy carries rate_limit_rpm: int. Admit fails if the per-product per-minute rate would exceed it.
Both checks are implemented by a PolicyEngine that reads the product’s
policy JSON. No external store; no Redis. The state lives in the registry
plus an in-memory rate counter.
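A minimal sketch of the two checks, assuming a fixed one-minute window for the rate counter (the text only says "per-product per-minute"; a sliding window would also fit). The method names check_quota and check_rate and the injectable clock are assumptions for the example.

```python
import time


class PolicyEngine:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # product_id -> (window index, requests seen in that window)
        self._windows: dict[str, tuple[int, int]] = {}

    def check_quota(self, policy: dict, current_engines: int) -> bool:
        # Provisioning one more engine must not exceed max_engines.
        return current_engines + 1 <= policy["max_engines"]

    def check_rate(self, product_id: str, policy: dict) -> bool:
        window = int(self._clock() // 60)  # fixed one-minute buckets
        start, count = self._windows.get(product_id, (window, 0))
        if start != window:
            start, count = window, 0  # a new minute began: reset the counter
        if count >= policy["rate_limit_rpm"]:
            self._windows[product_id] = (start, count)
            return False
        self._windows[product_id] = (start, count + 1)
        return True


now = [0.0]  # fake clock so the example is deterministic
engine = PolicyEngine(clock=lambda: now[0])
policy = {"max_engines": 2, "rate_limit_rpm": 3}

results = [engine.check_rate("product-a", policy) for _ in range(4)]
now[0] = 61.0
after_window = engine.check_rate("product-a", policy)
```

Since the counter lives in process memory, it resets on restart and is not shared between orchestrator processes.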
Health (health.py)
A background loop that runs every ORCH_HEALTH_CHECK_INTERVAL_S
(default 30 s). For each running engine, in parallel up to a
semaphore of 50:
- GET /health against the engine’s URL with ORCH_HEALTH_CHECK_TIMEOUT_S (default 10 s).
- On 200, reset health_failures, update last_health_at.
- On failure, increment health_failures. After ORCH_HEALTH_MAX_FAILURES (default 3), mark the engine failed.
- Schedule auto-restart with exponential backoff: min(ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, ORCH_RESTART_BACKOFF_MAX_S).
- After ORCH_RESTART_MAX_ATTEMPTS (default 8), give up. The engine stays failed until ops intervenes.
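The loop and the backoff formula can be sketched as below. This is an approximation under assumptions, not health.py itself: engines are plain dicts, the probe is an injected coroutine instead of a real HTTP call, last_health_at is omitted, "after 3 failures" is read as "at the third consecutive failure", and the base of 5 s and cap of 300 s are made-up values because the text does not give those defaults.

```python
import asyncio


def restart_delay(attempt: int, base_s: float, max_s: float) -> float:
    # min(base * 2^attempt, max), as in the formula above
    return min(base_s * 2 ** attempt, max_s)


async def check_all(engines, probe, max_failures=3, concurrency=50):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` probes in flight

    async def check_one(engine):
        async with sem:
            try:
                healthy = await probe(engine)
            except Exception:
                healthy = False  # timeouts and connection errors count as failures
        if healthy:
            engine["health_failures"] = 0
        else:
            engine["health_failures"] += 1
            if engine["health_failures"] >= max_failures:
                engine["status"] = "failed"

    running = [e for e in engines if e["status"] == "running"]
    await asyncio.gather(*(check_one(e) for e in running))


async def fake_probe(engine):
    # Stand-in for GET /health: only the engine named "bad" fails.
    return engine["name"] != "bad"


fleet = [
    {"name": "ok", "status": "running", "health_failures": 0},
    {"name": "bad", "status": "running", "health_failures": 0},
]
for _ in range(3):  # three passes of the loop
    asyncio.run(check_all(fleet, fake_probe))

# Hypothetical base (5 s) and cap (300 s), over the 8 default attempts.
schedule = [restart_delay(a, base_s=5, max_s=300) for a in range(8)]
```

With those illustrative values the delays double from 5 s until they hit the 300 s cap on the seventh attempt.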
Backends (backends/)
Two implementations of EngineBackend (an abstract base):
- docker_backend.py: production. Creates Docker containers, mounts the data volume, attaches to the Docker network, runs the configured engine image.
- subprocess_backend.py: dev/test. Spawns python -m uvicorn src.server:app as a local subprocess. No Docker.
The backend is selected with ORCH_ENGINE_BACKEND. See
Backends.
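The abstract-base pattern might look roughly like this. Only the names EngineBackend and ORCH_ENGINE_BACKEND come from the text; the start/stop method signatures, the setting values "docker" and "subprocess", and the in-memory stand-in class are assumptions, and nothing here creates a container or a process.

```python
from abc import ABC, abstractmethod
from typing import Mapping


class EngineBackend(ABC):
    """Hypothetical interface; the real abstract methods are not listed here."""

    @abstractmethod
    def start(self, user_id: str, port: int) -> str:
        """Start an engine and return its base URL."""

    @abstractmethod
    def stop(self, user_id: str) -> None:
        """Stop the engine for this user."""


class InMemoryBackend(EngineBackend):
    """Stand-in that only records state."""

    def __init__(self, host_template: str):
        self.host_template = host_template
        self.running: dict[str, str] = {}

    def start(self, user_id, port):
        host = self.host_template.format(user_id=user_id)
        url = f"http://{host}:{port}"
        self.running[user_id] = url
        return url

    def stop(self, user_id):
        self.running.pop(user_id, None)


# Assumed setting values mapped to factories.
BACKENDS = {
    "docker": lambda: InMemoryBackend("engine-{user_id}"),
    "subprocess": lambda: InMemoryBackend("127.0.0.1"),
}


def select_backend(environ: Mapping[str, str]) -> EngineBackend:
    name = environ.get("ORCH_ENGINE_BACKEND")
    if name not in BACKENDS:
        raise ValueError(f"unsupported ORCH_ENGINE_BACKEND: {name!r}")
    return BACKENDS[name]()


backend = select_backend({"ORCH_ENGINE_BACKEND": "subprocess"})
url = backend.start("u-123", 9001)
```

Keeping the lifecycle code written against the base class is what lets dev and test runs avoid Docker entirely.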
Audit (audit.py)
Every lifecycle action writes a row to audit_log: action, actor,
product/user/engine IDs, metadata, duration. Used for incident
investigation and compliance reporting.
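One way to make sure every action gets a row, including its duration, is to wrap the action in a context manager. The sketch below is illustrative only: a Python list stands in for the audit_log table, and the field names (duration_ms, product_id, user_id) and the audited helper are assumptions, not the schema in audit.py.

```python
import time
from contextlib import contextmanager


@contextmanager
def audited(log: list, action: str, actor: str, **ids):
    """Time the wrapped block and append one audit row when it exits."""
    start = time.monotonic()
    row = {"action": action, "actor": actor, **ids, "metadata": {}}
    try:
        # The caller can attach details to the row's metadata while working.
        yield row["metadata"]
    finally:
        # Written even if the action raised, so failures are audited too.
        row["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        log.append(row)


audit_log = []
with audited(audit_log, "provision", actor="product:demo",
             product_id="demo", user_id="u-123") as meta:
    meta["port"] = 9001  # stand-in for the real provisioning work

print(audit_log[0]["action"], audit_log[0]["duration_ms"])
```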
Metrics (metrics.py)
Aggregates over the registry and audit log:
- GET /status: count by state, unhealthy count, overall health.
- GET /metrics: provisions, crashes, restarts in the last hour; average boot time; lifetime totals.
What state lives where
| State | Where | Lifetime |
|---|---|---|
| Products | products table | Forever |
| Engines | engines table | Until destroyed |
| Port allocations | port_allocations table | Until engine destroyed |
| Audit | audit_log table | Forever (until pruned) |
| Rate-limit counters | In-memory | Process lifetime |
| Engine API keys (plaintext) | Never stored | Returned to caller, then encrypted |
| Engine API keys (encrypted) | engines.engine_key_enc | Until destroyed |
| Engine API keys (hashed) | engines.engine_key_hash | Until destroyed |
The encrypted copy uses Fernet encryption (key taken from ORCH_MASTER_KEY).
What runs where
| Container | Image | Purpose |
|---|---|---|
| postgres | postgres:16-alpine | Orchestrator state. |
| orchestrator | bap-engine prod image | The FastAPI process. |
| engine containers | september-engine:<version> | Per-user Engine instances. Started/stopped on demand. |
The containers share a Docker network (engine_net) so the orchestrator
can reach engine /health endpoints by container hostname.
Where it doesn’t go
Some choices the orchestrator deliberately doesn’t make:
- Multi-region. All engines run on one host. For multi-region, you run multiple bap-engine deployments, one per region.
- Migration. No live migration of an engine from one host to another. To move a user, destroy the source engine and provision a fresh one (their brain is in the volume; mount it on the new host).
- Cross-product user reuse. A user_id is per-product. The orchestrator does not know that user u-123 in product A is the same human as user u-123 in product B.
See also
- Engine contract — the API surface between orchestrator and engines.
- Lifecycle — the state machine in detail.
- Backends — Docker vs subprocess.
- Health — auto-restart and the loop.

