bap-engine is a single FastAPI process backed by PostgreSQL. It speaks HTTP to products on one side and the Docker API to Engine containers on the other. This page maps what’s inside.

The picture

The components

HTTP API (server.py)

A FastAPI app with five route groups:
  • Engine lifecycle: POST /engines/provision, DELETE /engines/{user_id}, POST /engines/{user_id}/{start,stop,rotate-key}.
  • Discovery + admission: GET /engines/{user_id}, POST /engines/{user_id}/admit.
  • Fleet: GET /engines, GET /status, GET /metrics.
  • Admin: POST /products/register, PUT /products/{product_id}/policy. Requires X-Admin-Key.
  • Health: GET /health (no auth).
Auth is enforced at the route level by _get_product(), which hashes X-Platform-Key, looks up the matching product, and rejects unknown keys.
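
As a rough illustration of the hash-then-look-up step, here is a minimal sketch. It is not the code in server.py: the SHA-256 choice, the in-memory PRODUCT_KEY_HASHES dict, and the helper names hash_key and lookup_product are all assumptions made for the example.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical stand-in for the products table: slug -> hex digest of the platform key.
PRODUCT_KEY_HASHES = {
    "demo-product": hashlib.sha256(b"pk_demo_123").hexdigest(),
}


def hash_key(plaintext: str) -> str:
    """Hash a platform key. SHA-256 is an assumption; the source does not name the hash."""
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def lookup_product(platform_key: Optional[str]) -> Optional[str]:
    """Return the slug whose stored hash matches the key, or None for unknown keys."""
    if not platform_key:
        return None
    candidate = hash_key(platform_key)
    for slug, stored in PRODUCT_KEY_HASHES.items():
        # Constant-time comparison avoids leaking how much of the digest matched.
        if hmac.compare_digest(candidate, stored):
            return slug
    return None
```

A route dependency built on this would presumably turn a None result into an HTTP 401 or 403; which status the real _get_product() returns is not stated here.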

Registry (registry.py)

The CRUD layer over Postgres. Two main types:
  • Product: a tenant on the orchestrator. Has a slug, a hashed platform API key, and a JSON policy.
  • Engine: one user’s Engine instance. Has a status, a port, a hashed engine API key, and a Fernet-encrypted copy of the plaintext key.
Registry methods are pure — no side effects on Docker, no audit writes. Higher layers compose them.

Lifecycle (lifecycle.py)

The state machine for an Engine. Valid states:
provisioning → running → sleeping → running → stopped → destroying
                ↓                              ↑
              failed ──── (auto-restart) ─────┘
Each transition issues the corresponding call through the chosen backend (a Docker call in production), updates the registry, and writes an audit entry. Lifecycle is the only layer allowed to call into the backend; every other layer goes through lifecycle.
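
One way to encode the diagram is a transition table that rejects anything not drawn. The sketch below is one reading of the arrows, not the contents of lifecycle.py: it assumes the auto-restart arrow leads from failed back to running, and the names EngineState, ALLOWED, can_transition, and transition are hypothetical. The real state machine may well permit more (for example, starting a stopped engine).

```python
from enum import Enum


class EngineState(str, Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


# Allowed transitions, limited to the arrows in the diagram (an interpretation).
ALLOWED = {
    EngineState.PROVISIONING: {EngineState.RUNNING},
    EngineState.RUNNING: {EngineState.SLEEPING, EngineState.STOPPED, EngineState.FAILED},
    EngineState.SLEEPING: {EngineState.RUNNING},
    EngineState.STOPPED: {EngineState.DESTROYING},
    EngineState.FAILED: {EngineState.RUNNING},  # assumed target of auto-restart
    EngineState.DESTROYING: set(),
}


def can_transition(current: EngineState, target: EngineState) -> bool:
    """True if the diagram, as read above, draws an arrow from current to target."""
    return target in ALLOWED[current]


def transition(current: EngineState, target: EngineState) -> EngineState:
    """Return the new state, or raise ValueError for a transition that is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```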

Discovery (discovery.py)

This is the product’s main entry point: POST /engines/{user_id}/admit calls discovery.admit(), which:
  1. Asks policy.check_admit() — rate-limit + quota.
  2. Looks up the engine in the registry.
  3. If none and auto_provision=true and policy allows, provisions.
  4. If sleeping and auto_wake=true, wakes.
  5. Returns {admitted: true, engine: {url, api_key, ...}} or {admitted: false, ...}.
Discovery is the only layer that’s allowed to auto-provision. Direct calls to /engines/provision are explicit and skip the auto-policy.
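
The five steps can be sketched as a single function. This is an illustration under assumptions, not discovery.py itself: the registry is a plain dict, policy, provisioning, and waking are passed in as callables, and the "reason" field on refusals is hypothetical (the source only shows `{admitted: false, ...}`).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Engine:
    user_id: str
    status: str  # e.g. "running" or "sleeping"
    url: str


def admit(
    user_id: str,
    engines: Dict[str, Engine],
    check_admit: Callable[[], bool],
    provision: Callable[[str], Optional[Engine]],
    wake: Callable[[Engine], None],
    auto_provision: bool = True,
    auto_wake: bool = True,
) -> dict:
    # 1. Rate-limit and quota check.
    if not check_admit():
        return {"admitted": False, "reason": "policy"}
    # 2. Look up the engine.
    engine = engines.get(user_id)
    # 3. Provision if missing and allowed; provision() returns None when policy refuses.
    if engine is None:
        engine = provision(user_id) if auto_provision else None
        if engine is None:
            return {"admitted": False, "reason": "no_engine"}
        engines[user_id] = engine
    # 4. Wake a sleeping engine if allowed.
    if engine.status == "sleeping" and auto_wake:
        wake(engine)
    # 5. Admit only if the engine ended up running.
    if engine.status != "running":
        return {"admitted": False, "reason": engine.status}
    return {"admitted": True, "engine": {"url": engine.url}}
```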

Policy (policy.py)

Two checks:
  • Quota. A product’s policy carries max_engines: int. Provision fails if the count would exceed it.
  • Rate limit. A product’s policy carries rate_limit_rpm: int. Admit fails if the per-product per-minute rate would exceed it.
Policy is a simple PolicyEngine that reads the product’s policy JSON. No external store; no Redis. The state lives in the registry plus an in-memory rate counter.
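
A minimal sketch of the two checks, assuming a sliding 60-second window for the rate counter (the source only says "in-memory rate counter", so the windowing strategy is a guess). The class name PolicySketch and its method names are hypothetical; only max_engines and rate_limit_rpm come from the policy described above.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PolicySketch:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests can control time
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check_quota(self, policy: dict, current_engines: int) -> bool:
        """Allow a provision only if one more engine stays within max_engines."""
        return current_engines + 1 <= policy["max_engines"]

    def check_rate(self, product_id: str, policy: dict) -> bool:
        """Allow an admit only if fewer than rate_limit_rpm hits fall in the last 60 s."""
        now = self._clock()
        hits = self._hits[product_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= 60.0:
            hits.popleft()
        if len(hits) >= policy["rate_limit_rpm"]:
            return False
        hits.append(now)
        return True
```

Because the counter lives in process memory, it resets on restart and is not shared between processes, which matches the single-process design described above.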

Health (health.py)

A background loop that runs every ORCH_HEALTH_CHECK_INTERVAL_S (default 30 s). For each running engine, in parallel up to a semaphore of 50:
  1. GET /health against the engine’s URL with ORCH_HEALTH_CHECK_TIMEOUT_S (default 10 s).
  2. On 200, reset health_failures, update last_health_at.
  3. On failure, increment health_failures. After ORCH_HEALTH_MAX_FAILURES (default 3), mark the engine failed.
  4. Schedule auto-restart with exponential backoff: min(ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, ORCH_RESTART_BACKOFF_MAX_S).
  5. After ORCH_RESTART_MAX_ATTEMPTS (default 8), give up. The engine stays failed until ops intervenes.
Health is what makes the fleet self-healing.
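
The failure counting and backoff formula above can be sketched as two small functions. The function names are hypothetical, the source does not state defaults for ORCH_RESTART_BACKOFF_BASE_S or ORCH_RESTART_BACKOFF_MAX_S, and whether attempt starts at 0 is an assumption.

```python
from typing import Tuple


def restart_delay_s(attempt: int, base_s: float, max_s: float) -> float:
    """min(base * 2^attempt, max), with attempt assumed to start at 0."""
    return min(base_s * (2 ** attempt), max_s)


def record_check(health_failures: int, ok: bool, max_failures: int = 3) -> Tuple[int, bool]:
    """Return (new failure count, whether the engine should now be marked failed)."""
    if ok:
        return 0, False  # a 200 resets the counter
    failures = health_failures + 1
    return failures, failures >= max_failures
```

With illustrative values of base 5 s and cap 300 s, attempts 0 through 7 would wait 5, 10, 20, 40, 80, 160, 300, and 300 seconds: the delay doubles until it reaches the cap.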

Backends (backends/)

Two implementations of EngineBackend (an abstract base):
  • docker_backend.py — production. Creates Docker containers, mounts the data volume, attaches to the Docker network, runs the configured engine image.
  • subprocess_backend.py — dev/test. Spawns python -m uvicorn src.server:app as a local subprocess. No Docker.
The orchestrator picks via ORCH_ENGINE_BACKEND. See Backends.
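
A sketch of what selecting a backend by environment variable might look like. Only the EngineBackend name and the ORCH_ENGINE_BACKEND variable come from this page; the method names, the values "docker" and "subprocess", the default, and the in-memory stand-in class are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Mapping


class EngineBackend(ABC):
    """Abstract base; the method set here is hypothetical."""

    @abstractmethod
    def start(self, user_id: str, port: int) -> str:
        """Start an engine and return its URL."""

    @abstractmethod
    def stop(self, user_id: str) -> None:
        """Stop the engine for this user."""


class InMemoryBackend(EngineBackend):
    """Stand-in that records engines instead of starting containers or processes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running: Dict[str, int] = {}

    def start(self, user_id: str, port: int) -> str:
        self.running[user_id] = port
        return f"http://localhost:{port}"

    def stop(self, user_id: str) -> None:
        self.running.pop(user_id, None)


def pick_backend(env: Mapping[str, str]) -> EngineBackend:
    """Choose a backend from ORCH_ENGINE_BACKEND (value names are assumed)."""
    name = env.get("ORCH_ENGINE_BACKEND", "docker")
    if name not in ("docker", "subprocess"):
        raise ValueError(f"unknown backend: {name}")
    return InMemoryBackend(name)
```

Keeping the interface abstract is what lets lifecycle stay unaware of whether it is driving Docker or a local subprocess.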

Audit (audit.py)

Every lifecycle action writes a row to audit_log: action, actor, product/user/engine IDs, metadata, duration. Used for incident investigation and compliance reporting.
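
One way to capture action, actor, IDs, metadata, and duration in a single place is a context manager wrapped around each lifecycle action. This is a sketch only: the audited helper, the list used in place of the audit_log table, the duration_ms field name, and the "outcome" metadata key are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


@contextmanager
def audited(log: List[Dict[str, Any]], action: str, actor: str, **ids: str) -> Iterator[None]:
    """Append one audit row when the wrapped block finishes, whether or not it raised."""
    started = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # the caller still sees the failure
    finally:
        log.append({
            "action": action,
            "actor": actor,
            **ids,  # e.g. product_id, user_id, engine_id
            "metadata": {"outcome": outcome},
            "duration_ms": round((time.monotonic() - started) * 1000, 3),
        })
```

Usage would look like `with audited(log, "engine.start", "product:demo", user_id="u-1"): ...`, so that failed actions are recorded as well as successful ones.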

Metrics (metrics.py)

Aggregates over the registry and audit log:
  • GET /status — count by state, unhealthy count, overall health.
  • GET /metrics — provisions, crashes, restarts in the last hour; average boot time; lifetime totals.
Stateless reads; no metric store of its own.
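
A stateless aggregate of this kind can be computed directly from registry rows. The sketch below assumes each row is a mapping with status and health_failures fields and treats any engine with a non-zero failure count as unhealthy; the response shape and the status_summary name are hypothetical, not the actual GET /status payload.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def status_summary(engines: Iterable[Mapping[str, Any]]) -> dict:
    """Count engines by state and derive an overall health flag."""
    rows = list(engines)
    by_state = Counter(row["status"] for row in rows)
    # "Unhealthy" here means at least one failed health check (an assumption).
    unhealthy = sum(1 for row in rows if row["health_failures"] > 0)
    return {
        "by_state": dict(by_state),
        "unhealthy": unhealthy,
        "healthy": unhealthy == 0,
    }
```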

What state lives where

| State | Where | Lifetime |
| --- | --- | --- |
| Products | products table | Forever |
| Engines | engines table | Until destroyed |
| Port allocations | port_allocations table | Until engine destroyed |
| Audit | audit_log table | Forever (until pruned) |
| Rate-limit counters | In-memory | Process lifetime |
| Engine API keys (plaintext) | Never stored | Returned to caller, then encrypted |
| Engine API keys (encrypted) | engines.engine_key_enc | Until destroyed |
| Engine API keys (hashed) | engines.engine_key_hash | Until destroyed |
The plaintext platform API key is never stored in the orchestrator; only its hash is kept. The plaintext engine API key exists only at provision time (returned to the caller) and at start time (decrypted with ORCH_MASTER_KEY and returned to the caller).

What runs where

| Container | Image | Purpose |
| --- | --- | --- |
| postgres | postgres:16-alpine | Orchestrator state. |
| orchestrator | bap-engine prod image | The FastAPI process. |
| engine containers | september-engine:<version> | Per-user Engine instances. Started/stopped on demand. |
The orchestrator and engines share a Docker network (engine_net) so the orchestrator can reach engine /health endpoints by container hostname.

Where it doesn’t go

Some choices the orchestrator deliberately doesn’t make:
  • Multi-region. All engines run on one host. For multi-region, you run multiple bap-engine deployments, one per region.
  • Migration. No live migration of an engine from one host to another. To move a user, destroy the source engine and provision a fresh one (their brain is in the volume; mount it on the new host).
  • Cross-product user reuse. A user_id is per-product. The orchestrator does not know that user u-123 in product A is the same human as user u-123 in product B.

See also