Architecture

bap-engine is a single FastAPI process backed by PostgreSQL. It speaks HTTP to products on one side and the Docker API to Engine containers on the other. This page maps what’s inside.

The picture

The components

HTTP API (`server.py`)

A FastAPI app with five route groups:

Engine lifecycle: POST /engines/provision, DELETE /engines/{user_id}, POST /engines/{user_id}/{start,stop,rotate-key}.
Discovery + admission: GET /engines/{user_id}, POST /engines/{user_id}/admit.
Fleet: GET /engines, GET /status, GET /metrics.
Admin: POST /products/register, PUT /products/{product_id}/policy. Requires X-Admin-Key.
Health: GET /health (no auth).

Auth is enforced at the route level by _get_product(), which hashes X-Platform-Key, looks up the matching product, and rejects unknown keys.
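As a rough illustration of that lookup, here is a minimal sketch using only the standard library. The SHA-256 choice, the `Product` fields, the `UnknownKey` exception, and the in-memory product list are assumptions for the example; the source only says the key is hashed and matched against a product.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Product:
    slug: str
    key_hash: str  # hex digest of the platform key; the plaintext is not kept


def hash_key(plaintext: str) -> str:
    # Assumption: SHA-256. The real hashing scheme is not stated in this page.
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


class UnknownKey(Exception):
    """Raised when no product matches the presented key (hypothetical name)."""


def get_product(products: List[Product], platform_key: Optional[str]) -> Product:
    """Resolve an X-Platform-Key header value to a product, or reject it."""
    if not platform_key:
        raise UnknownKey("missing X-Platform-Key")
    presented = hash_key(platform_key)
    for product in products:
        # Compare digests in constant time.
        if hmac.compare_digest(product.key_hash, presented):
            return product
    raise UnknownKey("unknown platform key")


# Stand-in for the products table.
products = [Product(slug="demo", key_hash=hash_key("pk_demo_123"))]
```

In a FastAPI app a function like this would typically sit behind a route dependency that reads the header and turns the rejection into a 401 or 403 response.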

Registry (`registry.py`)

The CRUD layer over Postgres. Two main types:

Product: a tenant on the orchestrator. Has a slug, a hashed platform API key, and a JSON policy.
Engine: one user’s Engine instance. Has a status, a port, a hashed engine API key, and a Fernet-encrypted copy of that key.

Registry methods touch only the database: they make no Docker calls and write no audit entries. Higher layers compose them.

Lifecycle (`lifecycle.py`)

The state machine for an Engine. Valid states:

provisioning → running → sleeping → running → stopped → destroying
                ↓                              ↑
              failed ──── (auto-restart) ─────┘

Each transition makes the corresponding call on the configured backend, updates the registry, and writes an audit entry. Lifecycle is the only layer allowed to call into the backend; every other layer goes through lifecycle.
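One way to picture this is a transition table plus a single function that performs the three steps in order. The sketch below is a literal reading of the diagram, with the auto-restart edge read as failed → running; that reading, the `ALLOWED` table, and the callback parameters are assumptions, and the real table in lifecycle.py may contain more edges.

```python
from enum import Enum


class State(str, Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    SLEEPING = "sleeping"
    STOPPED = "stopped"
    FAILED = "failed"
    DESTROYING = "destroying"


# Assumed transition table, derived from the diagram only.
ALLOWED = {
    State.PROVISIONING: {State.RUNNING},
    State.RUNNING: {State.SLEEPING, State.STOPPED, State.FAILED},
    State.SLEEPING: {State.RUNNING},
    State.STOPPED: {State.DESTROYING},
    State.FAILED: {State.RUNNING},  # auto-restart
    State.DESTROYING: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested transition is not in the table."""


def transition(current, target, backend_call, registry_update, audit_write):
    """Validate, then run backend call, registry update, audit write, in order."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    backend_call(target)
    registry_update(target)
    audit_write(current, target)
    return target
```

Passing the three side effects in as callables keeps the sketch runnable without Docker or Postgres.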

Discovery (`discovery.py`)

The product’s main entry point goes through here: POST /engines/{user_id}/admit calls discovery.admit(), which:

Asks policy.check_admit() — rate-limit + quota.
Looks up the engine in the registry.
If none and auto_provision=true and policy allows, provisions.
If sleeping and auto_wake=true, wakes.
Returns {admitted: true, engine: {url, api_key, ...}} or {admitted: false, ...}.
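The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the actual discovery.admit(): the `AdmitDeps` container, the `reason` field, the example engine URL, and the placeholder key are all hypothetical, and provisioning is reduced to creating an in-memory record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Engine:
    url: str
    api_key: str
    status: str = "running"


@dataclass
class AdmitDeps:
    """Hypothetical collaborators, injected so the flow runs standalone."""
    check_admit: Callable[[str], bool]      # rate limit + quota
    may_provision: Callable[[str], bool]    # does policy allow a new engine?
    engines: Dict[str, Engine] = field(default_factory=dict)


def admit(deps, product_id, user_id, auto_provision=True, auto_wake=True):
    # 1. Policy check first, before touching the registry.
    if not deps.check_admit(product_id):
        return {"admitted": False, "reason": "policy"}
    # 2. Registry lookup.
    engine = deps.engines.get(user_id)
    # 3. Auto-provision when there is no engine and policy allows it.
    if engine is None:
        if not (auto_provision and deps.may_provision(product_id)):
            return {"admitted": False, "reason": "no_engine"}
        engine = Engine(url=f"http://engine-{user_id}:8000", api_key="ek_example")
        deps.engines[user_id] = engine
    # 4. Auto-wake a sleeping engine.
    if engine.status == "sleeping":
        if not auto_wake:
            return {"admitted": False, "reason": "sleeping"}
        engine.status = "running"
    if engine.status != "running":
        return {"admitted": False, "reason": engine.status}
    # 5. Hand the caller what it needs to talk to the engine.
    return {"admitted": True, "engine": {"url": engine.url, "api_key": engine.api_key}}
```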

Discovery is the only layer that’s allowed to auto-provision. Direct calls to /engines/provision are explicit and skip the auto-policy.

Policy (`policy.py`)

Two checks:

Quota. A product’s policy carries max_engines: int. Provision fails if the count would exceed it.
Rate limit. A product’s policy carries rate_limit_rpm: int. Admit fails if the per-product per-minute rate would exceed it.

Policy is implemented by PolicyEngine, which reads the product’s policy JSON. There is no external store such as Redis: state lives in the registry plus an in-memory rate counter, so rate-limit counts reset when the process restarts.
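A minimal sketch of both checks follows. The fixed one-minute window, the `check_provision` method name, and the injectable clock are assumptions made for the example; the page does not say which rate-limiting algorithm the real counter uses.

```python
import time
from collections import defaultdict


class PolicyEngine:
    """Quota and rate-limit checks over a product's policy dict (sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # product_id -> (window index, requests seen in that window)
        self._windows = defaultdict(lambda: (0, 0))

    def check_provision(self, policy, current_engines):
        # Quota: one more engine must not exceed max_engines.
        return current_engines + 1 <= policy["max_engines"]

    def check_admit(self, product_id, policy):
        # Rate limit: fixed 60-second windows, counted per product.
        window = int(self._clock() // 60)
        seen_window, count = self._windows[product_id]
        if seen_window != window:
            count = 0  # a new minute started; reset the counter
        if count + 1 > policy["rate_limit_rpm"]:
            self._windows[product_id] = (window, count)
            return False
        self._windows[product_id] = (window, count + 1)
        return True
```

Because the counter lives in the process, running more than one orchestrator process would give each its own independent limit.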

Health (`health.py`)

A background loop that runs every ORCH_HEALTH_CHECK_INTERVAL_S (default 30 s). For each running engine, in parallel up to a semaphore of 50:

GET /health against the engine’s URL with ORCH_HEALTH_CHECK_TIMEOUT_S (default 10 s).
On 200, reset health_failures, update last_health_at.
On failure, increment health_failures. After ORCH_HEALTH_MAX_FAILURES (default 3), mark the engine failed.
Schedule auto-restart with exponential backoff: min(ORCH_RESTART_BACKOFF_BASE_S * 2^attempt, ORCH_RESTART_BACKOFF_MAX_S).
After ORCH_RESTART_MAX_ATTEMPTS (default 8), give up. The engine stays failed until ops intervenes.
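To make the backoff formula concrete, the sketch below computes the delay before each restart attempt. The base of 5 s and cap of 300 s in the usage note are illustrative values, not documented defaults, and numbering attempts from 0 is an assumption.

```python
def restart_delay(attempt: int, base_s: float, max_s: float) -> float:
    """min(base * 2^attempt, max), with attempts numbered from 0 (assumed)."""
    return min(base_s * 2 ** attempt, max_s)


def restart_schedule(base_s: float, max_s: float, max_attempts: int = 8):
    """Delays for every attempt before the orchestrator gives up."""
    return [restart_delay(attempt, base_s, max_s) for attempt in range(max_attempts)]
```

With a hypothetical base of 5 s and cap of 300 s, `restart_schedule(5.0, 300.0)` yields 5, 10, 20, 40, 80, 160, 300, 300 seconds: the delay doubles until it reaches the cap, then stays there.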

Health is what makes the fleet self-healing.
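The bounded parallelism of a single sweep can be sketched with an asyncio semaphore. The `sweep` function and the injected `probe` coroutine are hypothetical; a real probe would issue the GET /health request, which is left out here so the example runs without a network.

```python
import asyncio


async def sweep(engine_urls, probe, max_concurrency=50, timeout_s=10.0):
    """Probe every engine, at most max_concurrency at a time.

    Returns {url: healthy}. A probe that raises or exceeds timeout_s
    counts as a failure.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def check(url):
        async with sem:
            try:
                healthy = await asyncio.wait_for(probe(url), timeout=timeout_s)
            except Exception:  # includes timeouts and connection errors
                healthy = False
            return url, bool(healthy)

    results = await asyncio.gather(*(check(url) for url in engine_urls))
    return dict(results)
```

The caller would then reset or increment health_failures per engine based on the returned map.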

Backends (`backends/`)

Two implementations of EngineBackend (an abstract base):

docker_backend.py — production. Creates Docker containers, mounts the data volume, attaches to the Docker network, runs the configured engine image.
subprocess_backend.py — dev/test. Spawns python -m uvicorn src.server:app as a local subprocess. No Docker.

The orchestrator picks via ORCH_ENGINE_BACKEND. See Backends.
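The shape of that arrangement might look like the sketch below. Only `EngineBackend` and `ORCH_ENGINE_BACKEND` come from this page; the method names, the class names `DockerBackend` and `SubprocessBackend`, the accepted values "docker" and "subprocess", and the default are assumptions. The stub bodies return descriptive strings instead of touching Docker or spawning processes.

```python
import os
from abc import ABC, abstractmethod


class EngineBackend(ABC):
    """Interface lifecycle calls into (method names are hypothetical)."""

    @abstractmethod
    def start(self, engine_id: str) -> str: ...

    @abstractmethod
    def stop(self, engine_id: str) -> str: ...


class DockerBackend(EngineBackend):
    def start(self, engine_id):
        return f"docker: start container for {engine_id}"

    def stop(self, engine_id):
        return f"docker: stop container for {engine_id}"


class SubprocessBackend(EngineBackend):
    def start(self, engine_id):
        return f"subprocess: spawn uvicorn for {engine_id}"

    def stop(self, engine_id):
        return f"subprocess: terminate process for {engine_id}"


BACKENDS = {"docker": DockerBackend, "subprocess": SubprocessBackend}


def pick_backend(env=os.environ) -> EngineBackend:
    # Assumed default; the documented default is not stated here.
    name = env.get("ORCH_ENGINE_BACKEND", "docker")
    if name not in BACKENDS:
        raise ValueError(f"unknown ORCH_ENGINE_BACKEND: {name!r}")
    return BACKENDS[name]()
```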

Audit (`audit.py`)

Every lifecycle action writes a row to audit_log: action, actor, product/user/engine IDs, metadata, duration. Used for incident investigation and compliance reporting.
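A context manager is one convenient way to capture the duration along with the other fields. In this sketch the `audited` helper, the `duration_ms` key, and the list standing in for the audit_log table are hypothetical, and writing a row even when the action raises is an assumption about the desired behavior.

```python
import time
from contextlib import contextmanager


@contextmanager
def audited(log, action, actor, **ids):
    """Time a lifecycle action and append one audit row when it finishes.

    ids carries product/user/engine identifiers; the yielded dict is the
    row's metadata, which the caller may fill in during the action.
    """
    started = time.monotonic()
    row = {"action": action, "actor": actor, **ids, "metadata": {}}
    try:
        yield row["metadata"]
    finally:
        # Runs on success and on failure, so failed actions are recorded too.
        row["duration_ms"] = round((time.monotonic() - started) * 1000, 3)
        log.append(row)
```

Usage would look like `with audited(log, "engine.start", "product:demo", user_id="u-1") as meta: ...`, with the body performing the backend call and registry update.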

Metrics (`metrics.py`)

Aggregates over the registry and audit log:

GET /status — count by state, unhealthy count, overall health.
GET /metrics — provisions, crashes, restarts in the last hour; average boot time; lifetime totals.

Stateless reads; no metric store of its own.

What state lives where

State	Where	Lifetime
Products	`products` table	Forever
Engines	`engines` table	Until destroyed
Port allocations	`port_allocations` table	Until engine destroyed
Audit	`audit_log` table	Forever (until pruned)
Rate-limit counters	In-memory	Process lifetime
Engine API keys (plaintext)	Never stored	Returned to caller, then encrypted
Engine API keys (encrypted)	`engines.engine_key_enc`	Until destroyed
Engine API keys (hashed)	`engines.engine_key_hash`	Until destroyed

The plaintext platform API key is never stored by the orchestrator; only its hash is kept. The plaintext engine API key only exists at provision time (returned to the caller) and at start time (returned to the caller after decryption with ORCH_MASTER_KEY).
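The three forms of the engine key can be illustrated with the `cryptography` package's Fernet API. This is a sketch under assumptions: SHA-256 for the hash, a key generated with `secrets.token_urlsafe`, and ORCH_MASTER_KEY being usable directly as a Fernet key are all guesses, and the helper names are hypothetical.

```python
import hashlib
import secrets

from cryptography.fernet import Fernet


def provision_key(master_key: bytes):
    """Create an engine key; return (plaintext for the caller, row to store)."""
    plaintext = secrets.token_urlsafe(32)
    record = {
        # Hash: lets the key be verified without being recoverable.
        "engine_key_hash": hashlib.sha256(plaintext.encode()).hexdigest(),
        # Ciphertext: lets the orchestrator hand the key back later.
        "engine_key_enc": Fernet(master_key).encrypt(plaintext.encode()),
    }
    return plaintext, record


def recover_key(master_key: bytes, record: dict) -> str:
    """Decrypt the stored key, as would happen at start time."""
    return Fernet(master_key).decrypt(record["engine_key_enc"]).decode()


# Stand-in for ORCH_MASTER_KEY; a real deployment would load it from config.
master_key = Fernet.generate_key()
```

Note that the plaintext appears only in the return value of `provision_key` and `recover_key`; the stored record never contains it.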

What runs where

Container	Image	Purpose
`postgres`	`postgres:16-alpine`	Orchestrator state.
`orchestrator`	bap-engine prod image	The FastAPI process.
`engine` containers	`september-engine:<version>`	Per-user Engine instances. Started/stopped on demand.

The orchestrator and engines share a Docker network (engine_net) so the orchestrator can reach engine /health endpoints by container hostname.

Where it doesn’t go

Some choices the orchestrator deliberately doesn’t make:

Multi-region. All engines run on one host. For multi-region, you run multiple bap-engine deployments, one per region.
Migration. No live migration of an engine from one host to another. To move a user, destroy the source engine and provision a fresh one (their brain is in the volume; mount it on the new host).
Cross-product user reuse. A user_id is per-product. The orchestrator does not know that user u-123 in product A is the same human as user u-123 in product B.
