Engines fail. Containers crash. A network blip drops a probe. The orchestrator runs a background health loop that detects failures quickly and restarts with exponential backoff. This page covers the loop and how to tune it.

The loop

Configuration

The loop is tuned via env vars. Defaults are sensible for most deployments; tune for special cases.
Variable | Default | Effect
ORCH_HEALTH_CHECK_INTERVAL_S | 30 | How often the loop sweeps every engine. Lower = faster failure detection, more probe traffic.
ORCH_HEALTH_CHECK_TIMEOUT_S | 10 | Per-probe timeout. Should be < interval.
ORCH_HEALTH_MAX_FAILURES | 3 | Consecutive failed probes before marking failed. Higher = tolerates more flakes.
ORCH_RESTART_BACKOFF_BASE_S | 5 | First retry waits this long. Doubles each attempt.
ORCH_RESTART_BACKOFF_MAX_S | 300 | Cap on backoff (5 min).
ORCH_RESTART_MAX_ATTEMPTS | 8 | Give up auto-restart after this many failures.
ORCH_BOOT_TIMEOUT_S | 60 | Provision-time boot deadline (separate from steady-state).
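The loop reads these from the environment. A minimal sketch of equivalent parsing (HealthLoopConfig and cfg are illustrative names, not the orchestrator's actual settings class):
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class HealthLoopConfig:
    # Defaults mirror the table above; each field can be overridden via its env var.
    check_interval_s: float = float(os.getenv("ORCH_HEALTH_CHECK_INTERVAL_S", "30"))
    check_timeout_s: float = float(os.getenv("ORCH_HEALTH_CHECK_TIMEOUT_S", "10"))
    max_failures: int = int(os.getenv("ORCH_HEALTH_MAX_FAILURES", "3"))
    backoff_base_s: float = float(os.getenv("ORCH_RESTART_BACKOFF_BASE_S", "5"))
    backoff_max_s: float = float(os.getenv("ORCH_RESTART_BACKOFF_MAX_S", "300"))
    max_restart_attempts: int = int(os.getenv("ORCH_RESTART_MAX_ATTEMPTS", "8"))
    boot_timeout_s: float = float(os.getenv("ORCH_BOOT_TIMEOUT_S", "60"))

cfg = HealthLoopConfig()
assert cfg.check_timeout_s < cfg.check_interval_s, "per-probe timeout should stay below the sweep interval"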

What “healthy” means

The orchestrator probes GET /health on the engine. The engine must return:
{ "status": "ok", ... }
Any of these counts as a failure:
  • HTTP timeout (connection or read).
  • Non-2xx status code.
  • 200 status but status != "ok" in the body.
The engine’s /health is implemented in engine/src/server.py. It checks: database reachable, LLM provider reachable (last call succeeded), Asset Directory reachable. If any subsystem is down, the body’s status is non-"ok" and the orchestrator counts it as a failure.
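A minimal sketch of the orchestrator-side classification, assuming an httpx client (probe_engine is an illustrative name, not the orchestrator's actual code):
import httpx

async def probe_engine(base_url: str, timeout_s: float) -> bool:
    # Applies the failure rules above: timeout, non-2xx, or a 2xx body whose
    # status field is anything other than "ok" all count as a failed probe.
    try:
        async with httpx.AsyncClient(timeout=timeout_s) as client:
            resp = await client.get(f"{base_url}/health")
    except httpx.HTTPError:          # connect/read timeouts, refused connections, ...
        return False
    if not (200 <= resp.status_code < 300):
        return False
    try:
        body = resp.json()
    except ValueError:
        return False
    return isinstance(body, dict) and body.get("status") == "ok"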

What “failed” means

After ORCH_HEALTH_MAX_FAILURES consecutive failures, the orchestrator:
  1. Sets status='failed' in the registry.
  2. Writes an audit row with action health_failed.
  3. Schedules an auto-restart for BACKOFF_BASE * 2^0 seconds later (5s by default).
A failed engine is invisible to the product until the product asks for it: GET /admit returns admitted: false with reason engine_unhealthy. If auto_provision=true is passed and policy allows, a new engine takes its place; otherwise the product sees the failure.
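A minimal sketch of the bookkeeping behind those three steps (EngineState and handle_probe_result are hypothetical names; the actual registry is persistent, this only mirrors the fields the loop touches):
from dataclasses import dataclass

@dataclass
class EngineState:
    # Hypothetical in-memory view of one engine's registry row.
    id: str
    status: str = "running"
    health_failures: int = 0

def handle_probe_result(engine: EngineState, ok: bool, max_failures: int = 3) -> bool:
    # Count consecutive failures; return True when the caller should write the
    # health_failed audit row and schedule the first restart, backoff base seconds later.
    if ok:
        engine.health_failures = 0
        return False
    engine.health_failures += 1
    if engine.health_failures >= max_failures:
        engine.status = "failed"
        return True
    return False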

Auto-restart

The orchestrator restarts a failed engine by calling the backend’s start(). For Docker, this is docker.containers.start(). After each restart attempt:
  • If GET /health returns ok within ORCH_HEALTH_CHECK_TIMEOUT_S, reset to status='running', health_failures=0. Audit auto_restart_success.
  • If the restart itself fails, or /health keeps failing, increment the attempt counter and schedule the next retry with doubled backoff.
After ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts, the orchestrator stops trying. The engine stays failed. Audit auto_restart_gave_up. Ops alerts should fire here. The total elapsed time before giving up:
sum(min(5*2^i, 300) for i in range(8))
= 5 + 10 + 20 + 40 + 80 + 160 + 300 + 300
= 915 seconds (~15 min)
So a truly broken engine spends about 15 minutes thrashing before the orchestrator quits. Tune ORCH_RESTART_MAX_ATTEMPTS down for smaller fleets where you'd rather page sooner; tune it up for large fleets where a transient provider outage would otherwise leave half the fleet permanently failed.
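Compressed into one coroutine, the retry policy looks roughly like this (restart_with_backoff, start_engine, and probe are illustrative names; the real loop schedules each retry from the health sweep rather than sleeping in a single task):
import asyncio
from typing import Awaitable, Callable

async def restart_with_backoff(
    start_engine: Callable[[], None],          # e.g. the Docker backend's start call
    probe: Callable[[], Awaitable[bool]],      # e.g. the probe_engine sketch above
    base_s: float = 5,
    max_s: float = 300,
    attempts: int = 8,
) -> bool:
    # Returns True if the engine came back (audit auto_restart_success),
    # False once all attempts are exhausted (audit auto_restart_gave_up).
    for i in range(attempts):
        await asyncio.sleep(min(base_s * 2 ** i, max_s))   # 5, 10, 20, ... capped at max_s
        try:
            start_engine()
            if await probe():
                return True
        except Exception:
            pass   # a failed start counts the same as a failed post-restart probe
    return False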

What auto-restart fixes

  • Container OOM kills.
  • LLM provider transient outage (the engine’s /health will go ok again once the provider recovers).
  • Brief network partitions.
  • Engines that crashed during a deploy and haven’t been bumped.

What it doesn’t fix

  • Misconfiguration. If LLM_API_KEY is wrong, restarting won’t help.
  • Disk full. Restart hits the same disk error.
  • Deeper bugs that surface every time. Restart loops them.
For these, the auto-restart eventually gives up. The audit table tells you the engine is hopelessly stuck; you investigate.

Concurrency

The health loop runs all probes for one tick in parallel, bounded by a semaphore of 50. For larger fleets, the worst-case per-tick wall time is roughly (num_engines / 50) × ORCH_HEALTH_CHECK_TIMEOUT_S. With 1000 engines and a 10s timeout, that's 200 seconds per sweep, which is uncomfortable when the interval is 30s. If you run more than 500 engines, raise the semaphore (a code change today; config in a future version) or increase the interval to give the loop time to finish.
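A sketch of one bounded sweep (PROBE_CONCURRENCY and sweep are illustrative names; probe_engine is the sketch from earlier):
import asyncio

PROBE_CONCURRENCY = 50   # the loop's semaphore; a code constant today, config in a future version

async def sweep(engine_urls: list[str], timeout_s: float = 10) -> dict[str, bool]:
    # One tick of the health loop: probe every engine, at most PROBE_CONCURRENCY at a time.
    sem = asyncio.Semaphore(PROBE_CONCURRENCY)

    async def guarded(url: str) -> tuple[str, bool]:
        async with sem:
            return url, await probe_engine(url, timeout_s)   # see the probe sketch above

    return dict(await asyncio.gather(*(guarded(u) for u in engine_urls)))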

Tuning by use case

Demo / small fleet

ORCH_HEALTH_CHECK_INTERVAL_S=10
ORCH_HEALTH_MAX_FAILURES=2
ORCH_RESTART_MAX_ATTEMPTS=5
Fast detection, fast give-up. Engineers see failures within seconds.
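As a rough check of detection latency with these settings (assuming probes return quickly):
ORCH_HEALTH_CHECK_INTERVAL_S * ORCH_HEALTH_MAX_FAILURES
= 10 * 2
= 20 seconds, worst case, before an engine is marked failed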

Production / large fleet

ORCH_HEALTH_CHECK_INTERVAL_S=30
ORCH_HEALTH_MAX_FAILURES=3
ORCH_RESTART_BACKOFF_BASE_S=10
ORCH_RESTART_MAX_ATTEMPTS=8
Tolerant of transient noise. Doesn’t thrash on provider blips.

Provider outage tolerance

ORCH_HEALTH_MAX_FAILURES=5
ORCH_RESTART_BACKOFF_MAX_S=600
ORCH_RESTART_MAX_ATTEMPTS=12
Survives a 30-minute provider outage without destroying engines. Slower to restart genuinely stuck engines.
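As a rough check of that claim (keeping the default ORCH_RESTART_BACKOFF_BASE_S=5 and ORCH_HEALTH_CHECK_INTERVAL_S=30), the retry window becomes:
sum(min(5*2^i, 600) for i in range(12))
= 5 + 10 + 20 + 40 + 80 + 160 + 320 + 600 + 600 + 600 + 600 + 600
= 3635 seconds (~60 min)
plus roughly 5 × 30 s ≈ 2.5 minutes of detection before the first restart, which comfortably covers a 30-minute provider outage.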

Observability

Watch:
  • engines.health_failures — sustained > 0 across many engines means a systemic issue (provider, network).
  • Audit auto_restart_* actions — high rate = thrashing.
  • engines.status='failed' count over time — should approach 0 in steady state.
Wire alerts to:
  • Restart rate > 5/min (Sev2).
  • Failed engines > 1% of fleet sustained 5 min (Sev2).
  • Single engine in failed for > 1 hour (Sev3).

See also