You can’t run what you can’t see. The Engine emits structured logs,
in-database observability events, and per-turn usage data. This page
covers what’s available and how to wire it into your observability
stack.
For the application-developer view (consuming events client-side), see
Platform → Observability.
What the Engine emits
| Source | Format | Use |
|---|---|---|
| stdout/stderr | JSON-structured logs | Log shipping (ELK, Datadog, Honeycomb) |
| observability_events table | SQLite rows | Internal metrics, eventually exported |
| SSE stream (per-call) | events including usage, compaction_event | Per-call instrumentation |
| /health endpoint | JSON | Liveness probes |
Every log line is JSON:
```json
{
  "timestamp": "2026-04-27T12:34:56.123Z",
  "level": "INFO",
  "logger": "engine_core.coordinator",
  "request_id": "req-abc123",
  "task_id": "task-001",
  "message": "Starting execution",
  "extra": { ... }
}
```
Key fields:
- level — DEBUG, INFO, WARNING, ERROR. Set the floor with ENGINE_LOG_LEVEL.
- logger — the Python module that emitted the log. Useful for filtering.
- request_id — unique per HTTP request. Trace one request across many log lines (see the filtering sketch below).
- task_id — the task the request belongs to. Trace across requests in the same conversation.
- message — the human-readable summary.
- extra — structured fields specific to the log type.
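Since request_id appears on every line, tracing a request is just filtering the log stream. A minimal sketch, assuming one JSON log object per stdin line; the request id shown is the example value from above, and any JSON-aware log tool does the same job:

```python
# Sketch: print every log line belonging to one request from a JSON log
# stream on stdin. The request id is the example value from above.
import json
import sys

target = "req-abc123"

for line in sys.stdin:
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        continue  # ignore non-JSON lines
    if entry.get("request_id") == target:
        print(entry["timestamp"], entry["logger"], entry["message"])
```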
Log levels by environment
| Env | Recommended level |
|---|---|
| Local dev | DEBUG (when debugging), INFO otherwise |
| Staging | INFO |
| Production | INFO |
DEBUG logs include MCP request/response traces, model parameters, and
context-assembly steps. They're useful for one-off investigations, but
too noisy to leave enabled all the time.
Log shipping
The Engine writes to stdout. Ship from there.
Docker / docker-compose
```yaml
services:
  engine:
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
```
Then ship via your log driver of choice (Datadog Agent, Fluentd,
Vector, Promtail).
Kubernetes
Use a sidecar log collector or a node-level agent. Logs go to
/var/log/containers/; the agent picks them up and routes them onward.
Log retention
Engine logs are not durable on the local container’s filesystem.
Ship them to a backend that retains them for as long as you need. 30-day
retention is typical; some compliance scenarios require longer.
Metrics
The Engine doesn’t ship a Prometheus exporter today. Two paths to
metrics:
Mine the logs
Most useful metrics can be derived from the structured logs:
- engine_requests_total{path} — count of Starting execution log lines.
- engine_errors_total{code} — count of ERROR-level log lines by error code.
- engine_request_duration_seconds_bucket{path} — histogram from request start/end log lines.
Tools like Vector or Loki can extract metrics from logs as they ship.
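If you'd rather not run a log pipeline, the same counters can be derived directly. A minimal sketch, assuming one JSON log object per stdin line; the extra.path and extra.code fields are assumptions about the log schema, not documented guarantees:

```python
# Sketch: derive request/error counters from Engine JSON logs on stdin.
# Field names beyond level/message (extra.path, extra.code) are assumed.
import json
import sys
from collections import Counter

requests_total = Counter()   # engine_requests_total{path}
errors_total = Counter()     # engine_errors_total{code}

for line in sys.stdin:
    try:
        log = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip non-JSON lines
    extra = log.get("extra", {})
    if log.get("message") == "Starting execution":
        requests_total[extra.get("path", "unknown")] += 1
    if log.get("level") == "ERROR":
        errors_total[extra.get("code", "unknown")] += 1

print(dict(requests_total))
print(dict(errors_total))
```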
Read observability_events
The Engine writes structured events to the brain database:
```sql
SELECT event_type, COUNT(*)
FROM observability_events
WHERE created_at > datetime('now', '-1 hour')
GROUP BY event_type;
```
A sidecar can poll this table and push to your metrics backend.
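A minimal polling sidecar might look like the sketch below; the database path, poll interval, and output format are assumptions, and only the table plus the event_type / created_at columns come from the query above:

```python
# Sketch: poll observability_events and emit per-type counts for the last
# hour. BRAIN_DB is a hypothetical path; replace the print with a push to
# your metrics backend (StatsD, Prometheus pushgateway, ...).
import sqlite3
import time

BRAIN_DB = "/path/to/brain.db"  # hypothetical; point at your brain database

while True:
    conn = sqlite3.connect(BRAIN_DB)
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM observability_events "
        "WHERE created_at > datetime('now', '-1 hour') "
        "GROUP BY event_type"
    ).fetchall()
    conn.close()
    for event_type, count in rows:
        print(f"engine_observability_events_total{{event_type={event_type!r}}} {count}")
    time.sleep(60)
```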
Useful metrics to track
| Metric | Source | Why |
|---|---|---|
| Request rate | logs | Capacity planning |
| Error rate by code | logs | Health |
| P50/P99 latency | logs | UX |
| Token usage per turn | SSE usage events | Cost |
| Cache hit ratio | SSE usage.cache_hit_tokens | Cost optimization |
| Compaction rate | SSE compaction_event | Context-pressure indicator |
| HITL rate | SSE hitl_request | Permission-policy calibration |
| Tool call distribution | SSE tool_call | Behavior shifts |
| MCP failure rate | logs | Connector health |
| Brain size | filesystem | Storage capacity |
Traces
The Engine doesn’t emit OpenTelemetry traces today. The poor man’s
trace is request_id propagated through every log line — grep for the
ID to see the full request path.
For real distributed tracing:
- Instrument upstream of the Engine with OTel.
- Pass traceparent as a header.
- The Engine doesn't propagate this today; you'd need to add the propagation in the FastAPI middleware (a sketch follows below).
This is a known gap. OpenTelemetry support is on the roadmap.
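As an illustration only (the Engine does not ship this today), traceparent propagation in a FastAPI middleware could look roughly like this sketch; the app object and what you do with the value downstream are hypothetical:

```python
# Sketch only: not existing Engine code. Shows the general shape of
# propagating the W3C traceparent header through a FastAPI middleware.
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def propagate_traceparent(request: Request, call_next):
    # Read the W3C trace-context header from the incoming request.
    traceparent = request.headers.get("traceparent")
    if traceparent:
        # Stash it on request.state so handlers can attach it to outgoing
        # calls and include it in log lines.
        request.state.traceparent = traceparent
    response = await call_next(request)
    if traceparent:
        # Echo it back so callers can correlate the response.
        response.headers["traceparent"] = traceparent
    return response
```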
/health endpoint
```bash
curl -fsS "$ENGINE_URL/health"
```
```json
{
  "status": "ok",
  "uptime_seconds": 3600.5,
  "subsystems": {
    "database": "ok",
    "llm_provider": "ok",
    "asset_directory": "ok",
    "learning_centre": "ok"
  }
}
```
status: "ok" means all subsystems are reachable. Anything else
means at least one is degraded — read subsystems.
Wire to:
- Load balancer health probes.
- Container liveness/readiness probes.
- Status-page checks.
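A probe that also fails when an individual subsystem degrades could be a short script like the following sketch; the ENGINE_URL default, timeout, and exit-code convention are assumptions, while the response shape matches the /health example above:

```python
# Sketch: readiness-style check that exits non-zero if /health is not "ok"
# or any subsystem reports something other than "ok".
import json
import os
import sys
import urllib.request

url = os.environ.get("ENGINE_URL", "http://localhost:8000") + "/health"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        health = json.load(resp)
except OSError as exc:
    print(f"health check failed: {exc}", file=sys.stderr)
    sys.exit(1)

bad = {k: v for k, v in health.get("subsystems", {}).items() if v != "ok"}
if health.get("status") != "ok" or bad:
    print(f"degraded: {bad or health}", file=sys.stderr)
    sys.exit(1)
print("ok")
```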
Alerts
For production, alert on:
| Symptom | Threshold | Severity |
|---|---|---|
| /health non-ok | 1 minute | Sev1 |
| Error rate > 5% | 5 minutes | Sev2 |
| P99 latency > 30s | 5 minutes | Sev2 |
| Cache hit ratio drops > 30% from baseline | 15 minutes | Sev3 |
| Compaction rate > 50% of turns | 1 hour | Sev3 |
| MIGRATION_REQUIRED errors | any | Sev1 |
| LLM_RATE_LIMITED rate > 1% | 5 minutes | Sev2 |
| MCP failure rate > 10% (per server) | 15 minutes | Sev2 |
| Disk usage > 80% | 1 hour | Sev2 |
| Container OOM | any | Sev2 |
Wire each alert to your incident tooling so the on-call gets paged.
Dashboards
A useful Engine dashboard has:
Overview
- Request rate (line)
- Error rate by code (stacked area)
- P50/P99 latency (line)
- Active task count (line)
Cost
- Tokens per minute, input vs output (stacked area)
- Cost per minute (line)
- Cache hit ratio (line)
Behavior
- Tool call distribution (pie / table)
- HITL request rate (line)
- Compaction rate (line)
Health
- /health status (single stat)
- Subsystem status grid
- Brain size (line, slow-moving)
- MCP connection count by status (stacked bar)
Audit trail
For deployments with audit/compliance requirements, the
observability_events table is the source of truth. Periodic export
to a long-term store (S3, BigQuery) gives you a tamper-evident audit
trail.
Specifically captured:
- Permission decisions (allowed / denied / prompt fired).
- Migration runs.
- Catalog reloads.
- API key validation events (success / failure).
- MCP connection lifecycle.
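A periodic export job could be as small as the sketch below; the database path, output location, and the assumption that every column is safe to SELECT are illustrative, since only event_type and created_at are documented here:

```python
# Sketch: export the last day of observability_events to newline-delimited
# JSON for upload to long-term storage. Paths are hypothetical; the column
# set beyond event_type/created_at is an assumption about the schema.
import json
import sqlite3

BRAIN_DB = "/path/to/brain.db"                    # hypothetical
OUT_PATH = "/exports/observability_events.jsonl"  # hypothetical

conn = sqlite3.connect(BRAIN_DB)
conn.row_factory = sqlite3.Row
rows = conn.execute(
    "SELECT * FROM observability_events "
    "WHERE created_at > datetime('now', '-1 day')"
)
with open(OUT_PATH, "w") as out:
    for row in rows:
        out.write(json.dumps(dict(row), default=str) + "\n")
conn.close()
# Upload OUT_PATH to S3 / BigQuery with your tooling of choice.
```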
See also