You can’t run what you can’t see. The Engine emits structured logs, in-database observability events, and per-turn usage data. This page covers what’s available and how to wire it into your observability stack. For the application-developer view (consuming events client-side), see Platform → Observability.

What the Engine emits

Source | Format | Use
stdout/stderr | JSON-structured logs | Log shipping (ELK, Datadog, Honeycomb)
observability_events table | SQLite rows | Internal metrics, eventually exported
SSE stream (per-call) | events including usage, compaction_event | Per-call instrumentation
/health endpoint | JSON | Liveness probes

Log format

Every log line is JSON:
{
  "timestamp": "2026-04-27T12:34:56.123Z",
  "level": "INFO",
  "logger": "engine_core.coordinator",
  "request_id": "req-abc123",
  "task_id": "task-001",
  "message": "Starting execution",
  "extra": { ... }
}
Key fields:
  • level — DEBUG, INFO, WARNING, ERROR. Set the floor with ENGINE_LOG_LEVEL.
  • logger — the Python module that emitted the log. Useful for filtering.
  • request_id — unique per HTTP request. Trace one request across many log lines.
  • task_id — the task the request belongs to. Trace across requests in the same conversation.
  • message — the human-readable summary.
  • extra — structured fields specific to the log type.
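Because every line is self-describing JSON, tracing one request is just a filter on request_id. A minimal Python sketch (the field names are the ones documented above; the pipe-from-docker usage line is an assumption about how you access the stream):
import json
import sys

def trace_request(lines, request_id):
    """Yield every structured log record that belongs to one request."""
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if record.get("request_id") == request_id:
            yield record

# Example: docker logs engine | python trace_request.py req-abc123
if __name__ == "__main__":
    for record in trace_request(sys.stdin, sys.argv[1]):
        print(record["timestamp"], record["level"], record["message"])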

Log levels by environment

Env | Recommended level
Local dev | DEBUG (when debugging), INFO otherwise
Staging | INFO
Production | INFO
DEBUG logs include MCP request/response traces, model parameters, and context-assembly steps. They are useful for one-off investigations but too noisy to leave on permanently.

Log shipping

The Engine writes to stdout. Ship from there.

Docker / docker-compose

services:
  engine:
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
Then ship with the log agent of your choice (Datadog Agent, Fluentd, Vector, Promtail).

Kubernetes

Use a sidecar log collector or a node-level agent. Logs land in /var/log/containers/; the agent picks them up and routes them onward.

Log retention

Engine logs are not durable on the local container’s filesystem. Ship them to a backend that retains them for as long as you need. 30-day retention is typical; some compliance scenarios require longer.

Metrics

The Engine doesn’t ship a Prometheus exporter today. Two paths to metrics:

Mine the logs

Most useful metrics can be derived from the structured logs:
  • engine_requests_total{path} — count of Starting execution log lines.
  • engine_errors_total{code} — count of ERROR-level log lines by error code.
  • engine_request_duration_seconds_bucket{path} — histogram from request start/end log lines.
Tools like Vector or Loki can extract metrics from logs as they ship.
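As a concrete illustration of log mining, the sketch below derives the first two counters from a stream of log lines. The level and message fields are documented above; pulling the error code out of extra is an assumption, so adjust it to whatever your ERROR lines actually carry:
import json
from collections import Counter

def derive_counters(lines):
    """Count requests and errors from structured Engine log lines."""
    requests_total = 0
    errors_by_code = Counter()
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if record.get("message") == "Starting execution":
            requests_total += 1  # engine_requests_total
        if record.get("level") == "ERROR":
            code = record.get("extra", {}).get("code", "unknown")  # assumed field name
            errors_by_code[code] += 1  # engine_errors_total{code}
    return requests_total, errors_by_code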

Read observability_events

The Engine writes structured events to the brain database:
SELECT event_type, COUNT(*)
FROM observability_events
WHERE created_at > datetime('now', '-1 hour')
GROUP BY event_type;
A sidecar can poll this table and push to your metrics backend.
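A minimal poller along those lines, in Python. Only the table name and the two columns used in the query above are taken from the Engine; the database path and the print-instead-of-push step are placeholders for your own setup:
import sqlite3
import time

BRAIN_DB = "/data/brain.db"  # placeholder; point at your brain database file

def poll_event_counts(db_path, interval_seconds=60):
    """Poll observability_events and emit per-type counts every interval."""
    while True:
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)  # read-only, avoids write locks
        try:
            rows = conn.execute(
                "SELECT event_type, COUNT(*) FROM observability_events "
                "WHERE created_at > datetime('now', '-1 hour') "
                "GROUP BY event_type"
            ).fetchall()
        finally:
            conn.close()
        for event_type, count in rows:
            print(f"observability_events.{event_type} = {count}")  # push to your metrics backend here
        time.sleep(interval_seconds)

if __name__ == "__main__":
    poll_event_counts(BRAIN_DB)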

Useful metrics to track

Metric | Source | Why
Request rate | logs | Capacity planning
Error rate by code | logs | Health
P50/P99 latency | logs | UX
Token usage per turn | SSE usage events | Cost
Cache hit ratio | SSE usage.cache_hit_tokens | Cost optimization
Compaction rate | SSE compaction_event | Context-pressure indicator
HITL rate | SSE hitl_request | Permission-policy calibration
Tool call distribution | SSE tool_call | Behavior shifts
MCP failure rate | logs | Connector health
Brain size | filesystem | Storage capacity
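The SSE-sourced rows reduce to simple accumulation over usage events once you have them client-side. A sketch, assuming each parsed usage event is a dict with input_tokens, output_tokens, and cache_hit_tokens (only cache_hit_tokens is named above; the other field names and the ratio definition are assumptions):
def accumulate_usage(usage_events):
    """Sum per-turn token counts into totals for cost and cache-hit dashboards."""
    totals = {"input_tokens": 0, "output_tokens": 0, "cache_hit_tokens": 0}
    for event in usage_events:
        for key in totals:
            totals[key] += event.get(key, 0)
    read = totals["input_tokens"]
    totals["cache_hit_ratio"] = totals["cache_hit_tokens"] / read if read else 0.0
    return totals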

Traces

The Engine doesn’t emit OpenTelemetry traces today. The poor man’s trace is request_id propagated through every log line — grep for the ID to see the full request path. For real distributed tracing:
  1. Instrument upstream of the Engine with OTel.
  2. Pass traceparent as a header.
  3. The Engine doesn’t propagate this today; you’d need to add the propagation in the FastAPI middleware.
This is a known gap. OpenTelemetry support is on the roadmap.
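For steps 1 and 2, an upstream caller that is already OTel-instrumented can inject the W3C context like this. It is a sketch of the caller side only; the endpoint path is a placeholder, and as noted above the Engine will not propagate the injected traceparent further today:
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("engine-client")

def call_engine(engine_url, payload):
    """Wrap an Engine call in a span and pass W3C trace-context headers along."""
    with tracer.start_as_current_span("engine.request"):
        headers = {}
        inject(headers)  # adds traceparent (and tracestate) from the current span
        return httpx.post(f"{engine_url}/your-endpoint", json=payload, headers=headers)  # placeholder path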

/health endpoint

curl -fsS "$ENGINE_URL/health"
{
  "status": "ok",
  "uptime_seconds": 3600.5,
  "subsystems": {
    "database": "ok",
    "llm_provider": "ok",
    "asset_directory": "ok",
    "learning_centre": "ok"
  }
}
A status of "ok" means every subsystem is reachable; anything else means at least one is degraded, so inspect the subsystems map. Wire the endpoint to:
  • Load balancer health probes.
  • Container liveness/readiness probes.
  • Status-page checks.
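For a cron- or status-page-style check that is more informative than a bare curl, something like the following works against the response shape shown above (ENGINE_URL matches the curl example; the exit-code convention is ours, not the Engine’s):
import os
import sys
import httpx

def check_health(engine_url):
    """Return 0 if every subsystem is ok, 1 otherwise, printing the degraded ones."""
    resp = httpx.get(f"{engine_url}/health", timeout=5.0)
    body = resp.json()
    if resp.status_code == 200 and body.get("status") == "ok":
        return 0
    degraded = {k: v for k, v in body.get("subsystems", {}).items() if v != "ok"}
    print(f"engine degraded: status={body.get('status')} subsystems={degraded}")
    return 1

if __name__ == "__main__":
    sys.exit(check_health(os.environ["ENGINE_URL"]))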

Alerts

For production, alert on:
Symptom | Threshold | Severity
/health non-ok | 1 minute | Sev1
Error rate > 5% | 5 minutes | Sev2
P99 latency > 30s | 5 minutes | Sev2
Cache hit ratio drops > 30% from baseline | 15 minutes | Sev3
Compaction rate > 50% of turns | 1 hour | Sev3
MIGRATION_REQUIRED errors | any | Sev1
LLM_RATE_LIMITED rate > 1% | 5 minutes | Sev2
MCP failure rate > 10% (per server) | 15 minutes | Sev2
Disk usage > 80% | 1 hour | Sev2
Container OOM | any | Sev2
Wire each alert to your incident tooling so the on-call gets paged.

Dashboards

A useful Engine dashboard has:

Overview

  • Request rate (line)
  • Error rate by code (stacked area)
  • P50/P99 latency (line)
  • Active task count (line)

Cost

  • Tokens per minute, input vs output (stacked area)
  • Cost per minute (line)
  • Cache hit ratio (line)

Behavior

  • Tool call distribution (pie / table)
  • HITL request rate (line)
  • Compaction rate (line)

Health

  • /health status (single stat)
  • Subsystem status grid
  • Brain size (line, slow-moving)
  • MCP connection count by status (stacked bar)

Audit trail

For deployments with audit/compliance requirements, the observability_events table is the source of truth. Periodic export to a long-term store (S3, BigQuery) gives you a tamper-evident audit trail. The table specifically captures:
  • Permission decisions (allowed / denied / prompt fired).
  • Migration runs.
  • Catalog reloads.
  • API key validation events (success / failure).
  • MCP connection lifecycle.
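A periodic export can be as small as dumping new rows to newline-delimited JSON and shipping the file to object storage. The sketch assumes only the table name and created_at column shown earlier; other columns are discovered at runtime rather than hard-coded:
import json
import sqlite3

def export_events(db_path, since, out_path):
    """Dump observability_events rows newer than `since` to newline-delimited JSON."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.row_factory = sqlite3.Row  # rows become dict-like, so no schema is hard-coded
    try:
        rows = conn.execute(
            "SELECT * FROM observability_events WHERE created_at > ? ORDER BY created_at",
            (since,),
        ).fetchall()
    finally:
        conn.close()
    with open(out_path, "w") as f:
        for row in rows:
            f.write(json.dumps(dict(row), default=str) + "\n")
    return len(rows)  # upload out_path to S3 / load into BigQuery from here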

See also