Observability

You can’t run what you can’t see. The Engine emits structured logs, in-database observability events, and per-turn usage data. This page covers what’s available and how to wire it into your observability stack. For the application-developer view (consuming events client-side), see Platform → Observability.

What the Engine emits

Source	Format	Use
stdout/stderr	JSON-structured logs	Log shipping (ELK, Datadog, Honeycomb)
`observability_events` table	SQLite rows	Internal metrics, eventually exported
SSE stream (per-call)	events including `usage`, `compaction_event`	Per-call instrumentation
`/health` endpoint	JSON	Liveness probes

Log format

Every log line is JSON:

{
  "timestamp": "2026-04-27T12:34:56.123Z",
  "level": "INFO",
  "logger": "engine_core.coordinator",
  "request_id": "req-abc123",
  "task_id": "task-001",
  "message": "Starting execution",
  "extra": { ... }
}

Key fields:

level — DEBUG, INFO, WARNING, ERROR. Set the floor with ENGINE_LOG_LEVEL.
logger — the Python module that emitted the log. Useful for filtering.
request_id — unique per HTTP request. Trace one request across many log lines.
task_id — the task the request belongs to. Trace across requests in the same conversation.
message — the human-readable summary.
extra — structured fields specific to the log type.
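
Because every line is one JSON object carrying `request_id`, following a single request through shipped or local logs takes only a few lines of code. The sketch below is illustrative and not part of the Engine; the helper name `iter_request_logs` and the sample lines are assumptions, and only the field names come from the format above.

```python
import json
from typing import Iterable, Iterator


def iter_request_logs(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log records whose request_id matches, in input order."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Non-JSON output (for example a stray traceback) is skipped.
            continue
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record


# Hypothetical sample lines shaped like the log format above.
sample = [
    '{"timestamp": "2026-04-27T12:34:56.123Z", "level": "INFO", '
    '"logger": "engine_core.coordinator", "request_id": "req-abc123", '
    '"task_id": "task-001", "message": "Starting execution", "extra": {}}',
    "not json",
    '{"timestamp": "2026-04-27T12:34:57.000Z", "level": "ERROR", '
    '"logger": "engine_core.coordinator", "request_id": "req-other", '
    '"task_id": "task-002", "message": "Failed", "extra": {}}',
]
matches = list(iter_request_logs(sample, "req-abc123"))
```

Swapping the comparison to `task_id` gives the conversation-level view across several requests.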

Log levels by environment

Env	Recommended level
Local dev	`DEBUG` (when debugging), `INFO` otherwise
Staging	`INFO`
Production	`INFO`

DEBUG logs include MCP request/response traces, model parameters, and context-assembly steps. They are useful for one-off investigations but too noisy to leave on permanently.

Log shipping

The Engine writes to stdout. Ship from there.

Docker / docker-compose

services:
  engine:
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"

Then ship via your log driver of choice (Datadog Agent, Fluentd, Vector, Promtail).

Kubernetes

Use a sidecar log collector or a node-level agent. Container stdout lands on the node under /var/log/containers/; the agent picks the files up from there and routes them to your backend.

Log retention

Engine logs are not durable on the local container’s filesystem. Ship them to a backend that retains for as long as you need. 30-day retention is typical; some compliance scenarios require longer.

Metrics

The Engine doesn’t ship a Prometheus exporter today. Two paths to metrics:

Mine the logs

Most useful metrics can be derived from the structured logs:

engine_requests_total{path} — count of Starting execution log lines.
engine_errors_total{code} — count of ERROR-level log lines by error code.
engine_request_duration_seconds_bucket{path} — histogram from request start/end log lines.

Tools like Vector or Loki can extract metrics from logs as they ship.
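
As a rough illustration of what such an extraction does, the sketch below counts requests and errors from raw log lines. It assumes the request path and error code live in `extra` under the keys `path` and `code`; those key names, the `/v1/run` path, and the function name are hypothetical, so check them against your actual log output.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def count_metrics(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (requests_total by path, errors_total by code) from JSON log lines."""
    requests_total: Counter = Counter()
    errors_total: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        extra = record.get("extra")
        if not isinstance(extra, dict):
            extra = {}
        # Assumption: "path" and "code" are keys inside "extra".
        if record.get("message") == "Starting execution":
            requests_total[extra.get("path", "unknown")] += 1
        if record.get("level") == "ERROR":
            errors_total[extra.get("code", "unknown")] += 1
    return requests_total, errors_total


sample_lines = [
    '{"level": "INFO", "message": "Starting execution", "extra": {"path": "/v1/run"}}',
    '{"level": "INFO", "message": "Starting execution", "extra": {"path": "/v1/run"}}',
    '{"level": "ERROR", "message": "Provider throttled", "extra": {"code": "LLM_RATE_LIMITED"}}',
    "plain text that is not JSON",
]
requests_total, errors_total = count_metrics(sample_lines)
```

In production the same logic usually belongs in the log pipeline itself, so counters survive restarts of whatever script does the counting.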

Read `observability_events`

The Engine writes structured events to the brain database:

SELECT event_type, COUNT(*)
FROM observability_events
WHERE created_at > datetime('now', '-1 hour')
GROUP BY event_type;

A sidecar can poll this table and push to your metrics backend.
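
A minimal sketch of the polling half of such a sidecar follows. It relies only on the `event_type` and `created_at` columns used in the query above; the in-memory table, the event type names, and the function name are stand-ins for illustration, and pushing the counts to a metrics backend is left out.

```python
import sqlite3
from typing import Dict


def count_recent_events(conn: sqlite3.Connection, window: str = "-1 hour") -> Dict[str, int]:
    """Count observability events per type within the given SQLite time window."""
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM observability_events "
        "WHERE created_at > datetime('now', ?) GROUP BY event_type",
        (window,),
    ).fetchall()
    return {event_type: count for event_type, count in rows}


# Demo against an in-memory stand-in. A real sidecar would open the brain
# database read-only instead, for example:
#   sqlite3.connect("file:<path-to-brain-db>?mode=ro", uri=True)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observability_events (event_type TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO observability_events VALUES (?, datetime('now', ?))",
    [
        ("permission_decision", "-5 minutes"),   # hypothetical event type
        ("permission_decision", "-10 minutes"),
        ("migration_run", "-2 hours"),           # outside the one-hour window
    ],
)
counts = count_recent_events(conn)
```

Opening the database read-only keeps the sidecar from ever competing with the Engine for write locks.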

Useful metrics to track

Metric	Source	Why
Request rate	logs	Capacity planning
Error rate by code	logs	Health
P50/P99 latency	logs	UX
Token usage per turn	SSE `usage` events	Cost
Cache hit ratio	SSE `usage.cache_hit_tokens`	Cost optimization
Compaction rate	SSE `compaction_event`	Context-pressure indicator
HITL rate	SSE `hitl_request`	Permission-policy calibration
Tool call distribution	SSE `tool_call`	Behavior shifts
MCP failure rate	logs	Connector health
Brain size	filesystem	Storage capacity
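
As an example of turning SSE `usage` events into one of these metrics, the sketch below computes a cache hit ratio over a batch of turns. `cache_hit_tokens` is the field named in the table; `input_tokens` as the denominator is an assumption about the payload shape, so adjust it to whatever your `usage` events actually carry.

```python
from typing import Iterable, Mapping, Optional


def cache_hit_ratio(usage_events: Iterable[Mapping[str, int]]) -> Optional[float]:
    """Return cached tokens divided by input tokens, or None if there is no input."""
    hit = 0
    total = 0
    for usage in usage_events:
        hit += usage.get("cache_hit_tokens", 0)
        # Assumption: "input_tokens" is the total prompt size for the turn.
        total += usage.get("input_tokens", 0)
    return hit / total if total else None


turns = [
    {"input_tokens": 1000, "cache_hit_tokens": 250},
    {"input_tokens": 1000, "cache_hit_tokens": 750},
]
ratio = cache_hit_ratio(turns)
```

Aggregating token counts before dividing weights long turns more heavily than short ones, which matches how cost accrues.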

Traces

The Engine doesn’t emit OpenTelemetry traces today. The poor man’s trace is request_id propagated through every log line — grep for the ID to see the full request path. For real distributed tracing:

Instrument upstream of the Engine with OTel.
Pass traceparent as a header on requests to the Engine.
Add the propagation yourself in the FastAPI middleware; the Engine doesn’t read or forward the header today.

This is a known gap. OpenTelemetry support is on the roadmap.
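
Until then, the propagation step can be sketched as a plain ASGI middleware that reads `traceparent` and exposes the trace ID to the rest of the request. This is not Engine code: the class name, the context variable, and the simplified header validation are all assumptions, and it only captures the ID rather than creating spans.

```python
import asyncio
import contextvars
import re
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
# Simplified check; it does not reject every value the spec forbids.
_TRACEPARENT_RE = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")

# Hypothetical context variable that log filters or handlers could read.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


def parse_trace_id(header_value: str) -> Optional[str]:
    """Return the 32-hex-digit trace ID, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header_value.strip())
    if match is None or match.group(1) == "0" * 32:
        return None
    return match.group(1)


class TraceparentMiddleware:
    """Pure ASGI middleware: stores the incoming trace ID for the request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        trace_id = None
        if scope["type"] == "http":
            headers = dict(scope.get("headers") or [])
            raw = headers.get(b"traceparent")
            if raw is not None:
                trace_id = parse_trace_id(raw.decode("latin-1"))
        token = trace_id_var.set(trace_id)
        try:
            await self.app(scope, receive, send)
        finally:
            trace_id_var.reset(token)


async def _demo() -> Optional[str]:
    seen = {}

    async def app(scope, receive, send):
        seen["trace_id"] = trace_id_var.get()

    scope = {
        "type": "http",
        "headers": [(b"traceparent", b"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")],
    }
    await TraceparentMiddleware(app)(scope, None, None)
    return seen["trace_id"]


demo_trace_id = asyncio.run(_demo())
```

With FastAPI, a class of this shape can be registered via `app.add_middleware(TraceparentMiddleware)`; a logging filter that copies `trace_id_var.get()` into each record would then tie Engine log lines to the upstream trace.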

`/health` endpoint

curl -fsS "$ENGINE_URL/health"

{
  "status": "ok",
  "uptime_seconds": 3600.5,
  "subsystems": {
    "database": "ok",
    "llm_provider": "ok",
    "asset_directory": "ok",
    "learning_centre": "ok"
  }
}

A status of "ok" means all subsystems are reachable. Anything else means at least one is degraded; read subsystems to find out which. Wire the endpoint to:

Load balancer health probes.
Container liveness/readiness probes.
Status-page checks.
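
For checks that need more than an HTTP status, a small probe can report which subsystems are unhealthy. The sketch below assumes the response shape shown above and treats any subsystem value other than "ok" as degraded; the function names and the "degraded" value are hypothetical, and whether the endpoint answers with a non-2xx status when degraded is not specified here, in which case `urlopen` would raise instead of returning a body.

```python
import json
import urllib.request
from typing import Dict, List


def degraded_subsystems(health: Dict) -> List[str]:
    """Return the names of subsystems whose state is anything other than "ok"."""
    subsystems = health.get("subsystems") or {}
    return sorted(name for name, state in subsystems.items() if state != "ok")


def check_health(base_url: str, timeout: float = 5.0) -> List[str]:
    """Fetch /health and list degraded subsystems (raises on network/HTTP errors)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return degraded_subsystems(json.load(response))


healthy = {
    "status": "ok",
    "uptime_seconds": 3600.5,
    "subsystems": {"database": "ok", "llm_provider": "ok",
                   "asset_directory": "ok", "learning_centre": "ok"},
}
# Hypothetical degraded payload; the exact non-ok values are an assumption.
unhealthy = {
    "status": "degraded",
    "subsystems": {"database": "ok", "llm_provider": "degraded"},
}
```

Calling `check_health(os.environ["ENGINE_URL"])` from a status-page job would return an empty list when everything is reachable.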

Alerts

For production, alert on:

Symptom	Threshold	Severity
`/health` non-ok	1 minute	Sev1
Error rate > 5%	5 minutes	Sev2
P99 latency > 30s	5 minutes	Sev2
Cache hit ratio drops > 30% from baseline	15 minutes	Sev3
Compaction rate > 50% of turns	1 hour	Sev3
`MIGRATION_REQUIRED` errors	any	Sev1
`LLM_RATE_LIMITED` rate > 1%	5 minutes	Sev2
MCP failure rate > 10% (per server)	15 minutes	Sev2
Disk usage > 80%	1 hour	Sev2
Container OOM	any	Sev2

Wire each alert to your incident tooling so the on-call gets paged.
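
If your alerting tool does not express "over threshold for N minutes" natively, the rule can be evaluated over per-minute samples. The sketch below reads the "Error rate > 5%" row as "every one of the last five one-minute samples exceeds 5%"; that interpretation, the sample shape, and the function name are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def error_rate_alert(
    samples: Sequence[Tuple[int, int]],
    threshold: float = 0.05,
    window: int = 5,
) -> bool:
    """Fire when each of the last `window` samples has an error rate above threshold.

    samples: per-minute (requests, errors) pairs, oldest first.
    """
    if len(samples) < window:
        return False  # not enough history to call it sustained
    recent = samples[-window:]
    return all(
        requests > 0 and errors / requests > threshold
        for requests, errors in recent
    )


sustained = [(100, 6)] * 5                 # 6% for five minutes
recovered = [(100, 6)] * 4 + [(100, 4)]    # last minute back under 5%
```

Requiring the whole window to breach trades a few minutes of detection delay for far fewer pages on short spikes.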

Dashboards

A useful Engine dashboard has:

Overview

Request rate (line)
Error rate by code (stacked area)
P50/P99 latency (line)
Active task count (line)

Cost

Tokens per minute, input vs output (stacked area)
Cost per minute (line)
Cache hit ratio (line)

Behavior

Tool call distribution (pie / table)
HITL request rate (line)
Compaction rate (line)

Health

/health status (single stat)
Subsystem status grid
Brain size (line, slow-moving)
MCP connection count by status (stacked bar)

Audit trail

For deployments with audit/compliance requirements, the observability_events table is the source of truth. Export it periodically to a long-term store (S3, BigQuery) so the trail outlives the brain database; the export is only tamper-evident if that store is configured as write-once or versioned. The table captures:

Permission decisions (allowed / denied / prompt fired).
Migration runs.
Catalog reloads.
API key validation events (success / failure).
MCP connection lifecycle.

See also

Local development

Deploy

Configuration

Infrastructure

On-call

Incidents

SLOs

Security
