What the Engine emits
| Source | Format | Use |
|---|---|---|
| stdout/stderr | JSON-structured logs | Log shipping (ELK, Datadog, Honeycomb) |
| observability_events table | SQLite rows | Internal metrics, eventually exported |
| SSE stream (per-call) | events including usage, compaction_event | Per-call instrumentation |
| /health endpoint | JSON | Liveness probes |
Log format
Every log line is JSON, with these fields:
- level: DEBUG, INFO, WARNING, or ERROR. Set the floor with ENGINE_LOG_LEVEL.
- logger: the Python module that emitted the log. Useful for filtering.
- request_id: unique per HTTP request. Trace one request across many log lines.
- task_id: the task the request belongs to. Trace across requests in the same conversation.
- message: the human-readable summary.
- extra: structured fields specific to the log type.
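As a rough illustration of tracing one request through the logs, the sketch below parses JSON lines and keeps those with a given request_id. The field names come from the list above; the sample values (logger names, IDs, the MCP_ERROR code) are invented for the example and are not taken from real Engine output.

```python
import json
from typing import Iterable, Iterator, List

# Hypothetical sample lines: field names follow the documented log format,
# but every value here is made up for illustration.
SAMPLE_LINES = [
    '{"level": "INFO", "logger": "engine.api", "request_id": "req-1", '
    '"task_id": "task-9", "message": "Starting execution", "extra": {}}',
    '{"level": "ERROR", "logger": "engine.mcp", "request_id": "req-2", '
    '"task_id": "task-9", "message": "Connector failed", "extra": {"code": "MCP_ERROR"}}',
    "not json at all",
    '{"level": "INFO", "logger": "engine.api", "request_id": "req-1", '
    '"task_id": "task-9", "message": "Finished", "extra": {}}',
]


def parse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON object line; skip everything else."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def trace_request(lines: Iterable[str], request_id: str) -> List[dict]:
    """Return the records for one request, in the order they were logged."""
    return [r for r in parse_lines(lines) if r.get("request_id") == request_id]


if __name__ == "__main__":
    for record in trace_request(SAMPLE_LINES, "req-1"):
        print(record["level"], record["message"])
```

Swapping request_id for task_id in the filter gives the same view across every request in a conversation.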
Log levels by environment
| Env | Recommended level |
|---|---|
| Local dev | DEBUG (when debugging), INFO otherwise |
| Staging | INFO |
| Production | INFO |
Log shipping
The Engine writes to stdout. Ship from there.
Docker / docker-compose
Kubernetes
Use a sidecar log collector or a node-level agent. Logs go to /var/log/containers/; the agent picks them up and routes them to your backend.
Log retention
Engine logs are not durable on the local container’s filesystem. Ship them to a backend that retains them for as long as you need. 30-day retention is typical; some compliance scenarios require longer.
Metrics
The Engine doesn’t ship a Prometheus exporter today. Two paths to metrics:
Mine the logs
Most useful metrics can be derived from the structured logs:
- engine_requests_total{path}: count of "Starting execution" log lines.
- engine_errors_total{code}: count of ERROR-level log lines by error code.
- engine_request_duration_seconds_bucket{path}: histogram from request start/end log lines.
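A minimal sketch of the first two derivations follows. It assumes the request path and the error code are carried in the extra field under the keys path and code; that placement is an assumption made for the example, so adjust the lookups to wherever your log lines actually carry them.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def derive_counters(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (requests_total by path, errors_total by code) from JSON log lines."""
    requests_total: Counter = Counter()
    errors_total: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON noise on stdout
        if not isinstance(record, dict):
            continue
        extra = record.get("extra")
        if not isinstance(extra, dict):
            extra = {}
        # ASSUMPTION: "path" and "code" live under "extra"; not confirmed by the docs.
        if "Starting execution" in str(record.get("message", "")):
            requests_total[extra.get("path", "unknown")] += 1
        if record.get("level") == "ERROR":
            errors_total[extra.get("code", "unknown")] += 1
    return requests_total, errors_total


# Hypothetical log lines used only to exercise the function.
DEMO_LINES = [
    '{"level": "INFO", "message": "Starting execution", "extra": {"path": "/v1/run"}}',
    '{"level": "INFO", "message": "Starting execution", "extra": {"path": "/v1/run"}}',
    '{"level": "ERROR", "message": "Upstream throttled", "extra": {"code": "LLM_RATE_LIMITED"}}',
    '{"level": "ERROR", "message": "Something broke"}',
]

if __name__ == "__main__":
    requests, errors = derive_counters(DEMO_LINES)
    print(dict(requests), dict(errors))
```

Most log backends can express the same counts as saved queries; a script like this is mainly useful for checking the field mapping before building those queries.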
Read observability_events
The Engine writes structured events to the observability_events table in the brain database.
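Because the table's columns are not listed here, a cautious first step is to inspect it rather than guess at a schema. The sketch below reads the column names and row count with the standard sqlite3 module. The demo builds an in-memory table whose columns (id, event_type, payload) are purely hypothetical stand-ins.

```python
import sqlite3
from typing import List, Tuple


def describe_table(conn: sqlite3.Connection,
                   table: str = "observability_events") -> Tuple[List[str], int]:
    """Return (column names, row count) without assuming the table's schema."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    if not columns:
        raise LookupError(f"table {table!r} not found")
    (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return columns, row_count


def demo() -> Tuple[List[str], int]:
    """Exercise describe_table against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    # HYPOTHETICAL schema, for illustration only; the real table may differ.
    conn.execute(
        "CREATE TABLE observability_events "
        "(id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO observability_events (event_type, payload) VALUES (?, ?)",
        [("migration_run", "{}"), ("catalog_reload", "{}")],
    )
    return describe_table(conn)


if __name__ == "__main__":
    print(demo())
```

Against a live deployment, opening the file read-only (for example `sqlite3.connect("file:/path/to/brain.db?mode=ro", uri=True)`, where the path is a placeholder) avoids any risk of writing to the database the Engine is using.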
Useful metrics to track
| Metric | Source | Why |
|---|---|---|
| Request rate | logs | Capacity planning |
| Error rate by code | logs | Health |
| P50/P99 latency | logs | UX |
| Token usage per turn | SSE usage events | Cost |
| Cache hit ratio | SSE usage.cache_hit_tokens | Cost optimization |
| Compaction rate | SSE compaction_event | Context-pressure indicator |
| HITL rate | SSE hitl_request | Permission-policy calibration |
| Tool call distribution | SSE tool_call | Behavior shifts |
| MCP failure rate | logs | Connector health |
| Brain size | filesystem | Storage capacity |
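For the latency row, P50 and P99 can be computed from per-request durations once those have been extracted from the start/end log lines. This is a small nearest-rank percentile sketch; the duration values are invented, and a metrics backend will usually offer its own percentile function with slightly different interpolation.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request durations in seconds.
DURATIONS = [0.2, 0.4, 0.1, 3.0, 0.3]

if __name__ == "__main__":
    print("P50:", percentile(DURATIONS, 50))
    print("P99:", percentile(DURATIONS, 99))
```

With few samples, P99 collapses to the maximum, so compute it over a window large enough to be meaningful.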
Traces
The Engine doesn’t emit OpenTelemetry traces today. The poor man’s trace is request_id propagated through every log line: grep for the
ID to see the full request path.
For real distributed tracing:
- Instrument upstream of the Engine with OTel.
- Pass traceparent as a header.
- The Engine doesn’t propagate this today; you’d need to add the propagation in the FastAPI middleware.
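If you add that propagation yourself, the middleware first has to validate the incoming header. The sketch below parses the four-field W3C traceparent form (version, trace ID, parent ID, flags) using only the standard library. It is not Engine code, the function and class names are made up for the example, and it deliberately rejects anything that does not match the four-field layout.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex), lowercase hex only.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is missing or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

A middleware could call this on each request and add trace_id to the log record's extra field, which lets you join Engine logs to upstream spans even without native OTel support.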
/health endpoint
status: "ok" means all subsystems are reachable. Anything else
means at least one is degraded — read subsystems.
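A probe script can apply that rule directly to the response body. The sketch below assumes subsystems is a JSON object mapping each subsystem name to a status string; only the status and subsystems keys come from this page, while the nested shape and the subsystem names in the sample are assumptions.

```python
import json
from typing import List, Tuple


def check_health(body: str) -> Tuple[bool, List[str]]:
    """Return (healthy, names of subsystems whose status is not "ok")."""
    payload = json.loads(body)
    healthy = payload.get("status") == "ok"
    subsystems = payload.get("subsystems")
    # ASSUMPTION: subsystems maps name -> status string; the real shape may differ.
    if not isinstance(subsystems, dict):
        subsystems = {}
    degraded = sorted(name for name, state in subsystems.items() if state != "ok")
    return healthy, degraded


# Hypothetical response body; subsystem names are invented.
SAMPLE_BODY = json.dumps(
    {"status": "degraded", "subsystems": {"brain": "ok", "mcp": "unreachable"}}
)

if __name__ == "__main__":
    print(check_health(SAMPLE_BODY))
```

Keeping the overall status and the degraded list separate means a probe can fail fast on status alone while an alert message still names the affected subsystem.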
Wire to:
- Load balancer health probes.
- Container liveness/readiness probes.
- Status-page checks.
Alerts
For production, alert on:
| Symptom | Threshold | Severity |
|---|---|---|
| /health non-ok | 1 minute | Sev1 |
| Error rate > 5% | 5 minutes | Sev2 |
| P99 latency > 30s | 5 minutes | Sev2 |
| Cache hit ratio drops > 30% from baseline | 15 minutes | Sev3 |
| Compaction rate > 50% of turns | 1 hour | Sev3 |
| MIGRATION_REQUIRED errors | any | Sev1 |
| LLM_RATE_LIMITED rate > 1% | 5 minutes | Sev2 |
| MCP failure rate > 10% (per server) | 15 minutes | Sev2 |
| Disk usage > 80% | 1 hour | Sev2 |
| Container OOM | any | Sev2 |
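Most rows in the table pair a condition with a duration, which reads naturally as "fire only if the condition holds for the whole window". One way to evaluate that, assuming you sample the metric once per minute, is sketched below. The sampling interval and the function name are choices made for the example; alerting systems typically provide an equivalent built-in.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], limit: float, window: int) -> bool:
    """True when the most recent `window` samples all exceed `limit`."""
    if window <= 0 or len(samples) < window:
        return False  # not enough history to decide
    return all(value > limit for value in samples[-window:])


# Hypothetical per-minute error rates (fraction of requests that failed).
ERROR_RATES = [0.01, 0.06, 0.07, 0.08, 0.09, 0.06]

if __name__ == "__main__":
    # "Error rate > 5%" sustained for 5 one-minute samples.
    print(sustained_breach(ERROR_RATES, limit=0.05, window=5))
```

Rows marked "any" skip the window entirely: a single occurrence should page.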
Dashboards
A useful Engine dashboard has:
Overview
- Request rate (line)
- Error rate by code (stacked area)
- P50/P99 latency (line)
- Active task count (line)
Cost
- Tokens per minute, input vs output (stacked area)
- Cost per minute (line)
- Cache hit ratio (line)
Behavior
- Tool call distribution (pie / table)
- HITL request rate (line)
- Compaction rate (line)
Health
- /health status (single stat)
- Subsystem status grid
- Brain size (line, slow-moving)
- MCP connection count by status (stacked bar)
Audit trail
For deployments with audit/compliance requirements, the observability_events table is the source of truth. Periodic export
to a long-term store (S3, BigQuery) gives you a tamper-evident audit
trail.
Specifically captured:
- Permission decisions (allowed / denied / prompt fired).
- Migration runs.
- Catalog reloads.
- API key validation events (success / failure).
- MCP connection lifecycle.
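Tamper evidence in an exported trail generally comes from the export format or the store's settings rather than from copying rows alone. One common technique, shown here as a sketch and not as something the Engine does, is to hash-chain the exported records so that editing or removing any earlier record invalidates every hash after it. The event dictionaries are hypothetical.

```python
import hashlib
import json
from typing import Iterable, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def chain_events(events: Iterable[dict]) -> List[dict]:
    """Wrap each event with the previous record's hash and its own hash."""
    prev = GENESIS
    records = []
    for event in events:
        current = _digest(prev, event)
        records.append({"event": event, "prev": prev, "hash": current})
        prev = current
    return records


def verify_chain(records: Iterable[dict]) -> bool:
    """Recompute every hash in order; any edit or gap makes this return False."""
    prev = GENESIS
    for record in records:
        expected = _digest(prev, record["event"])
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = expected
    return True


if __name__ == "__main__":
    # Hypothetical exported events.
    exported = chain_events([{"type": "migration_run"}, {"type": "catalog_reload"}])
    print(verify_chain(exported))
```

Truncating the tail of a chain still verifies, so the latest hash needs to be recorded somewhere the exporter cannot rewrite for the scheme to be meaningful.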
See also
- Threat model — what observability defends against.
- Health and feedback endpoints — the public surface.

