The Engine emits structured logs, observability events, and per-turn usage data. This page covers what's available to your application and how to use it to understand what the agent is doing in production. For the operational view (dashboards, alerts, log shipping), see Operations → Observability.
## What you can see

| Source | What it tells you |
|---|---|
| SSE stream | Per-turn behavior in real time. |
| `usage` events | Token counts and cache-hit ratio per call. |
| `/health` | Subsystem status. |
| `observability_events` table | Persistent event log queryable via API. |
| Engine logs | Structured logs with request IDs. |
## In-stream observability

Every `/execute` call emits these events as part of the stream:
### usage

Emitted after each model call. Sum the `usage` events to get the turn's total. The `cache_hit_tokens` count is part of `input_tokens`: it's the portion that was billed at the cache rate.
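A minimal sketch of that summation, assuming each parsed stream event is a dict with a `type` key and that `usage` events carry the token fields named above (`output_tokens` is an assumed companion field; only `input_tokens` and `cache_hit_tokens` are documented here):

```python
def summarize_turn(events):
    """Sum `usage` events for one turn and compute the cache-hit ratio."""
    input_tokens = 0
    output_tokens = 0
    cache_hit_tokens = 0
    for event in events:
        if event.get("type") != "usage":
            continue
        input_tokens += event.get("input_tokens", 0)
        output_tokens += event.get("output_tokens", 0)  # assumed field
        cache_hit_tokens += event.get("cache_hit_tokens", 0)
    ratio = cache_hit_tokens / input_tokens if input_tokens else 0.0
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_hit_ratio": ratio,
    }
```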
### tool_call and tool_result

Tells you which tools the agent used and what they returned. Useful for "why did the agent do X?" investigations.
### compaction_event

Emitted if compaction triggered during the turn.
### heartbeat

Periodic. Mostly useful for liveness; not interesting per-turn.
## Per-turn metrics to capture

For each turn, your application should capture (a capture sketch follows the list):

- `task_id`
- Turn duration (start to `thread_lifecycle: completed`)
- Model call count
- Tool calls (names and counts)
- Total token usage (sum of `usage` events)
- Cache hit ratio (`cache_hit_tokens / input_tokens`)
- Compaction count
- HITL request count
- Final stop reason
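One way to hold these is a plain per-turn record your application fills in while consuming the stream; the fields mirror the checklist above, and nothing here is an Engine API:

```python
from dataclasses import dataclass, field

@dataclass
class TurnMetrics:
    """Per-turn record, populated while consuming the /execute stream."""
    task_id: str
    duration_seconds: float = 0.0
    model_calls: int = 0
    tool_calls: dict[str, int] = field(default_factory=dict)  # name -> count
    input_tokens: int = 0
    cache_hit_tokens: int = 0
    compactions: int = 0
    hitl_requests: int = 0
    stop_reason: str | None = None

    @property
    def cache_hit_ratio(self) -> float:
        return self.cache_hit_tokens / self.input_tokens if self.input_tokens else 0.0
```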
## The observability_events table

The Engine writes a row to `observability_events` for every meaningful event:

- Errors.
- Slow tool calls.
- Permission denials.
- Migration runs.
- Catalog reloads.
- Cache-hit-ratio alerts (when `CACHE_HIT_MONITOR_ENABLED=true`).

Query the table via the raven query API, described next.
## The raven query API

The Engine exposes an internal observability API at `src/raven/api.py`. It's not currently surfaced as a public HTTP endpoint, but the underlying functions are stable. The expected shipping pattern (sketched after this list):

- Engine writes to `observability_events`.
- A sidecar (or scheduled job) reads new rows.
- The sidecar pushes to your observability backend (Datadog, Honeycomb, etc.).
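A minimal sketch of that loop. `fetch_events_since` is a hypothetical stand-in for whatever stable function `src/raven/api.py` actually exposes, `push_to_backend` is a placeholder for your vendor client, and the `id` column is an assumption:

```python
import time

def fetch_events_since(last_id: int) -> list[dict]:
    """Hypothetical stand-in for a stable function in src/raven/api.py."""
    ...

def push_to_backend(events: list[dict]) -> None:
    """Placeholder for your vendor client (Datadog, Honeycomb, etc.)."""
    ...

def run_sidecar(poll_interval: float = 10.0) -> None:
    # Tail observability_events: remember the last row shipped, poll for
    # newer rows, and forward them to the backend.
    last_id = 0
    while True:
        events = fetch_events_since(last_id)
        if events:
            push_to_backend(events)
            last_id = max(e["id"] for e in events)  # "id" column is assumed
        time.sleep(poll_interval)
```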
## Engine logs

Set `ENGINE_LOG_LEVEL=INFO` for production. Logs are JSON-structured:
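For illustration, one log line might look like this; only the `request_id` and `task_id` fields are documented on this page, and the rest of the shape is assumed:

```json
{
  "timestamp": "2025-01-01T12:00:00Z",
  "level": "INFO",
  "request_id": "req_abc123",
  "task_id": "task_def456",
  "message": "model call completed"
}
```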
The `request_id` is the same across all logs for one HTTP request. The `task_id` is the same across all logs for one task.

`DEBUG` logs include MCP request/response traces, model call parameters, and context-assembly steps. Useful for development; too noisy for production.

`WARNING` and `ERROR` are what you want to alert on.
## Tracing

The Engine doesn't currently emit OpenTelemetry traces, but the `request_id` propagated through logs gives you a poor man's trace: grep for the request ID across the logs to see the full request path. For real distributed tracing, add OTel instrumentation as a wrapper around the Engine's HTTP client and the Asset Directory's MCP client.
## What to alert on

For external observability, alert on the following (a sample check follows the list):

- Error rate > 1%. Something is broken.
- Cache hit ratio drops sharply. Indicates a cache invalidation bug, often after a deploy.
- Per-turn token count spikes. Often a sign of unbounded loops or failing compaction.
- HITL count spikes. Either the user is being asked too much (annoying) or the agent's permission policy is misconfigured.
- `/health` returns non-ok. Engine is unhealthy.
- `MIGRATION_REQUIRED` errors. Engine is at an old schema version.
- MCP failure rate spikes. A connector is broken.
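As one concrete example, a sharp-drop check for the cache hit ratio that a metrics pipeline could run over the per-turn records captured earlier; the 20% threshold and windowing are illustrative choices, not documented defaults:

```python
def cache_ratio_alert(recent: float, baseline: float,
                      drop_threshold: float = 0.20) -> bool:
    """Return True if the recent cache-hit ratio fell sharply vs. baseline.

    `recent` and `baseline` are mean cache-hit ratios (0.0-1.0) over two
    windows, e.g. the last 15 minutes vs. the prior 24 hours.
    """
    return baseline > 0 and (baseline - recent) / baseline > drop_threshold
```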
## What to dashboard

For day-to-day visibility:

- Tasks per minute (rate).
- Per-turn duration distribution.
- Per-turn cost distribution.
- Cache hit ratio trend.
- Tool-call distribution.
- Top error codes.
- Compaction frequency.
- HITL frequency by category.
## See also

- Operations → Observability — the ops view.
- Streaming events — `usage`, `compaction_event`, etc.
- `GET /health` — liveness endpoint.

