The Engine emits structured logs, observability events, and per-turn usage data. This page covers what’s available to your application and how to use it to understand what the agent is doing in production. For the operational view (dashboards, alerts, log shipping), see Operations → Observability.

What you can see

Source                        What it tells you
SSE stream                    Per-turn behavior in real time.
usage events                  Token counts and cache-hit ratio per call.
/health                       Subsystem status.
observability_events table    Persistent event log queryable via API.
Engine logs                   Structured logs with request IDs.

In-stream observability

Every /execute call emits the following event types as part of the SSE stream:

usage

After each model call:
{
  "input_tokens": 1240,
  "output_tokens": 312,
  "cache_hit_tokens": 980
}
Sum across all usage events to get the turn’s total. cache_hit_tokens is included in input_tokens; it is the portion of the input that was billed at the cache rate.
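
A minimal aggregation sketch in Python, assuming you have already parsed each usage event from the stream into a dict:

def summarize_usage(usage_events: list[dict]) -> dict:
    """Sum usage events for one turn and derive the cache hit ratio."""
    input_tokens = sum(e["input_tokens"] for e in usage_events)
    output_tokens = sum(e["output_tokens"] for e in usage_events)
    cache_hit_tokens = sum(e["cache_hit_tokens"] for e in usage_events)
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        # cache_hit_tokens is a subset of input_tokens, so this ratio is <= 1.0
        "cache_hit_ratio": cache_hit_tokens / input_tokens if input_tokens else 0.0,
    }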

tool_call and tool_result

Tells you which tools the agent used and what they returned. Useful for “why did the agent do X?” investigations.

compaction_event

If compaction triggered:
{
  "before_tokens": 24000,
  "after_tokens": 8500,
  "strategy": "summary"
}
Frequent compaction events suggest your task is too long for the context window. Worth investigating.

heartbeat

Periodic. Mostly useful for liveness; not interesting per-turn.

Per-turn metrics to capture

For each turn, your application should capture:
  • task_id
  • Turn duration (start to thread_lifecycle: completed)
  • Model calls count
  • Tool calls (names and counts)
  • Total token usage (sum of usage events)
  • Cache hit ratio (cache_hit_tokens / input_tokens)
  • Compaction count
  • HITL request count
  • Final stop reason
Logging these per turn lets you build dashboards over time.
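
A sketch of a collector for these metrics, assuming your SSE client yields parsed events as (event_type, data) pairs. The tool-name field ("name"), the HITL event type ("hitl_request"), and the status and stop-reason fields are illustrative assumptions, not confirmed names:

import time
from collections import Counter

def collect_turn_metrics(task_id: str, events) -> dict:
    """Aggregate one turn's SSE events into the metrics listed above."""
    start = time.monotonic()
    metrics = {
        "task_id": task_id,
        "model_calls": 0,
        "tool_calls": Counter(),
        "input_tokens": 0,
        "output_tokens": 0,
        "cache_hit_tokens": 0,
        "compaction_count": 0,
        "hitl_count": 0,
        "stop_reason": None,
    }
    for event_type, data in events:
        if event_type == "usage":
            metrics["model_calls"] += 1
            metrics["input_tokens"] += data["input_tokens"]
            metrics["output_tokens"] += data["output_tokens"]
            metrics["cache_hit_tokens"] += data["cache_hit_tokens"]
        elif event_type == "tool_call":
            metrics["tool_calls"][data.get("name", "unknown")] += 1  # "name" is assumed
        elif event_type == "compaction_event":
            metrics["compaction_count"] += 1
        elif event_type == "hitl_request":  # assumed event name
            metrics["hitl_count"] += 1
        elif event_type == "thread_lifecycle" and data.get("status") == "completed":
            metrics["stop_reason"] = data.get("stop_reason")  # assumed field
            break
    metrics["duration_s"] = time.monotonic() - start
    metrics["cache_hit_ratio"] = (
        metrics["cache_hit_tokens"] / metrics["input_tokens"]
        if metrics["input_tokens"]
        else 0.0
    )
    return metrics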

The observability_events table

The Engine writes a row to observability_events for every meaningful event:
  • Errors.
  • Slow tool calls.
  • Permission denials.
  • Migration runs.
  • Catalog reloads.
  • Cache-hit-ratio alerts (when CACHE_HIT_MONITOR_ENABLED=true).
The schema:
CREATE TABLE observability_events (
  id INTEGER PRIMARY KEY,
  event_type TEXT,
  data_json TEXT,
  created_at TEXT
);
Query it directly via SQLite or via the raven query API.
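
For example, a direct SQLite query for recent rate-limit events; the database path is an assumption, so point it at wherever your Engine stores its database:

import json
import sqlite3

conn = sqlite3.connect("engine.db")  # assumed path to the Engine's database
rows = conn.execute(
    """
    SELECT event_type, data_json, created_at
    FROM observability_events
    WHERE event_type = ? AND created_at >= ?
    ORDER BY created_at DESC
    LIMIT 100
    """,
    ("LLM_RATE_LIMITED", "2026-04-20T00:00:00Z"),
).fetchall()

for event_type, data_json, created_at in rows:
    print(created_at, event_type, json.loads(data_json))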

The raven query API

The Engine exposes an internal observability API at src/raven/api.py. It’s not currently surfaced as a public HTTP endpoint, but the underlying functions are stable:
from raven import api

events = api.query(
    event_type="LLM_RATE_LIMITED",
    since="2026-04-20T00:00:00Z",
    limit=100,
)
For external dashboards, the typical pattern is:
  1. Engine writes to observability_events.
  2. A sidecar (or scheduled job) reads new rows.
  3. Sidecar pushes to your observability backend (Datadog, Honeycomb, etc.).
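
A minimal sketch of steps 2 and 3, polling by id so the sidecar only reads rows it hasn’t seen; send_to_backend is a placeholder for your backend’s client, not an Engine API:

import json
import sqlite3
import time

def send_to_backend(event_type: str, data: dict, created_at: str) -> None:
    # Placeholder: replace with your Datadog/Honeycomb/etc. client call.
    print(created_at, event_type, data)

def run_sidecar(db_path: str, poll_interval: float = 10.0) -> None:
    """Poll observability_events for new rows and push them out."""
    conn = sqlite3.connect(db_path)
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, event_type, data_json, created_at "
            "FROM observability_events WHERE id > ? ORDER BY id",
            (last_id,),
        ).fetchall()
        for row_id, event_type, data_json, created_at in rows:
            send_to_backend(event_type, json.loads(data_json), created_at)
            last_id = row_id
        time.sleep(poll_interval)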

Engine logs

Set ENGINE_LOG_LEVEL=INFO for production. Logs are JSON-structured:
{
  "timestamp": "2026-04-27T12:34:56.123Z",
  "level": "INFO",
  "logger": "engine_core.coordinator",
  "request_id": "req-abc123",
  "task_id": "task-001",
  "message": "Starting execution",
  "extra": { ... }
}
The request_id is shared by all logs for a single HTTP request, and the task_id by all logs for a single task. DEBUG logs include MCP request/response traces, model-call parameters, and context-assembly steps; they are useful in development but too noisy for production. WARNING and ERROR are what you want to alert on.
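
Because the logs are JSON lines, they are easy to filter programmatically; a small sketch that prints every log line for one request:

import json
import sys

# Usage: python filter_logs.py req-abc123 < engine.log
request_id = sys.argv[1]
for line in sys.stdin:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON lines
    if record.get("request_id") == request_id:
        print(record["timestamp"], record["level"], record["message"])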

Tracing

The Engine doesn’t currently emit OpenTelemetry traces, but the request_id propagated through the logs gives you a poor man’s trace: grep for the request ID to see the full request path. For real distributed tracing, add OTel instrumentation as a wrapper around the Engine’s HTTP client and the Asset Directory’s MCP client.
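
A sketch of that wrapper using the OpenTelemetry Python API, assuming your application calls /execute with requests; the span and attribute names are illustrative:

import requests
from opentelemetry import trace

tracer = trace.get_tracer("engine-client")

def execute_task(base_url: str, payload: dict) -> requests.Response:
    """Wrap the /execute call in a span so it shows up in your tracing backend."""
    with tracer.start_as_current_span("engine.execute") as span:
        span.set_attribute("engine.task_id", payload.get("task_id", ""))
        response = requests.post(f"{base_url}/execute", json=payload)
        span.set_attribute("http.status_code", response.status_code)
        return response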

What to alert on

For external observability:
  • Error rate > 1%. Something is broken.
  • Cache hit ratio drops sharply. Indicates a cache invalidation bug, often after a deploy.
  • Per-turn token count spikes. Often a sign of unbounded loops or failing compaction.
  • HITL count spikes. Either the user is being asked too much (annoying) or the agent’s permission policy is misconfigured.
  • /health returns non-ok. Engine is unhealthy.
  • MIGRATION_REQUIRED errors. Engine is at an old schema version.
  • MCP failure rate spikes. A connector is broken.
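
A sketch of checking a few of these conditions against aggregated metrics; apart from the 1% error-rate threshold above, the field names and thresholds are illustrative assumptions:

def check_alerts(metrics: dict) -> list[str]:
    """Return the alert conditions that fire for one aggregation window."""
    alerts = []
    if metrics["error_rate"] > 0.01:  # the 1% threshold from this page
        alerts.append("error rate above 1%")
    # Assumed threshold: flag a fall to below half of the baseline ratio.
    if metrics["cache_hit_ratio"] < 0.5 * metrics["baseline_cache_hit_ratio"]:
        alerts.append("cache hit ratio dropped sharply")
    if metrics["health_status"] != "ok":
        alerts.append("/health reports non-ok")
    return alerts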

What to dashboard

For day-to-day visibility:
  • Tasks per minute (rate).
  • Per-turn duration distribution.
  • Per-turn cost distribution.
  • Cache hit ratio trend.
  • Tool-call distribution.
  • Top error codes.
  • Compaction frequency.
  • HITL frequency by category.

See also