SLO definitions

Service Level Objectives (SLOs) are the reliability commitments we make to ourselves and to customers. This page defines the SLIs we measure (the indicators), the SLOs we aim for (the targets), and the error budgets that drive deploy decisions. The numbers below are starting points. We will revise them as we accumulate production data: a target that is “achievable but not free” today may turn out to be “easy” or “impossible” tomorrow.

SLIs (what we measure)

Availability

SLI: 1 - (5xx responses on /execute) / (all responses on /execute)

Excludes:

4xx (caller errors).
503 from migration-required (operational, not failure).

A 5xx is any status code from 500 through 599, other than the migration-required 503 excluded above. A streamed response that emits an error event also counts as a 5xx if the agent never sent a thread_lifecycle: completed.
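
As a minimal sketch of the availability SLI, the rules above could be applied to a batch of /execute responses like this. The Response record and its two flags are hypothetical names for illustration, and the sketch assumes excluded responses are dropped from both the numerator and the denominator, which the definition above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int                       # HTTP status returned by /execute
    migration_required: bool = False  # hypothetical flag for the excluded 503
    stream_failed: bool = False       # hypothetical: error event, no thread_lifecycle: completed


def is_excluded(r: Response) -> bool:
    """4xx caller errors and migration-required 503s are left out of the SLI."""
    return 400 <= r.status <= 499 or (r.status == 503 and r.migration_required)


def is_failure(r: Response) -> bool:
    """Any remaining 5xx, or a stream that errored before completing."""
    return r.stream_failed or 500 <= r.status <= 599


def availability(responses: list[Response]) -> float:
    counted = [r for r in responses if not is_excluded(r)]
    if not counted:
        return 1.0  # assumption: no eligible traffic counts as fully available
    failures = sum(1 for r in counted if is_failure(r))
    return 1 - failures / len(counted)


sample = (
    [Response(200)] * 97
    + [Response(500), Response(502)]
    + [Response(200, stream_failed=True)]
    + [Response(404), Response(503, migration_required=True)]
)
print(f"{availability(sample):.2%}")  # 97.00%
```

In the sample, 100 responses are counted (the 404 and the migration-required 503 are dropped) and 3 of them are failures, including the stream that returned 200 but never completed.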

Latency (request)

SLI: time from POST /execute to first text_delta event (TTFT)
SLO targets:
  p50 < 1.5s
  p99 < 5s

TTFT is time to first token. The user’s perception of “the agent is working” starts here.

Latency (turn)

SLI: time from POST /execute to thread_lifecycle: completed
SLO targets:
  p50 < 8s
  p99 < 60s

This is the total turn time. Some turns are legitimately long (multi-step coding sessions), so the p99 tail is the number to watch.
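
Both latency SLIs are checked against percentile targets. The sketch below uses the nearest-rank percentile definition, which is an assumption: this page does not say which percentile method the metrics pipeline uses, and a histogram-based backend will give slightly different values. The function names are illustrative.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_slo_met(samples: list[float], p50_target: float, p99_target: float) -> bool:
    """True when both the p50 and the p99 are strictly below their targets (seconds)."""
    return percentile(samples, 50) < p50_target and percentile(samples, 99) < p99_target


# 100 made-up TTFT samples in seconds: mostly fast, one slow outlier.
ttft = [0.8] * 60 + [2.0] * 39 + [6.0]
print(percentile(ttft, 50), percentile(ttft, 99))            # 0.8 2.0
print(latency_slo_met(ttft, p50_target=1.5, p99_target=5))   # True
```

With 100 samples, a single 6 s outlier sits above the 99th rank, so the p99 stays at 2.0 s. A second outlier would push the p99 to 6.0 s and miss the target.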

Cost

SLI: cost per turn = input_tokens × input_price + output_tokens × output_price (token counts for that turn)
SLO targets:
  p50 < $0.05
  p99 < $0.50

This is a soft target: informational, not a deploy gate. p99 spikes are still worth investigating.
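
A worked instance of the cost formula, as a sketch. The prices are illustrative assumptions expressed in USD per token; they are not any provider's actual rates.

```python
def cost_per_turn(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost of one turn in USD; prices are USD per token."""
    return input_tokens * input_price + output_tokens * output_price


# Illustrative prices only (assumptions, not real provider rates).
INPUT_PRICE = 3e-6    # USD per input token
OUTPUT_PRICE = 15e-6  # USD per output token

cost = cost_per_turn(10_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"${cost:.3f}")  # $0.045
```

At these assumed prices, a turn with 10,000 input tokens and 1,000 output tokens lands just under the $0.05 p50 target.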

Cache hit ratio

SLI: sum(cache_hit_tokens) / sum(input_tokens)
SLO target: > 60% across all turns over 24 h

A drop in cache hit ratio is a leading indicator of cost spikes and behavioral drift.
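
Note that the SLI is a ratio of sums, not a mean of per-turn ratios, so large turns weigh more. A minimal sketch, assuming each turn is available as a (cache_hit_tokens, input_tokens) pair:

```python
def cache_hit_ratio(turns) -> float:
    """turns: iterable of (cache_hit_tokens, input_tokens) pairs over the window."""
    turns = list(turns)
    total_input = sum(inp for _, inp in turns)
    if total_input == 0:
        return 0.0  # assumption: with no input tokens, report a ratio of zero
    return sum(hit for hit, _ in turns) / total_input


# One large, well-cached turn and one small, uncached turn.
turns = [(900, 1000), (0, 100)]
print(f"{cache_hit_ratio(turns):.1%}")  # 81.8%
```

Averaging the two per-turn ratios (90% and 0%) would give 45% and suggest a miss, while the token-weighted figure is comfortably above the 60% target.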

SLOs (what we commit to)

Each SLO is evaluated over a monthly window.

SLO	Target
Availability	99.5%
TTFT p99	< 5s
Turn p99	< 60s

99.5% over a 30-day window allows ~3.6 hours of downtime. We start here because we’re not running 24/7 tier-1 support yet. When we onboard production customers with stricter SLAs, the target moves to 99.9% (which allows ~43 minutes/month).
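
The downtime figures follow directly from the target and the window length. A short sketch of the arithmetic, assuming a 30-day window:

```python
def allowed_downtime_hours(target: float, window_days: int = 30) -> float:
    """Hours of full downtime a target permits over the window."""
    return (1 - target) * window_days * 24


print(f"{allowed_downtime_hours(0.995):.1f} h")         # 3.6 h
print(f"{allowed_downtime_hours(0.999) * 60:.1f} min")  # 43.2 min
```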

Error budgets

The error budget is the difference between 100% and the SLO target.

SLO	Target	Error budget per month
Availability	99.5%	0.5% (3.6 h)
TTFT p99	< 5s	n/a (latency, not availability)

When the error budget is healthy:

Deploy freely.
Take risks on infrastructure changes.
Run experimental Engines in parallel.

When the error budget is more than 50% used:

Slow down deploys; require canary.
Defer non-critical infrastructure work.
Investigate what’s burning the budget.

When the error budget is exhausted:

Stop non-critical deploys.
All hands on reliability.
Cool-down period until the next monthly window.

This is a self-imposed gate. It’s not magic; it’s discipline.
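
The three tiers can be expressed as a small policy function. This is a sketch, not an existing tool: the enum and function names are hypothetical, and it assumes the middle tier starts once more than half of the monthly budget has been used and that "exhausted" means 100% used.

```python
from enum import Enum


class DeployPolicy(Enum):
    DEPLOY_FREELY = "deploy freely"
    CANARY_REQUIRED = "slow down deploys; require canary"
    FREEZE = "stop non-critical deploys"


def deploy_policy(budget_used: float) -> DeployPolicy:
    """budget_used: fraction of the monthly error budget consumed (1.0 = exhausted)."""
    if budget_used >= 1.0:
        return DeployPolicy.FREEZE
    if budget_used > 0.5:  # assumed boundary for the middle tier
        return DeployPolicy.CANARY_REQUIRED
    return DeployPolicy.DEPLOY_FREELY


print(deploy_policy(0.76).value)  # slow down deploys; require canary
```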

Burn-rate alerts

The error budget burns at a rate that depends on how severe the failures are. We alert on burn rate, not just absolute error rate: against a 0.5% budget, a 2% error rate sustained for an hour burns at 4× normal, while a 100% error rate for a minute burns at 200× normal.

Window	Burn rate threshold	Severity
1 hour	14× normal	Sev1
6 hours	6× normal	Sev2
24 hours	3× normal	Sev3

A “1× normal” burn rate is the rate at which the budget would deplete exactly at the end of the month if it continued.
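
Under that definition, burn rate is the observed error rate divided by the budget fraction (1 minus the SLO target). The sketch below applies the thresholds from the table; the function names are illustrative, and treating a threshold as met when the burn rate is greater than or equal to it is an assumption.

```python
from typing import Optional

SLO_TARGET = 0.995

# (window in hours, burn-rate threshold, severity), most severe first.
ALERT_RULES = [(1, 14.0, "Sev1"), (6, 6.0, "Sev2"), (24, 3.0, "Sev3")]


def burn_rate(error_rate: float, target: float = SLO_TARGET) -> float:
    """Multiple of the 'normal' rate at which the budget is being consumed."""
    return error_rate / (1 - target)


def severity(error_rate_by_window: dict[int, float]) -> Optional[str]:
    """Most severe alert whose window meets its threshold, or None."""
    for window_hours, threshold, sev in ALERT_RULES:
        rate = error_rate_by_window.get(window_hours)
        if rate is not None and burn_rate(rate) >= threshold:
            return sev
    return None


# A steady 2% error rate is a ~4x burn: only the 24-hour rule fires.
print(severity({1: 0.02, 6: 0.02, 24: 0.02}))  # Sev3
# An 8% error rate over the last hour is a ~16x burn.
print(severity({1: 0.08, 6: 0.02, 24: 0.02}))  # Sev1
```

At a sustained burn rate of B, a 30-day budget lasts 30 / B days, so a 14× burn would exhaust it in a little over two days.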

Per-customer SLOs

For deployments with multiple customers, per-customer SLOs may apply on top of the global ones. A specific customer might have:

Tighter availability (99.9%).
Tighter latency (TTFT < 3s p99).
Reserved capacity (their Engine never shares resources).

These are negotiated separately and tracked in the customer’s contract; they live in their own dashboards.

Scope

These SLOs cover the Engine itself. They do not cover:

LLM provider availability. We can’t promise what we don’t control.
MCP server availability. Each connector has its own SLO with its own provider.
Network availability between customer and Engine. That’s the customer’s network and ours, jointly.

For customer-facing SLAs, factor in upstream dependencies and add an appropriate cushion.

Reporting

Per-month SLO compliance is reported in the engineering review:

April 2026 SLO report
- Availability: 99.62% (target 99.5%) ✓
- TTFT p99: 4.8s (target < 5s) ✓
- Turn p99: 78s (target < 60s) ✗
- Error budget: 76% used

Notable: Turn p99 missed due to a regression in compaction (fixed
Apr 22). Action items in postmortem PM-2026-04-22.

Misses don’t trigger panic; they trigger investigation. The pattern matters more than any one month.
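
The “Error budget: 76% used” line in the example report is consistent with its availability figure. A sketch of the calculation, assuming budget consumption is measured as the observed failure fraction divided by the allowed failure fraction:

```python
def budget_used(measured_availability: float, target: float) -> float:
    """Fraction of the error budget consumed over the window."""
    return (1 - measured_availability) / (1 - target)


print(f"{budget_used(0.9962, 0.995):.0%}")  # 76%
```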
