SLIs (what we measure)
Availability
Not counted as failures:

- 4xx (caller errors).
- 503 from migration-required (operational, not failure).
Error events count as 5xx for this purpose if the agent never sent a thread_lifecycle: completed.
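The availability rules above can be sketched as a small classifier. This is a minimal illustration, not the Engine's actual measurement code: the `RequestOutcome` record and its field names are hypothetical, and it assumes excluded requests stay in the denominator as non-failures.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestOutcome:
    """Hypothetical per-request record; field names are illustrative only."""
    status: int                       # HTTP status code returned to the caller
    migration_required: bool = False  # True when a 503 came from migration-required
    error_event: bool = False         # the response stream emitted an error event
    completed: bool = False           # the agent sent thread_lifecycle: completed


def counts_as_failure(o: RequestOutcome) -> bool:
    # 4xx responses are caller errors and never count against availability.
    if 400 <= o.status < 500:
        return False
    # A 503 from migration-required is operational, not a failure.
    if o.status == 503 and o.migration_required:
        return False
    # An error event without a completed lifecycle is treated like a 5xx.
    if o.error_event and not o.completed:
        return True
    return o.status >= 500


def availability(outcomes: Iterable[RequestOutcome]) -> float:
    outcomes = list(outcomes)
    if not outcomes:
        return 1.0  # assumption: no traffic means nothing failed
    failures = sum(counts_as_failure(o) for o in outcomes)
    return 1 - failures / len(outcomes)
```

Under these assumptions, a window containing one plain 500 and one stream that errored before completing, out of five requests, yields an availability of 0.6.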
Latency (request)
Latency (turn)
Cost
Cache hit ratio
SLOs (what we commit to)
Per-month windows.

| SLO | Target |
|---|---|
| Availability | 99.5% |
| TTFT p99 | < 5s |
| Turn p99 | < 60s |
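As an illustration of how the latency targets in the table might be checked, the sketch below computes a nearest-rank p99 over a month of samples. The percentile method and the function names are assumptions; the source does not say how percentiles are calculated.

```python
import math
from typing import Sequence

# Targets taken from the SLO table above, in seconds.
TTFT_P99_TARGET_S = 5.0
TURN_P99_TARGET_S = 60.0


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(samples: Sequence[float], target_s: float) -> bool:
    # The SLO is a strict bound: p99 must be below the target.
    return percentile(samples, 99) < target_s
```

For example, `meets_latency_slo(ttft_samples, TTFT_P99_TARGET_S)` would report whether the month's TTFT samples satisfy the `< 5s` target, where `ttft_samples` is a hypothetical list of measurements in seconds.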
Error budgets
The error budget is the difference between 100% and the SLO target.

| SLO | Target | Error budget per month |
|---|---|---|
| Availability | 99.5% | 0.5% (3.6 h) |
| TTFT p99 | < 5s | n/a (latency, not availability) |
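The 3.6 h figure in the table follows from simple arithmetic. The sketch below assumes a 30-day month, which is the window length that reproduces that figure; the function name is illustrative.

```python
def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of full unavailability allowed by an availability SLO.

    Assumes a 30-day window by default: (1 - 0.995) * 30 * 24 is about 3.6 hours.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24


def budget_remaining(slo_target: float, downtime_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    budget = error_budget_hours(slo_target, window_days)
    return 1.0 - downtime_hours / budget
```

With the 99.5% target, `budget_remaining(0.995, 1.8)` comes out at roughly one half: 1.8 hours of downtime spends half of the month's budget.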
While budget remains:

- Deploy freely.
- Take risks on infrastructure changes.
- Run experimental Engines in parallel.

When the budget is running low:

- Slow down deploys; require canary.
- Defer non-critical infrastructure work.
- Investigate what’s burning the budget.

When the budget is exhausted:

- Stop non-critical deploys.
- All hands on reliability.
- Cool-down period until the next monthly window.
Burn-rate alerts
The error budget burns at a rate dependent on how bad the failures are. We alert on burn rate, not just absolute error rate, because a 2% error rate sustained for an hour burns budget at a different rate than 100% for a minute.

| Window | Burn rate threshold | Severity |
|---|---|---|
| 1 hour | 14× normal | Sev1 |
| 6 hours | 6× normal | Sev2 |
| 24 hours | 3× normal | Sev3 |
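One way to read the thresholds above is sketched below. It assumes "normal" means the error rate that would spend exactly the whole budget over the window's month (0.5% for a 99.5% target), and that a threshold fires when the burn rate is at or above it. Both are assumptions, and the function names are illustrative.

```python
from typing import Mapping, Optional

SLO_TARGET = 0.995

# (window in hours, burn-rate threshold, severity), taken from the table above.
THRESHOLDS = [
    (1, 14.0, "Sev1"),
    (6, 6.0, "Sev2"),
    (24, 3.0, "Sev3"),
]


def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Multiple of the budget-neutral error rate at which budget is being spent."""
    return error_rate / (1.0 - slo_target)


def severity(error_rate_by_window_hours: Mapping[int, float]) -> Optional[str]:
    """Return the most severe alert that fires, or None if no threshold is met."""
    for hours, threshold, sev in THRESHOLDS:
        rate = error_rate_by_window_hours.get(hours)
        if rate is not None and burn_rate(rate) >= threshold:
            return sev
    return None
```

Under this reading, a 2% error rate is a burn rate of about 4× and a 100% error rate is about 200×, which is why the two cases in the paragraph above are treated differently even though they consume a similar amount of budget. A sustained 2% across all three windows would trip only the 24-hour threshold.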
Per-customer SLOs
For deployments with multiple customers, per-customer SLOs may apply above the global ones. A specific customer might have:

- Tighter availability (99.9%).
- Tighter latency (TTFT < 3s p99).
- Reserved capacity (their Engine never shares resources).
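Per-customer targets like these could be modelled as overrides layered on the global SLOs. The sketch below is only an illustration: the `Slo` type, the `OVERRIDES` table, and the customer ID are hypothetical, and the source does not describe how such overrides are stored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Slo:
    availability: float = 0.995      # global target from the SLO table
    ttft_p99_s: float = 5.0          # global TTFT p99 target, seconds
    turn_p99_s: float = 60.0         # global turn p99 target, seconds
    reserved_capacity: bool = False  # whether the customer's Engine is dedicated


GLOBAL_SLO = Slo()

# Hypothetical override table keyed by customer ID; only the differing fields are listed.
OVERRIDES = {
    "customer-a": {"availability": 0.999, "ttft_p99_s": 3.0, "reserved_capacity": True},
}


def slo_for(customer_id: str) -> Slo:
    """Global SLO with any per-customer overrides applied on top."""
    return replace(GLOBAL_SLO, **OVERRIDES.get(customer_id, {}))
```

Customers without an entry fall back to the global targets, so only the exceptions need to be maintained.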
Scope
These SLOs cover the Engine itself. They do not cover:

- LLM provider availability. We can’t promise what we don’t control.
- MCP server availability. Each connector has its own SLO with its own provider.
- Network availability between customer and Engine. That’s the customer’s network and ours, jointly.
Reporting
Per-month SLO compliance is reported in the engineering review.
See also
- Observability — how we measure these.
- Postmortems — what comes after misses.

