The Engine has 110 test files covering the agent loop, sandbox, memory, LLM integration, asset directory, learning centre, security, and the edges between them. This page covers how the test suite is organized, the rules we follow when adding tests, and how to run them.
## The pyramid
We test at three levels, with deliberate weight at each:

| Level | Count (approx) | What |
|---|---|---|
| Unit | ~60% | Pure functions, single classes, narrow behavior. |
| Integration | ~30% | Real DB, real sandbox, real services. |
| End-to-end | ~10% | Full /execute runs against real LLM providers. |
## Where tests live
### conftest.py
The session conftest provides:

- A `db` fixture that creates a temporary SQLite database, loads sqlite-vec, runs migrations, and yields a connection pool.
- A seccomp probe that checks whether the test environment supports the BPF filters the sandbox uses. Tests that require seccomp are skipped on incompatible hosts (notably some QEMU-emulated environments).
- Async-mode configuration (`asyncio_mode: strict`).
## Rules
### No mocks for LLM calls
We do not mock LLM provider responses. Tests that exercise the agent loop hit real model APIs. This is a deliberate constraint inherited from the project's CLAUDE.md. The rationale: mocked LLM responses pass the test and fail in production at the worst possible moment. If a test requires a model, it makes a real call. The cost is real (in dollars and time). We mitigate by:

- Pinning to cheap models (Haiku-class) for tests that don't need the strong model.
- Caching where possible.
- Running expensive tests less frequently (nightly, not per-PR).
### Real database
The shared `db` fixture creates a real SQLite + sqlite-vec instance per session. Tests against memory, asset directory, and any persisted state hit real SQL. No mock query builders, no faked rows.
### Real sandbox
Sandbox tests run real `bubblewrap`. The test container is privileged because bwrap requires capabilities the default Docker security profile doesn't grant.
### One assert per concept
Tests with five unrelated asserts hide which behavior actually broke. Split into multiple tests when the asserts cover different concepts.

### Descriptive names

A test's name should state the behavior it verifies, so a failure reads as a statement of what broke.
## Running tests
### Full suite
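The original command block was lost here; a minimal sketch, assuming a standard pytest setup invoked from the repository root:

```shell
# Run the full suite from the repo root (assumes pytest is configured
# via pyproject.toml or pytest.ini; -q keeps the output compact)
pytest -q
```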
### Single file
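A sketch of running one file; the path `tests/test_agent_loop.py` is illustrative, not a confirmed filename:

```shell
# Run a single test file (substitute the real path)
pytest tests/test_agent_loop.py -q
```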
### By name
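Pytest's standard `-k` selection works here; the expression below uses hypothetical test names:

```shell
# Select tests whose names match an expression
# (-k supports "and", "or", and "not")
pytest -k "compaction and not stress" -q
```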
### Without LLM dependencies
The `requires_llm` marker is added to tests that hit a provider. Run without them when the API key is unavailable or when iterating on non-LLM logic.
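Assuming `requires_llm` is registered as a pytest marker, the standard way to deselect it:

```shell
# Skip every test marked requires_llm
pytest -m "not requires_llm" -q
```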
### With coverage
- New code: 80%+ (enforced in CI).
- Critical paths (agent loop, sandbox, permissions): 95%+.
- Total: 75%+ (informational).
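A sketch of a coverage run matching the 80% gate above, using pytest-cov; the package name `engine` is a placeholder for the real source package:

```shell
# Coverage run with the CI floor enforced locally
# (--cov-fail-under makes pytest exit non-zero below the threshold)
pytest --cov=engine --cov-report=term-missing --cov-fail-under=80
```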
## Stress tests
`stress_*.py` files run for minutes and exercise high-concurrency or large-data paths. They're not in the regular suite — they run nightly:

- `stress_agent_loop_concurrency.py` — many concurrent `/execute` calls.
- `stress_memory_search.py` — large brain, fast retrieval.
- `stress_compaction.py` — long contexts, repeated compaction.
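Since these are excluded from the regular run, invoke them by file; the directory prefix below is an assumption about the repo layout:

```shell
# Run one stress test directly (path is illustrative)
pytest tests/stress_memory_search.py -q
```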
## Test data
Test fixtures live alongside the tests. We don't ship a "test data" directory; each test that needs setup creates it inline. This keeps tests independent. For shared setup (a brain with N episodes), the fixture is a function in `conftest.py` that builds the state programmatically. Don't commit SQLite blobs.

## When to add a test
| Situation | Add a test? |
|---|---|
| Fixed a bug | Yes. Encode the bug as a test that fails before, passes after. |
| Added a new endpoint | Yes. Smoke test plus at least one error-path test. |
| Added a new tool | Yes. Test that the tool registers, runs, and returns valid output. |
| Refactored without behavior change | No. The existing tests should still pass; that’s the proof. |
| Added an internal helper | Sometimes. If the helper has non-trivial logic, yes. |
| Tweaked a prompt | No (this is what evals are for, see Evaluation). |
## Flaky tests
A test that passes some runs and fails others is a problem. Two paths:

- Fix the flakiness. Usually it's a race or a timing assumption.
- Quarantine it. Mark it `@pytest.mark.flaky` and stop blocking on it. Flag it for investigation; don't normalize tolerating flakes.
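Assuming `flaky` is registered as a marker, quarantined tests can be excluded from a local run the standard pytest way:

```shell
# Run everything except quarantined tests
pytest -m "not flaky" -q
```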
## CI
CI runs on every PR:

- The full suite (in privileged mode, with LLM keys).
- Coverage report.
- Lint and formatting.

Nightly:

- Full suite.
- Stress tests.
- Long-running integration tests.
## See also
- Evaluation — testing agent behavior, distinct from testing code.
- Eval harness — the runner for behavior tests.
- Common tasks — shortcuts for running tests locally.

