Safety in an agentic system covers more than the model’s text output. The agent runs commands, writes files, calls APIs, sends messages. Every one of those is a potential way to cause harm. This page covers the safety surface for building on the Engine and the patterns that hold the line.Documentation Index
Fetch the complete documentation index at: https://internal.september.wtf/llms.txt
Use this file to discover all available pages before exploring further.
What can go wrong
The harms that come up in practice:- Destructive actions. The agent deletes the wrong file, drops the wrong table, sends the wrong message.
- Data exfiltration. Secrets in the model’s output. Sensitive data in tool calls to third-party services.
- Privilege escalation. The agent uses one capability to gain another it shouldn’t have.
- Prompt injection. Untrusted content (web pages, emails, MCP responses) tricks the model into ignoring its instructions.
- Refusal failures. The agent helps with something it shouldn’t.
- Reputation harm. The agent says something offensive, factually wrong, or off-brand.
What the Engine does for you
Sandboxing
Every command run inside the Engine goes through bwrap + seccomp + landlock. The agent can’t escape the configured filesystem, can’t execute arbitrary syscalls, can’t read environment variables it shouldn’t see. See Code execution.Permission prompts
The static permission policy (src/sandbox/permissions.py) intercepts
dangerous operations before they run and surfaces them as HITL prompts.
The user — not the model — decides. Categories:
- Destructive shell (
rm -rf,shred,chown). - Writes outside
ALLOWED_ROOTS. - Network operations (configurable).
- High-cost actions.
Secret scanning
The Engine’s secret scanner (src/security/secret_scanner.py) runs
over outbound content and scrubs likely API keys, tokens, and
credentials before the SSE stream emits them. Defense-in-depth: don’t
let secrets show up in agent output even if they leak into context.
Encrypted credentials
MCP credentials live in the Asset Directory, encrypted at rest with Fernet usingAD_ENCRYPTION_KEY. They never appear in the prompt, the
SSE stream, or logs. They’re only decrypted when the Engine makes an
outbound call to the MCP server.
Content sanitization
The Unicode sanitization middleware strips zero-width characters, right-to-left overrides, and other Unicode tricks that get used in prompt-injection attacks. Inputs and outputs both go through it.What the Engine doesn’t do
Refusal policy
The Engine doesn’t enforce a refusal policy. The model has its own trained-in refusals; you can extend them via the system prompt; the Engine itself is not a content filter. If you need stronger refusal behavior, build it at the agent layer:Multi-tenant isolation
The Engine is single-tenant by design. Each instance has one brain. Multi-user products run multiple Engines and route at the upstream layer. Don’t try to use one Engine instance across users — there’s no isolation for that.Network egress filtering
The default deployment doesn’t restrict outbound network destinations. For deployments that need that, add iptables rules viascripts/firewall.sh.
Input validation at your application layer
The Engine validates against its own schemas. You should validate against your business rules upstream — “this user shouldn’t be able to ask about Y,” “messages over 10K characters get rejected” — before sending to the Engine.Patterns
Confirm before destructive
Refuse fast
For categories the agent shouldn’t help with, refuse in the system prompt, not after deliberation. Otherwise the model writes a long explanation of why it’s refusing, which leaks information about its prompt and policy.Treat tool output as data, not instructions
This is the core defense against prompt injection. Add to the system prompt:Constrain credentials
Each MCP connector has scopes. Request the narrowest scopes needed. A Slack bot that posts to one channel doesn’t needchat:write across
the workspace.
Log for audit
For multi-user deployments, log every tool call with:- Task ID.
- Tool name.
- Input (redacted as needed).
- Result code.
- Timestamp.
Threat model
The internal threat model is at Threat model. It covers:- The trust boundaries.
- The adversaries (malicious user, malicious model output, malicious MCP server).
- The defenses by layer.
- The residual risks we accept.
What to do when something goes wrong
If you suspect the agent did something it shouldn’t have:- Pause the user’s task. Cancel
/execute; don’t let it continue. - Capture the trajectory. The brain has a
trajectoriesrow for the task. Pull it before it gets compacted into long-term memory. - Pause the same agent for other users if the issue is policy-level.
- Investigate. Trajectory + logs + the MCP server’s logs (if relevant). Find the root cause.
- Patch the agent’s prompt or policy. Add a refusal or a permission check.
- Add a regression eval. Make sure this exact scenario can’t happen again.
- Tell affected users if anything material happened.
See also
- Threat model — the full internal model.
- Permissions — how the permission system gates risky operations.
- Code execution — sandbox guarantees.
- Error handling — handling refusals and permission denials.

