The context window is your most expensive resource. Every token in context costs money, adds latency, and competes with every other token for the model's attention. This page explains how the Engine assembles context, what you can do to keep it sharp, and what to do when it grows beyond what one round-trip can hold.
## What the model sees
When the agent loop calls the model on a turn, it assembles context in this order:

1. The agent's system prompt. The persistent instructions for this agent.
2. The soul. The user's core identity and learned patterns.
3. Retrieved long-term memory. Relevant episodes, knowledge facts, and social-graph entries pulled by similarity to the latest message.
4. Conversation history. Past turns in this task, including the tool calls and results.
5. Working memory. Notes, observations, and summaries the agent has accumulated within this task.
6. Tool definitions. The catalog of tools the agent can call.
7. The current user message.
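The ordering above can be sketched as a plain function. This is illustrative only: `build_context` and its argument names are hypothetical, not the Engine's actual API.

```python
# Illustrative sketch of the assembly order described above.
# All names here (build_context, the block labels) are hypothetical.

def build_context(system_prompt, soul, retrieved_memory, history,
                  working_memory, tool_defs, user_message):
    """Assemble context blocks in the order the Engine uses."""
    return [
        {"role": "system", "block": "system_prompt", "content": system_prompt},
        {"role": "system", "block": "soul", "content": soul},
        {"role": "system", "block": "long_term_memory", "content": retrieved_memory},
        *history,  # past turns in this task, including tool calls/results
        {"role": "system", "block": "working_memory", "content": working_memory},
        {"role": "system", "block": "tools", "content": tool_defs},
        {"role": "user", "content": user_message},
    ]

ctx = build_context(
    "You are a helpful agent.", "User prefers brevity.",
    ["episode: billing migration discussed"],
    [{"role": "assistant", "content": "ok"}],
    ["note: tests pass"], ["search_docs"], "What's next?",
)
print(len(ctx))  # 7 blocks: 6 fixed slots + 1 history turn
```

Note that the stable blocks (system prompt, soul) sit at the front, which matters for the caching discussion below.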
## Long-term vs. working memory
Two different stores serve two different purposes:

| | Long-term | Working |
|---|---|---|
| Scope | Across tasks, across sessions | Within one task |
| Lifetime | Indefinite (with confidence decay) | TTL-bound |
| Population | Learning Centre batches; explicit writes | Agent writes during execution |
| Retrieval | Semantic + keyword search | Direct read |
| Cost | One memory call per turn | Inline in context |
## Designing for retrieval
The agent's long-term memory only helps if the right thing comes back when the agent searches it. A few patterns:

### Write structured episodes
When you write to episodic memory, give the entry a clear key, a description, and a value. Avoid vague entries like `"description": "User said something about databases."`; they don't surface in search.
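For contrast, here is a structured entry next to a vague one. The key/description/value shape follows the fields named above; the exact schema is illustrative, not the Engine's actual one.

```python
# A structured episode entry vs. a vague one. The field names follow
# the key / description / value shape described above; the exact
# schema is illustrative.

good = {
    "key": "db-migration-2024-q3",
    "description": "User is migrating the billing service from MySQL 5.7 "
                   "to Postgres 15; cutover planned for end of Q3.",
    "value": {"source": "mysql-5.7", "target": "postgres-15",
              "deadline": "2024-09-30"},
}

bad = {"description": "User said something about databases."}

# A retrieval query like "postgres migration deadline" can match the
# good entry on several distinct terms; the bad one gives search
# almost nothing to latch onto.
```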
### Use knowledge facts for invariants
Knowledge facts are typed assertions with confidence and validity. They are useful for things that are true now but may not be true forever. When a fact changes, expire the old entry (set its `valid_until`) and write a new one. Don't edit in place.
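The expire-then-append pattern might look like the following. The fact schema and `supersede` helper are hypothetical; only `valid_until` is named above.

```python
from datetime import datetime, timezone

# Hypothetical shape of a knowledge fact: a typed assertion with a
# confidence score and a validity window. To change a fact, expire the
# old entry (set valid_until) and append a new one -- never edit in place.

facts = [
    {"subject": "billing-service", "predicate": "runs_on",
     "object": "mysql-5.7", "confidence": 0.95,
     "valid_from": "2023-01-10T00:00:00Z", "valid_until": None},
]

def supersede(facts, subject, predicate, new_object, confidence):
    now = datetime.now(timezone.utc).isoformat()
    for f in facts:
        if (f["subject"], f["predicate"]) == (subject, predicate) \
                and f["valid_until"] is None:
            f["valid_until"] = now  # expire the old fact, don't edit it
    facts.append({"subject": subject, "predicate": predicate,
                  "object": new_object, "confidence": confidence,
                  "valid_from": now, "valid_until": None})

supersede(facts, "billing-service", "runs_on", "postgres-15", 0.9)
current = [f for f in facts if f["valid_until"] is None]
print(current[0]["object"])  # postgres-15
```

Keeping the expired entry preserves history: the agent can still answer "what did this used to run on?"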
### Don't write low-value entries
Memory you write but never retrieve is overhead. The Learning Centre batches are designed to filter noise, but explicit writes from your application bypass that filter. Be selective.

## Compaction
When context grows beyond the safe limit, the compaction orchestrator (`compaction_orchestrator.py`) collapses earlier turns. Compaction is on by default (`REACTIVE_COMPACT_ENABLED=true`).
Two strategies:
### Summary collapse
The default. Older turns get replaced with a summary block: “Earlier in this task, we read 5 files, ran tests (passed), and discussed the authentication redesign.” The model knows roughly what happened without the full transcript.

### Structured file extraction
When `FEATURE_STRUCTURED_FILE_EXTRACTION=true` (the default), file contents the agent has read are extracted and preserved separately so the summary collapse doesn't drop them. The agent can re-reference a file's content without re-reading the file.
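The two strategies combine roughly like this. A minimal sketch of the idea, not the orchestrator's actual implementation; the turn shape and `compact` function are invented for illustration.

```python
# Illustrative sketch of the two strategies above (not the compaction
# orchestrator's actual code): older turns collapse into a summary
# block, while file contents the agent has read are extracted and
# preserved separately so the collapse doesn't drop them.

def compact(turns, keep_last=2):
    old, recent = turns[:-keep_last], turns[-keep_last:]
    # Structured file extraction: pull file reads out of the old turns.
    files = {t["path"]: t["content"] for t in old if t.get("kind") == "file_read"}
    # Summary collapse: everything else becomes one summary block.
    summary = {"kind": "summary",
               "content": f"Earlier in this task: {len(old)} turns collapsed."}
    file_blocks = [{"kind": "file", "path": p, "content": c}
                   for p, c in files.items()]
    return [summary, *file_blocks, *recent]

turns = [
    {"kind": "file_read", "path": "auth.py", "content": "def login(): ..."},
    {"kind": "message", "content": "tests passed"},
    {"kind": "message", "content": "now redesigning auth"},
    {"kind": "message", "content": "what's the plan?"},
]
out = compact(turns)
print([t["kind"] for t in out])  # ['summary', 'file', 'message', 'message']
```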
## Designing for compaction
You don't usually call compaction yourself, but you do design for it:

- Short tool outputs win. A tool that returns 50 KB of HTML survives compaction less gracefully than one that returns a 2 KB summary. Prefer summarizing at the tool layer.
- Stable file references. When the agent reads a file, the path becomes the durable reference. Renaming files mid-task breaks the re-reference pattern.
- Working-memory writes. When the agent decides “this is important,” it writes to working memory. Working-memory entries survive compaction better than raw turn content.
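Summarizing at the tool layer can be as simple as stripping markup and capping the payload. A sketch under assumptions: `summarize_html` and the 2 KB budget are illustrative, not an Engine API.

```python
# Sketch of summarizing at the tool layer: return a small text digest
# instead of raw HTML. summarize_html and the 2 KB budget are
# illustrative choices, not part of the Engine.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def summarize_html(raw_html, budget=2048):
    parser = TextExtractor()
    parser.feed(raw_html)
    text = " ".join(parser.chunks)
    return text[:budget]  # hard cap: the tool never returns more than ~2 KB

page = ("<html><body><h1>Status</h1><p>All 5 checks passing.</p>"
        + "<div>boilerplate</div>" * 500 + "</body></html>")
out = summarize_html(page)
print(len(out) <= 2048)  # True
```

A fixed byte budget is crude; a real tool might rank sections by relevance before truncating, but even the crude cap keeps compaction cheap.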
## Cache-friendly structure
Prompt caching is the single biggest cost lever. To benefit:

- Keep the system prompt stable. Don't include per-call data.
- Order context oldest → newest. The Engine does this; don’t fight it.
- Avoid unnecessary tool catalog churn. Reloading the catalog invalidates cache entries.
The Engine emits a `usage` event with `cache_hit_tokens` for every model call. Watch it.
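One way to watch it is to aggregate the events into a hit rate. The event shape here is assumed for illustration; only the `cache_hit_tokens` field is named above, and `input_tokens` is a hypothetical companion field.

```python
# Sketch of monitoring cache effectiveness from usage events. Only
# cache_hit_tokens is documented above; input_tokens and the event
# shape are assumptions for illustration.

def cache_hit_rate(usage_events):
    total = sum(e["input_tokens"] for e in usage_events)
    hits = sum(e["cache_hit_tokens"] for e in usage_events)
    return hits / total if total else 0.0

events = [
    {"input_tokens": 4000, "cache_hit_tokens": 3500},
    {"input_tokens": 4200, "cache_hit_tokens": 3500},
]
print(round(cache_hit_rate(events), 2))  # 0.85
```

A rate that drops after a deploy is a strong hint that something invalidated the stable prefix, e.g. a per-call value leaking into the system prompt or tool catalog churn.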
## Handling very long tasks
Some tasks naturally run for hundreds of turns — coding sessions, research projects, customer support threads. Patterns:

### Compact at task boundaries
When a logical phase ends (“we finished the diagnosis, now we’re moving to the fix”), trigger an explicit summarization. This gives the agent a clean handoff between phases.

### Split tasks
If a “task” is really three different jobs in sequence, give them different `task_id`s. Memory carries forward via long-term memory; the context for each task stays focused.
### Use sub-agents
A planner agent that delegates to sub-agents (coder, tester, etc.) can keep each sub-agent's context narrow. The planner has the high-level view; sub-agents see only what they need. See Multi-agent patterns.

## What not to do
- Don’t put long static content in the system prompt. Use retrieval or tool output. The system prompt should be instructions, not knowledge.
- Don't pass user history in `message`. The Engine maintains it. Sending past turns in `message` doubles them.
- Don't disable compaction in production. When context overflows, the model returns nothing useful. Compaction is the recovery path.
- Don’t write to memory on every turn. The Learning Centre handles the firehose. Explicit writes should be deliberate, not reflexive.
## See also
- Compaction in components — the implementation.
- Cost and latency — the full picture of where context tokens go.
- Memory endpoints — read and write memory directly.

