The simplest thing the Engine does is generate text. You send a message; the agent thinks; you get text back. Everything else (tool use, memory, permissions) is built on top of that primitive. This page explains the text-generation surface, the parameters you control, and the patterns that work.
## The smallest call
The smallest call is a single user message with no prior history. The reply streams back as `text_delta` events; concatenate them in order to get the full reply. See quickstart-curl for how to consume the stream.
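A minimal sketch of that round trip in Python, assuming a hypothetical `POST /v1/messages` endpoint, a `{task_id, message}` payload, and a flat `text_delta` event shape. None of these are confirmed here; quickstart-curl documents the real surface.

```python
import json
import requests

# Hypothetical endpoint and payload shape; check quickstart-curl
# for the Engine's real surface before copying this.
ENGINE_URL = "http://localhost:8080/v1/messages"

def ask(task_id: str, message: str) -> str:
    """Send one message and concatenate the streamed text_delta events."""
    reply = []
    with requests.post(
        ENGINE_URL,
        json={"task_id": task_id, "message": message},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames arrive as "data: {...}"; skip blanks and keep-alives.
            if not line or not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            if event.get("type") == "text_delta":
                reply.append(event["text"])  # assumed event shape
    return "".join(reply)

print(ask("task-123", "Say hello in one sentence."))
```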
## What controls the output
Three things shape the text you get back: the model, the system prompt, and the conversation history.

### Model
The model is set at the deployment level via `LLM_PROVIDER` and `LLM_MODEL`. Different models produce noticeably different output:
- Claude Sonnet 4.5 / 4.6 / 4.7 — long-form writing, careful reasoning, good default for production agents.
- GPT-5.x — strong tool use, structured output, faster than Sonnet at similar quality on many tasks.
- Gemini 2.x Pro — long context, multimodal, cost-efficient at scale.
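A minimal sketch of reading those variables at boot. The variable names come from this page; the fallback values are illustrative, not the Engine's real defaults.

```python
import os

# LLM_PROVIDER / LLM_MODEL are resolved once, at deployment level.
# The defaults below are made up for illustration.
provider = os.environ.get("LLM_PROVIDER", "anthropic")
model = os.environ.get("LLM_MODEL", "claude-sonnet-4-5")

print(f"Engine will call {provider}:{model} for every turn")
```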
### System prompt
The system prompt comes from the agent definition the Engine is running. A typical agent prompt is a few hundred words: who the agent is, what voice it uses, what tools it has, what it should refuse. The Engine ships a small catalog of stock agents (analyst, coder, researcher, writer); you can add custom ones. The agent’s prompt is invisible to the caller: you don’t pass it on each call; it’s loaded from the catalog at boot.

### Conversation history
Every call with the same `task_id` continues the same thread. The Engine prepends the prior turns automatically. This is why “remember when we talked about X yesterday” works: the relevant past is retrieved from memory and slotted into context before the model is called.

To start fresh, use a new `task_id`.
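Continuing the `ask` sketch from above, thread continuity then looks like this:

```python
import uuid

# Same task_id -> same thread: the Engine prepends prior turns itself.
thread = str(uuid.uuid4())
ask(thread, "My favourite colour is teal.")
print(ask(thread, "What's my favourite colour?"))  # history supplies "teal"

# New task_id -> fresh thread, no shared history.
print(ask(str(uuid.uuid4()), "What's my favourite colour?"))  # it can't know
```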
## Output length
The Engine sends `MAX_TOKENS` (default 4096) as the maximum output length. Long-form work hits this ceiling routinely. Three options:

- Raise `MAX_TOKENS`. The most direct fix. Watch your costs: output tokens are usually 4–5× more expensive than input tokens.
- Continue the call. When the model returns `stop_reason: max_tokens`, send a follow-up turn asking it to continue. The Engine treats this like any other turn (see the sketch after this list).
- Token budget continuation. Set `TOKEN_BUDGET_ENABLED=true` to let the Engine continue automatically until the model stops on its own terms. Useful for documents that don’t fit in one round-trip.
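A sketch of the manual continuation loop, building on the earlier `ask` sketch (same imports, `ENGINE_URL`, and assumed event shapes). The `message_stop` event carrying the final stop reason is also an assumption, not a documented shape.

```python
def ask_with_stop(task_id: str, message: str) -> tuple[str, str]:
    """Like ask() above, but also capture the final stop_reason."""
    reply, stop_reason = [], "end_turn"
    with requests.post(
        ENGINE_URL,
        json={"task_id": task_id, "message": message},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            if event.get("type") == "text_delta":
                reply.append(event["text"])
            elif event.get("type") == "message_stop":  # assumed event name
                stop_reason = event.get("stop_reason", "end_turn")
    return "".join(reply), stop_reason

def ask_full(task_id: str, message: str) -> str:
    """Manual continuation: keep nudging until the model stops itself."""
    chunks, prompt = [], message
    while True:
        text, stop = ask_with_stop(task_id, prompt)
        chunks.append(text)
        if stop != "max_tokens":
            return "".join(chunks)
        # Same task_id, so the Engine replays history; just nudge onward.
        prompt = "Continue exactly where you left off."
```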
## Streaming patterns
Text comes back as `text_delta` events. A few patterns to know:
### Simple concatenation
For a single text reply, accumulate `text_delta.text` in order and render. This is what every quickstart shows (and what the `ask` sketch above does).
### Block-aware rendering
A turn can emit text from multiple blocks (the model thought, called a tool, then resumed text). Use `content_block_start` and `content_block_stop` to keep blocks separate in your UI:
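A sketch of block-aware accumulation. It assumes each event carries an `index` field identifying its block; that field is an assumption, so check the Engine's event schema.

```python
def collect_blocks(events) -> list[str]:
    """Group streamed deltas into ordered blocks instead of one string."""
    blocks: dict[int, list[str]] = {}  # block index -> text fragments
    for event in events:
        kind = event.get("type")
        if kind == "content_block_start":
            blocks.setdefault(event["index"], [])  # assumed: events carry an index
        elif kind == "text_delta":
            blocks.setdefault(event["index"], []).append(event["text"])
        elif kind == "content_block_stop":
            pass  # block complete; a UI would finalize its rendering here
    return ["".join(parts) for _, parts in sorted(blocks.items())]
```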
### Thinking vs. text
If the model emits thinking blocks, they arrive as `thinking_delta` events. You probably want to render them in a different region of the UI (or hide them entirely). Treat them like `text_delta` (concatenate in order) but route them to a different sink.
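A sketch of that routing, with the two sinks as plain lists and the same assumed event shapes as above:

```python
def route_deltas(events, text_sink: list, thinking_sink: list) -> None:
    """Concatenate both delta kinds in order, into separate sinks."""
    for event in events:
        if event.get("type") == "text_delta":
            text_sink.append(event["text"])      # main reply region
        elif event.get("type") == "thinking_delta":
            thinking_sink.append(event["text"])  # collapsible/hidden region
```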
## Stop reasons
Every turn ends with the model emitting one of:

| Stop reason | What it means |
|---|---|
| `end_turn` | The model believes the answer is complete. |
| `stop_sequence` | The model hit a configured stop sequence. |
| `max_tokens` | Output ceiling. Continue if you want more. |
| `tool_use` | Internal: the model wants to call a tool. The loop continues. |
| `pause_turn` | The model wants to pause and resume. The Engine handles this. |

Don’t treat `tool_use` or `pause_turn` as turn-end signals from outside the Engine; the loop handles them and keeps going.
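A sketch of how a caller might dispatch on the final stop reason, reusing the assumed shapes from the earlier sketches:

```python
def handle_stop(stop_reason: str) -> str:
    """Map a final stop_reason to what the caller should do next."""
    if stop_reason in ("end_turn", "stop_sequence"):
        return "done"      # render and move on
    if stop_reason == "max_tokens":
        return "continue"  # send a follow-up turn (see ask_full above)
    # tool_use / pause_turn never surface here: the Engine's loop
    # consumes them internally and keeps the turn going.
    raise ValueError(f"unexpected stop_reason: {stop_reason}")
```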
## Patterns
### Short replies
For chat-style interactions, set the system prompt to bias short replies (“Respond in one or two sentences unless the user asks for more”) and keep `MAX_TOKENS` low (256–512). Cheaper, faster, less filler.
### Long-form
For document drafting, set `MAX_TOKENS` high (8K–16K) or enable `TOKEN_BUDGET_ENABLED=true`. Use the `LIGHT_MODEL` for compaction so the heavy model spends its tokens on the document, not on summaries.
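The two profiles side by side, as illustrative settings. The variable names come from this page; the exact values follow its suggested ranges and are not Engine defaults.

```python
# Illustrative deployment profiles; values follow this page's guidance.
CHAT_PROFILE = {
    "MAX_TOKENS": "512",             # short replies: cheap and fast
}
LONGFORM_PROFILE = {
    "MAX_TOKENS": "16384",           # room for a full document
    "TOKEN_BUDGET_ENABLED": "true",  # let the Engine auto-continue
}
```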
### Voice consistency
The agent’s voice comes from three places: the agent’s system prompt, the soul (which records “this user prefers concise” learned from past turns), and the recent conversation. To keep voice consistent across many sessions, write to the soul deliberately; that’s what the Learning Centre is for.

## Pitfalls
- Hallucinated facts. Plain text generation is the most prone to hallucination. If accuracy matters, give the agent a tool to look the fact up. Don’t rely on the model’s training cutoff.
- Verbose preambles. Many models start with “Sure, I can help with that…” Strip these in your system prompt: “Skip preambles.”
- Drift on long threads. Around 50+ turns, voice and follow-through start to slip. Compact, restart, or split the task.
## See also
- Streaming — patterns for consuming the SSE stream.
- Cost and latency — output tokens are where money goes.
- Prompt engineering — what to put in the system prompt.

