
The simplest thing the Engine does is generate text. You send a message; the agent thinks; you get text back. Everything else — tool use, memory, permissions — is built on top of that primitive. This page explains the text-generation surface, the parameters you control, and the patterns that work.

The smallest call

curl -N -X POST "$ENGINE_URL/execute" \
  -H "Content-Type: application/json" \
  -H "X-Engine-Key: $ENGINE_KEY" \
  -d '{
    "message": "Write a one-sentence definition of leverage.",
    "task_id": "demo-text-001"
  }'
The response streams back as text_delta events. Concatenate them in order to get the full reply. See quickstart-curl for how to consume the stream.
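
If you are consuming the stream directly, the loop is small. A minimal sketch in Python, assuming the events arrive as server-sent `data: {json}` lines — the exact framing is an assumption here; check quickstart-curl for what your build emits:

```python
import json

def parse_sse_lines(lines):
    """Yield event dicts from raw SSE lines of the form 'data: {...}'.

    The 'data: <json>' framing is an assumption; verify against the quickstart.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

def collect_text(events):
    """Concatenate text_delta events, in order, into the full reply."""
    return "".join(e["text"] for e in events if e.get("type") == "text_delta")

raw = [
    'data: {"type": "text_delta", "text": "Leverage is "}',
    'data: {"type": "text_delta", "text": "borrowed capital."}',
    'data: {"type": "content_block_stop"}',
]
print(collect_text(parse_sse_lines(raw)))
# -> Leverage is borrowed capital.
```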

What controls the output

Three things shape the text you get back: the model, the system prompt, and the conversation history.

Model

The model is set at the deployment level via LLM_PROVIDER and LLM_MODEL. Different models give different output:
  • Claude Sonnet 4.5 / 4.6 / 4.7 — long-form writing, careful reasoning, good default for production agents.
  • GPT-5.x — strong tool use, structured output, faster than Sonnet at similar quality on many tasks.
  • Gemini 2.x Pro — long context, multimodal, cost-efficient at scale.
You don’t pick the model per call; you pick it per Engine instance. To serve users on different models, run different Engines.
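
In deployment terms, model selection is two environment variables. The values below are illustrative, not a list of supported identifiers:

```shell
# Deployment-level model selection (illustrative values -- use the
# provider and model identifiers your deployment actually supports).
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-5
```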

System prompt

The system prompt comes from the agent definition the Engine is running. A typical agent prompt is a few hundred words: who the agent is, what voice it uses, what tools it has, what it should refuse. The Engine ships a small catalog of stock agents (analyst, coder, researcher, writer); you can add custom ones. The agent’s prompt is invisible to the caller — you don’t pass it on each call. It’s loaded from the catalog at boot.

Conversation history

Every call with the same task_id continues the same thread. The Engine prepends the prior turns automatically. This is why “remember when we talked about X yesterday” works — the relevant past is retrieved from memory and slotted into context before the model is called. To start fresh, use a new task_id.
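
Concretely, continuing a thread is nothing more than reusing the task_id in the request body. A sketch — the `make_payload` helper is hypothetical; the fields match the /execute example above:

```python
def make_payload(message, task_id):
    """Build the JSON body for POST /execute (fields as in the curl example)."""
    return {"message": message, "task_id": task_id}

# Same task_id: the Engine prepends the earlier turns automatically.
first = make_payload("Summarize our pricing discussion.", "demo-text-001")
followup = make_payload("Now draft an email about it.", "demo-text-001")

# New task_id: a fresh thread with no prior context.
fresh = make_payload("Unrelated question.", "demo-text-002")
```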

Output length

The Engine sends MAX_TOKENS (default 4096) as the maximum output length. Long-form work hits this ceiling routinely. Three options:
  1. Raise MAX_TOKENS. The most direct fix. Watch your costs — output tokens are usually 4–5× more expensive than input tokens.
  2. Continue the call. When the model returns stop_reason: max_tokens, send a follow-up turn asking it to continue. The Engine treats this like any other turn.
  3. Token budget continuation. Set TOKEN_BUDGET_ENABLED=true to let the Engine continue automatically until the model stops on its own terms. Useful for documents that don’t fit in one round-trip.
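
Option 2 can be wrapped in a small client-side loop. A sketch, assuming a `call(message, task_id)` function that performs one turn and returns `(text, stop_reason)` — that helper is a stand-in for your own client code, not an Engine API:

```python
def run_to_completion(call, message, task_id, max_rounds=5):
    """Call once, then keep asking to continue while stop_reason is max_tokens."""
    text, stop_reason = call(message, task_id)
    parts = [text]
    rounds = 1
    while stop_reason == "max_tokens" and rounds < max_rounds:
        chunk, stop_reason = call("Continue exactly where you left off.", task_id)
        parts.append(chunk)
        rounds += 1
    return "".join(parts)

# A fake client that hits the ceiling once, then finishes.
replies = iter([("part one, ", "max_tokens"), ("part two.", "end_turn")])
result = run_to_completion(lambda m, t: next(replies), "Write the report.", "demo-text-001")
# result == "part one, part two."
```

The `max_rounds` cap guards against a model that never reaches end_turn.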

Streaming patterns

Text comes back as text_delta events. A few patterns to know:

Simple concatenation

For a single text reply, accumulate text_delta.text in order and render. This is what every quickstart shows.

Block-aware rendering

A turn can emit text from multiple blocks (the model thought, called a tool, then resumed text). Use content_block_start and content_block_stop to keep blocks separate in your UI:
current_block = None

for event in stream:
    if event.type == "content_block_start":
        # A new block (text, thinking, or tool use) is opening.
        current_block = {"index": event.block_index, "text": ""}
    elif event.type == "text_delta" and current_block is not None:
        current_block["text"] += event.text
        ui.render_block(current_block)  # re-render as text accumulates
    elif event.type == "content_block_stop":
        ui.finalize_block(current_block)
        current_block = None

Thinking vs. text

If the model emits thinking blocks, they arrive as thinking_delta events. You probably want to render them in a different region of the UI (or hide them entirely). Treat them like text_delta — concatenate in order — but route them to a different sink.
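
A small sketch of that routing; the assumption that thinking_delta carries a `text` field mirroring text_delta should be verified against your build:

```python
def route_deltas(events):
    """Concatenate text and thinking deltas into separate sinks, in order."""
    text, thinking = [], []
    for event in events:
        if event["type"] == "text_delta":
            text.append(event["text"])
        elif event["type"] == "thinking_delta":
            # Field name assumed to mirror text_delta; verify against your build.
            thinking.append(event["text"])
    return "".join(text), "".join(thinking)

visible, hidden = route_deltas([
    {"type": "thinking_delta", "text": "Consider definitions of leverage..."},
    {"type": "text_delta", "text": "Leverage is borrowed capital."},
])
```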

Stop reasons

Every turn ends with the model emitting one of:
  • end_turn — The model believes the answer is complete.
  • stop_sequence — The model hit a configured stop sequence.
  • max_tokens — The output ceiling was reached; continue if you want more.
  • tool_use — Internal: the model wants to call a tool, and the loop continues.
  • pause_turn — The model wants to pause and resume; the Engine handles this.
You don’t see tool_use or pause_turn as turn-end signals from outside the Engine — the loop handles them and keeps going.

Patterns

Short replies

For chat-style interactions, set the system prompt to bias short replies (Respond in one or two sentences unless the user asks for more) and keep MAX_TOKENS low (256–512). Cheaper, faster, less filler.

Long-form

For document drafting, set MAX_TOKENS high (8K–16K) or enable TOKEN_BUDGET_ENABLED=true. Use the LIGHT_MODEL for compaction so the heavy model spends its tokens on the document, not on summaries.
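
In environment-variable terms, a long-form setup might look like this — the values are illustrative, and the LIGHT_MODEL identifier is hypothetical:

```shell
MAX_TOKENS=16384               # raise the output ceiling for drafting
TOKEN_BUDGET_ENABLED=true      # or let the Engine continue automatically
LIGHT_MODEL=claude-haiku-4-5   # hypothetical id; handles compaction cheaply
```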

Voice consistency

The agent’s voice comes from three places: the agent’s system prompt, the soul (which records “this user prefers concise” learned from past turns), and the recent conversation. To keep voice consistent across many sessions, write to the soul deliberately — that’s what the Learning Centre is for.

Pitfalls

  • Hallucinated facts. Plain text generation is the most prone to hallucination. If accuracy matters, give the agent a tool to look the fact up. Don’t rely on the model’s training cutoff.
  • Verbose preambles. Many models start with “Sure, I can help with that…” Suppress these in your system prompt: Skip preambles.
  • Drift on long threads. Beyond roughly 50 turns, voice and follow-through start to slip. Compact, restart, or split the task.

See also