

This guide builds a research agent end to end. The agent takes a research question, searches the web, reads sources, and produces a structured summary with citations. By the end you’ll have a working agent and a complete trace.

What we’re building

A research agent that:
  • Takes a topic or question.
  • Plans the research (what to search for).
  • Searches the web for primary sources.
  • Reads the most relevant pages.
  • Synthesizes a structured summary.
  • Returns the summary with citations.
The agent uses web_search, web_fetch, and a custom submit_research_report tool for the structured output.

Step 1 — Configure the agent

catalog/agents/researcher/agent.json:
{
  "name": "researcher",
  "description": "An analyst who researches a topic, reads primary sources, and produces a cited summary.",
  "model": "claude-sonnet-4-7",
  "tools": [
    "web_search",
    "web_fetch",
    "memory_episode_write",
    "submit_research_report"
  ],
  "system_prompt_path": "system-prompt.md"
}
catalog/agents/researcher/system-prompt.md:
You are a research analyst. The user gives you a question; you produce a
clear, well-sourced answer.

## How to operate

1. Plan. State (in 1-2 sentences) what you're going to look up and why.
2. Search. Use web_search with focused queries. Don't search for the
   exact user question; search for the things that would answer it.
3. Read. Pick the most relevant 2-4 results. Use web_fetch to read them.
4. Synthesize. Combine what you read into a structured summary.
5. Cite. Every claim should reference a specific source URL.
6. Submit. Call submit_research_report with the structured result. Do
   not write the report as text in your response.

## What good looks like

- Specific over vague. "The 2024 Q3 report cited 12.4% growth" beats
  "the company is growing."
- Cited. Every factual claim links to a source.
- Hedged appropriately. If a source is uncertain, say so.
- Brief. Three to five key points beats fifteen mediocre ones.

## What bad looks like

- Hallucinations. If you don't have a source, don't claim the fact.
- Marketing language. "Industry-leading," "groundbreaking" — cut.
- Search-result noise. Don't paste raw snippets; synthesize.

Step 2 — Define the structured output tool

catalog/tools/submit_research_report.json:
{
  "name": "submit_research_report",
  "description": "Submit the final structured research report. Call this when synthesis is complete. Do not write the report as text.",
  "input_schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "A specific, descriptive title."
      },
      "summary": {
        "type": "string",
        "description": "Two to four sentences summarizing the answer."
      },
      "key_points": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "claim": { "type": "string" },
            "source_url": { "type": "string", "format": "uri" },
            "confidence": {
              "type": "string",
              "enum": ["high", "medium", "low"]
            }
          },
          "required": ["claim", "source_url", "confidence"]
        }
      },
      "open_questions": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Things the research didn't answer."
      }
    },
    "required": ["title", "summary", "key_points"]
  },
  "implementation": {
    "kind": "skill",
    "ref": "submit_research_report"
  }
}
The tool is implemented as a skill that simply records its input; your application reads the structured payload from tool_call.input.

Step 3 — Add search credentials

The researcher needs a web-search API key:
BRAVE_API_KEY=...
TAVILY_API_KEY=...   # fallback
Reload the Engine.

Step 4 — Send the first request

curl -N -X POST "$ENGINE_URL/execute" \
  -H "Content-Type: application/json" \
  -H "X-Engine-Key: $ENGINE_KEY" \
  -d '{
    "message": "What are the most cited differences between Anthropic Claude Sonnet 4.7 and OpenAI GPT-5.5 for agentic tool use?",
    "task_id": "research-agentic-comparison-001"
  }'
The stream:
event: thread_lifecycle
data: {"phase":"started"}

event: text_delta
data: {"text":"I'll look up benchmarks and developer reports comparing
        these two models on agentic tasks."}

event: tool_call
data: {"tool":"web_search","input":{"query":"Claude Sonnet 4.7 vs GPT-5.5 benchmark agentic tool use 2026"}}

event: tool_result
data: {"output":[{"title":"...","url":"...","snippet":"..."}, ...]}

event: tool_call
data: {"tool":"web_fetch","input":{"url":"https://..."}}

event: tool_result
data: {"output":"...page content..."}

[several rounds of search and fetch]

event: tool_call
data: {"tool":"submit_research_report","input":{
  "title":"Sonnet 4.7 vs GPT-5.5 on Agentic Tool Use",
  "summary":"Both models perform similarly on simple tool use...",
  "key_points":[
    {"claim":"...","source_url":"https://...","confidence":"high"},
    ...
  ],
  "open_questions":["..."]
}}

event: tool_result
data: {"output":"submitted"}

event: text_delta
data: {"text":"Research complete. The structured report has been
        submitted."}

event: thread_lifecycle
data: {"phase":"completed"}
Your application reads the submit_research_report tool’s input from the tool_call event — that’s your structured payload, ready to display or store.

Step 5 — Storing structured output

In your application:
def consume_research_stream(stream):
    """Return the structured report from the first submit_research_report call."""
    report = None
    for event_type, data in stream:
        # The structured payload is the tool call's input, not the tool result.
        if event_type == "tool_call" and data["tool"] == "submit_research_report":
            report = data["input"]
            break
    return report  # None if the agent never called the tool

report = consume_research_stream(execute(message, task_id))

# report is now the structured object
print(report["title"])
for p in report["key_points"]:
    print(f"- {p['claim']} ({p['source_url']})")

Step 6 — Memory for follow-ups

When the user asks a follow-up question on the same topic, the agent should remember what it already found. The Engine handles this automatically:
  • Same task_id → conversation history includes the prior research.
  • Different task_id → the Learning Centre may have surfaced relevant episodes during retrieval.
The agent can also write episodes explicitly during the run via the memory_episode_write tool it already has in its config. It records things like “during research on X, the most authoritative source was Y”, which is useful next time.
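The exact episode payload depends on how you define memory_episode_write; assuming a free-text content field plus optional tags, a write might show up in the stream like this:
event: tool_call
data: {"tool":"memory_episode_write","input":{"content":"Researching Sonnet 4.7 vs GPT-5.5 agentic tool use: the vendor benchmark write-ups were the most authoritative sources.","tags":["research","model-comparison"]}}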

Step 7 — Add evals

{
  "id": "researcher-eval-001",
  "category": "research",
  "description": "Agent should produce a structured report with cited sources.",
  "input": {
    "message": "What are the most cited differences between Sonnet 4.7 and GPT-5.5 for agentic tool use?"
  },
  "expected": {
    "must_call_tools": ["web_search", "submit_research_report"],
    "structured_output_schema": "research_report",
    "min_key_points": 3,
    "all_key_points_have_sources": true,
    "max_tokens": 6000
  }
}
Custom scorers check that every key_points[i].source_url is a valid URL and that the structured output validates against your schema.
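A minimal scorer sketch, assuming the harness hands you the submitted report as a Python dict (the function name and return shape are illustrative, not part of the Engine):
from urllib.parse import urlparse

def score_research_report(report: dict) -> dict:
    """Structural checks for a submitted research report."""
    key_points = report.get("key_points", [])

    def is_valid_url(url: str) -> bool:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    checks = {
        "min_key_points": len(key_points) >= 3,
        "all_key_points_have_sources": all(
            is_valid_url(p.get("source_url", "")) for p in key_points
        ),
        "valid_confidence_values": all(
            p.get("confidence") in ("high", "medium", "low") for p in key_points
        ),
    }
    return {"passed": all(checks.values()), "checks": checks}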

Improvements worth making

Citation verification

Add a verify_url tool that fetches the URL and confirms the cited text appears in the page. Catches the model hallucinating sources that sound plausible.
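A sketch of the tool definition, following the same shape as submit_research_report.json (the skill ref and field names are placeholders):
catalog/tools/verify_url.json:
{
  "name": "verify_url",
  "description": "Fetch a URL and confirm that a quoted excerpt appears on the page. Use before citing a source.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": { "type": "string", "format": "uri" },
      "quoted_text": {
        "type": "string",
        "description": "A short excerpt the citation relies on."
      }
    },
    "required": ["url", "quoted_text"]
  },
  "implementation": {
    "kind": "skill",
    "ref": "verify_url"
  }
}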

Multi-source corroboration

For high-stakes claims, require the agent to cite at least two independent sources. Adjust the system prompt:
For high-confidence claims, cite at least two independent sources. If
only one source supports a claim, downgrade confidence to medium and
note "single-source."

Long-context Gemini for source-heavy queries

When the user’s question requires reading many long sources, route to Gemini 2.5 Pro instead of Sonnet:
  • Run two Engine instances.
  • Switch upstream based on query complexity (a minimal routing sketch follows).
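A minimal routing sketch, assuming two Engine deployments reachable at separate URLs (the URLs and the heuristic are placeholders):
# Hypothetical Engine endpoints; substitute your own deployments.
SONNET_ENGINE_URL = "https://engine-sonnet.example.internal/execute"
GEMINI_ENGINE_URL = "https://engine-gemini.example.internal/execute"

def pick_engine(message: str, estimated_source_count: int) -> str:
    """Route source-heavy questions to the long-context engine."""
    if estimated_source_count > 10 or len(message) > 4000:
        return GEMINI_ENGINE_URL
    return SONNET_ENGINE_URL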

Domain-specific search tools

For research within a specific domain (legal, medical, financial), swap web_search for a domain-specific search tool that hits authoritative databases. The agent’s flow stays the same; the inputs get better.
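For example, a hypothetical legal-research tool could slot in with the same definition shape (the name, description, and skill ref below are illustrative):
catalog/tools/case_law_search.json:
{
  "name": "case_law_search",
  "description": "Search an authoritative case-law database. Returns citations with court, year, and holding summaries.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "jurisdiction": { "type": "string" }
    },
    "required": ["query"]
  },
  "implementation": {
    "kind": "skill",
    "ref": "case_law_search"
  }
}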

Pitfalls

  • Hallucinated citations. The model can confidently cite URLs that don’t exist. The verify_url tool above is the durable fix.
  • Search loops. The agent searches, doesn’t find what it wants, searches again with a slight variation. Cap iterations: “After 3 searches, commit to what you have or ask the user.”
  • One-shot summary masquerading as research. Without explicit instruction to call submit_research_report, the model will write the report as text. The system prompt must enforce the tool call.
  • Stale information. Search returns old pages. For time-sensitive questions, prefer searches with date filters or domain-specific tools.

See also