This guide builds a research agent end to end. The agent takes a
research question, searches the web, reads sources, and produces a
structured summary with citations. By the end you’ll have a working
agent and a complete trace.
What we’re building
A research agent that:
- Takes a topic or question.
- Plans the research (what to search for).
- Searches the web for primary sources.
- Reads the most relevant pages.
- Synthesizes a structured summary.
- Returns the summary with citations.
The agent uses web_search, web_fetch, and a custom
submit_research_report tool for the structured output.
catalog/agents/researcher/agent.json:
{
  "name": "researcher",
  "description": "An analyst who researches a topic, reads primary sources, and produces a cited summary.",
  "model": "claude-sonnet-4-7",
  "tools": [
    "web_search",
    "web_fetch",
    "memory_episode_write",
    "submit_research_report"
  ],
  "system_prompt_path": "system-prompt.md"
}
catalog/agents/researcher/system-prompt.md:
You are a research analyst. The user gives you a question; you produce a
clear, well-sourced answer.
## How to operate
1. Plan. State (in 1-2 sentences) what you're going to look up and why.
2. Search. Use web_search with focused queries. Don't search for the
exact user question; search for the things that would answer it.
3. Read. Pick the most relevant 2-4 results. Use web_fetch to read them.
4. Synthesize. Combine what you read into a structured summary.
5. Cite. Every claim should reference a specific source URL.
6. Submit. Call submit_research_report with the structured result. Do
not write the report as text in your response.
## What good looks like
- Specific over vague. "The 2024 Q3 report cited 12.4% growth" beats
"the company is growing."
- Cited. Every factual claim links to a source.
- Hedged appropriately. If a source is uncertain, say so.
- Brief. Three to five key points beats fifteen mediocre ones.
## What bad looks like
- Hallucinations. If you don't have a source, don't claim the fact.
- Marketing language. "Industry-leading," "groundbreaking" — cut.
- Search-result noise. Don't paste raw snippets; synthesize.
catalog/tools/submit_research_report.json:
{
  "name": "submit_research_report",
  "description": "Submit the final structured research report. Call this when synthesis is complete. Do not write the report as text.",
  "input_schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "A specific, descriptive title."
      },
      "summary": {
        "type": "string",
        "description": "Two to four sentences summarizing the answer."
      },
      "key_points": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "claim": { "type": "string" },
            "source_url": { "type": "string", "format": "uri" },
            "confidence": {
              "type": "string",
              "enum": ["high", "medium", "low"]
            }
          },
          "required": ["claim", "source_url", "confidence"]
        }
      },
      "open_questions": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Things the research didn't answer."
      }
    },
    "required": ["title", "summary", "key_points"]
  },
  "implementation": {
    "kind": "skill",
    "ref": "submit_research_report"
  }
}
The tool is implemented as a skill that just records the input — your
application reads the structured payload from the tool_call.input.
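What the skill behind this looks like is up to your implementation. A minimal sketch, assuming skills are plain Python callables that receive the tool input and return a string result; the function shape here is illustrative, not the Engine's actual skill interface:

def submit_research_report(tool_input: dict) -> str:
    # The skill doesn't need to do anything with the report itself; the
    # application picks the payload up from the tool_call event in the
    # stream (see Step 5). Returning a short acknowledgement matches the
    # "submitted" tool_result shown in the Step 4 trace.
    return "submitted"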
The researcher needs a web-search key:
BRAVE_API_KEY=...
TAVILY_API_KEY=... # fallback
Reload the Engine.
Step 4 — Send the first request
curl -N -X POST "$ENGINE_URL/execute" \
  -H "Content-Type: application/json" \
  -H "X-Engine-Key: $ENGINE_KEY" \
  -d '{
    "message": "What are the most cited differences between Anthropic Claude Sonnet 4.7 and OpenAI GPT-5.5 for agentic tool use?",
    "task_id": "research-agentic-comparison-001"
  }'
The stream:
event: thread_lifecycle
data: {"phase":"started"}
event: text_delta
data: {"text":"I'll look up benchmarks and developer reports comparing
these two models on agentic tasks."}
event: tool_call
data: {"tool":"web_search","input":{"query":"Claude Sonnet 4.7 vs GPT-5.5 benchmark agentic tool use 2026"}}
event: tool_result
data: {"output":[{"title":"...","url":"...","snippet":"..."}, ...]}
event: tool_call
data: {"tool":"web_fetch","input":{"url":"https://..."}}
event: tool_result
data: {"output":"...page content..."}
[several rounds of search and fetch]
event: tool_call
data: {"tool":"submit_research_report","input":{
"title":"Sonnet 4.7 vs GPT-5.5 on Agentic Tool Use",
"summary":"Both models perform similarly on simple tool use...",
"key_points":[
{"claim":"...","source_url":"https://...","confidence":"high"},
...
],
"open_questions":["..."]
}}
event: tool_result
data: {"output":"submitted"}
event: text_delta
data: {"text":"Research complete. The structured report has been
submitted."}
event: thread_lifecycle
data: {"phase":"completed"}
Your application reads the submit_research_report tool’s input from
the tool_call event — that’s your structured payload, ready to display
or store.
Step 5 — Storing structured output
In your application:
def consume_research_stream(stream):
    # Scan the event stream for the structured report; text deltas and
    # search tool calls can be ignored for storage purposes.
    report = None
    for event_type, data in stream:
        if event_type == "tool_call" and data["tool"] == "submit_research_report":
            report = data["input"]
            break
    return report

report = consume_research_stream(execute(message, task_id))
# report is now the structured object
print(report["title"])
for p in report["key_points"]:
    print(f"- {p['claim']} ({p['source_url']})")
Step 6 — Memory for follow-ups
When the user asks a follow-up question on the same topic, the agent
should remember what it already found. The Engine handles this
automatically:
- Same task_id → conversation history includes the prior research.
- Different task_id → the Learning Centre may have surfaced relevant episodes during retrieval.
You can also write episodes explicitly during the run by giving the
agent a memory_episode_write tool. The agent records “during research
on X, the most authoritative source was Y” — useful for next time.
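A follow-up on the same topic is then just another /execute call that reuses the task_id. A Python sketch of the execute() helper assumed in Step 5, using the same endpoint and headers as the curl call in Step 4; the requests client and the minimal SSE parsing are assumptions, and any streaming HTTP client works:

import json
import os
import requests

def execute(message, task_id):
    # Same endpoint and headers as the curl call in Step 4. Reusing the
    # task_id keeps the earlier research in the conversation history.
    resp = requests.post(
        f"{os.environ['ENGINE_URL']}/execute",
        headers={"X-Engine-Key": os.environ["ENGINE_KEY"]},
        json={"message": message, "task_id": task_id},
        stream=True,
    )
    event_type = None
    for line in resp.iter_lines(decode_unicode=True):
        # Minimal SSE parsing: pair each "event:" line with the "data:" line
        # that follows it.
        if line.startswith("event: "):
            event_type = line[len("event: "):]
        elif line.startswith("data: ") and event_type:
            try:
                yield event_type, json.loads(line[len("data: "):])
            except ValueError:
                continue  # sketch: skip multi-line or partial data payloads

# Illustrative follow-up question, same task_id as the first request.
followup = consume_research_stream(
    execute("Which of those differences matter most for long-running agents?",
            "research-agentic-comparison-001")
)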
Step 7 — Add evals
{
  "id": "researcher-eval-001",
  "category": "research",
  "description": "Agent should produce a structured report with cited sources.",
  "input": {
    "message": "What are the most cited differences between Sonnet 4.7 and GPT-5.5 for agentic tool use?"
  },
  "expected": {
    "must_call_tools": ["web_search", "submit_research_report"],
    "structured_output_schema": "research_report",
    "min_key_points": 3,
    "all_key_points_have_sources": true,
    "max_tokens": 6000
  }
}
Custom scorers check that every key_points[i].source_url is a valid
URL and that the structured output validates against your schema.
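A scorer along those lines, sketched with the standard library plus jsonschema; the function signature is illustrative and should be adapted to however your eval harness invokes custom scorers:

from urllib.parse import urlparse
from jsonschema import ValidationError, validate  # pip install jsonschema

def score_research_report(report, schema):
    # schema is the input_schema from catalog/tools/submit_research_report.json.
    reasons = []
    try:
        validate(instance=report, schema=schema)
    except ValidationError as e:
        reasons.append(f"schema violation: {e.message}")
    for i, point in enumerate(report.get("key_points", [])):
        parsed = urlparse(point.get("source_url", ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            reasons.append(f"key_points[{i}].source_url is not a valid URL")
    return (not reasons, reasons)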
Improvements worth making
Citation verification
Add a verify_url tool that fetches the URL and confirms the cited text actually appears on the page. This catches the model hallucinating sources that sound plausible.
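The verification itself can be small. A sketch of the logic such a tool might run, assuming it receives the URL and the quoted text as input; real pages will need HTML stripping for robust matching:

import re
import requests

def verify_url(url: str, cited_text: str) -> dict:
    # Fetch the page and check the cited text appears in it, with
    # whitespace normalized so line wrapping doesn't cause false negatives.
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()
    try:
        page = requests.get(url, timeout=15).text
    except requests.RequestException as exc:
        return {"verified": False, "reason": f"fetch failed: {exc}"}
    if normalize(cited_text) in normalize(page):
        return {"verified": True, "reason": None}
    return {"verified": False, "reason": "cited text not found on page"}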
Multi-source corroboration
For high-stakes claims, require the agent to cite at least two
independent sources. Adjust the system prompt:
For high-confidence claims, cite at least two independent sources. If
only one source supports a claim, downgrade confidence to medium and
note "single-source."
Long-context Gemini for source-heavy queries
When the user’s question requires reading many long sources, route to
Gemini 2.5 Pro instead of Sonnet:
- Run two Engine instances.
- Switch upstream based on query complexity, as sketched below.
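A minimal routing shim, assuming one Engine configured with Sonnet and one with Gemini 2.5 Pro, each reachable at its own URL; the environment variable names and the "source-heavy" heuristic are illustrative:

import os

# Hypothetical deployment: one Engine instance per model.
SONNET_ENGINE = os.environ["ENGINE_URL_SONNET"]
GEMINI_ENGINE = os.environ["ENGINE_URL_GEMINI"]

SOURCE_HEAVY_HINTS = ("survey", "literature review", "compare all", "every version")

def pick_engine(message: str) -> str:
    # Crude heuristic: long questions and survey-style phrasing tend to
    # require reading many long sources, so route them to the
    # long-context instance.
    if len(message) > 400 or any(h in message.lower() for h in SOURCE_HEAVY_HINTS):
        return GEMINI_ENGINE
    return SONNET_ENGINE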
Domain-specific search tools
For research within a specific domain (legal, medical, financial),
swap web_search for a domain-specific search tool that hits
authoritative databases. The agent’s flow stays the same; the inputs
get better.
Pitfalls
- Hallucinated citations. The model can confidently cite URLs that don’t exist. The verify_url tool above is the durable fix.
- Search loops. The agent searches, doesn’t find what it wants, searches again with a slight variation. Cap iterations: “After 3 searches, commit to what you have or ask the user.”
- One-shot summary masquerading as research. Without explicit instruction to call submit_research_report, the model will write the report as text. The system prompt must enforce the tool call.
- Stale information. Search returns old pages. For time-sensitive questions, prefer searches with date filters or domain-specific tools.
See also