The Engine can ingest more than text. Images, screenshots, PDFs, and other media flow through the media field on /execute and through content ingestion. This page covers what’s supported, what to set, and how to handle the inevitable edge cases.
What you can send
| Media | Supported on | Notes |
|---|---|---|
| Images (PNG, JPEG, WebP, GIF) | Anthropic, OpenAI, Gemini | All providers handle base64. Some accept URLs. |
| Documents (PDF) | Anthropic (native), others (via extraction) | Anthropic handles PDFs as a media type. Other providers need a text-extraction step. |
| Audio | Gemini, OpenAI (limited) | Use cautiously; provider support varies. |
| Video | Gemini | Long-form support; provider-specific limits. |
Sending an image
Inline the image as a base64 data URL or pass a public URL.

Limits and feature flags
Several env vars control how the Engine handles media. All of them are "true" / "false" strings.
| Flag | Default | Purpose |
|---|---|---|
| MEDIA_PRESEND_VALIDATION_ENABLED | "false" | Validate size/content before sending to the provider. Saves a round-trip on bad inputs. |
| MEDIA_IMAGE_RESIZE_ENABLED | "false" | Resize oversized images via Pillow before sending. Useful for cost control. |
| MEDIA_ERROR_RECOVERY_ENABLED | "false" | Catch provider errors caused by media, strip the offending block, and retry. |
| CONTENT_INGESTION_ENABLED | "false" | Route large media into the ingestion_chunks table for tool-mediated retrieval rather than dropping it into context. |
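Tying the sections above together, here is a hedged sketch of an /execute request carrying an inline image. The payload shape and field names ("input", "media", "type", "source") are illustrative assumptions, not the documented schema; consult the POST /execute reference for the real one.

```python
import json

# Hypothetical /execute payload with one image block.
# Field names are assumptions for illustration only.
payload = {
    "input": "What does this chart show?",
    "media": [
        {
            "type": "image",
            # One source type per block: here, a public URL.
            "source": {"kind": "url", "url": "https://example.com/chart.png"},
        }
    ],
}

print(json.dumps(payload, indent=2))
```

A base64 source would replace the "source" object with the encoded bytes plus an explicit media_type (see Pitfalls below).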
Image size guidance
Each provider has its own limits, but as a working rule:
- Aim for the longest edge to be under 2048 pixels.
- Aim for the file size to be under 5 MB (some providers cap at 5 MB, some at 20 MB).
- For dense documents, send each page as a separate image rather than a single tall composite.
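As an illustration of the longest-edge rule, a small helper (not part of the Engine; the built-in MEDIA_IMAGE_RESIZE_ENABLED path uses Pillow) can compute target dimensions that preserve aspect ratio:

```python
def fit_within(width: int, height: int, max_edge: int = 2048) -> tuple[int, int]:
    """Scale dimensions down so the longest edge is at most max_edge,
    preserving aspect ratio. Returns the input unchanged if it already fits."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return round(width * scale), round(height * scale)
```

The resulting dimensions could then be handed to a resizer such as Pillow's Image.thumbnail before encoding.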
What works well
- Charts and plots. Models do excellent OCR on labeled axes.
- Code on screen. Screenshots of code or terminal output are reliably read.
- Whiteboards and sketches. With a clear photo, models can read hand-drawn diagrams.
- Document pages. PDFs (Anthropic native) or page-by-page images.
What doesn’t
- Tiny text. Below ~12px effective height, OCR degrades.
- Stylized fonts. Decorative typefaces confuse OCR.
- Color-coded information. Models report colors but reason about them less reliably than positions or labels.
- Counting many small objects. Anything beyond ~10 distinct items in one image becomes unreliable.
Content ingestion (large media)
For media that’s too large to drop into context — long PDFs, video transcripts, multi-image gallery sets — use content ingestion instead of inline media.
When CONTENT_INGESTION_ENABLED=true, large media is chunked, embedded, and stored in ingestion_chunks. The agent then has tools to retrieve relevant chunks on demand.
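What those retrieval tools look like is deployment-specific. As a hedged sketch only: real setups score chunks by embedding similarity, but a keyword-overlap scorer shows the shape of tool-mediated retrieval over ingestion_chunks (the function and field names here are assumptions, not the Engine's tool API):

```python
def retrieve_chunks(query: str, chunks: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top_k chunks sharing the most words with the query.
    Stand-in for embedding-similarity retrieval over ingestion_chunks."""
    terms = set(query.lower().split())
    scored = []
    for chunk in chunks:
        score = len(terms & set(chunk["text"].lower().split()))
        if score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```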
This pattern is essential for any deployment that handles document
analysis at scale.
Multimodal output
Today, the Engine streams text and thinking. It does not generate images or audio. If you need image generation, route through a separate tool (an MCP connector to a generation API, or a custom platform tool that calls the provider’s image API) and surface the result as a media block in a follow-up turn.

Pitfalls
- Forgetting the media_type. Required for base64 sources. Without it the provider returns a generic error.
- Mixing URL and base64 in one block. Pick one source type per block.
- Sending the same image twice in a thread. Wastes tokens and money. Reference past turns instead — “in the chart you saw earlier…”
- Assuming the model “remembers” images across turns. It does, for as long as they’re in context. Compaction can drop them. If the conversation is long, re-attach the relevant image when needed.
- PII in screenshots. Screenshots often contain more than the user intended (chat panes, browser tabs). For sensitive deployments, apply image redaction before send.
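The first two pitfalls can be avoided mechanically. A hedged sketch of a helper that always sets media_type on base64 sources (the block shape is an assumption for illustration, not the Engine's schema):

```python
import base64

def image_block(raw_bytes: bytes, media_type: str) -> dict:
    """Build a base64 image block with media_type always set.
    Field names are hypothetical; check the /execute media reference."""
    return {
        "type": "image",
        "source": {
            "kind": "base64",
            "media_type": media_type,  # e.g. "image/png", "image/jpeg"
            "data": base64.b64encode(raw_bytes).decode("ascii"),
        },
    }
```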
See also
- POST /execute — the media field reference.
- Environment variables — media flags.
- Cost and latency — image tokens are expensive; size matters.

