The Engine can ingest more than text. Images, screenshots, PDFs, and other media flow through the media field on /execute and through content ingestion. This page covers what’s supported, what to set, and how to handle the inevitable edge cases.
What you can send
| Media | Supported on | Notes |
|---|---|---|
| Images (PNG, JPEG, WebP, GIF) | Anthropic, OpenAI, Gemini | All providers handle base64. Some accept URLs. |
| Documents (PDF) | Anthropic (native), others (via extraction) | Anthropic handles PDFs as a media type. Other providers need a text-extraction step. |
| Audio | Gemini, OpenAI (limited) | Use cautiously; provider support varies. |
| Video | Gemini | Long-form support; provider-specific limits. |
Sending an image
Inline the image as a base64 data URL or pass a public URL.

Limits and feature flags
Several env vars control how the Engine handles media. All of them are "true" / "false" strings.
| Flag | Default | Purpose |
|---|---|---|
| MEDIA_PRESEND_VALIDATION_ENABLED | "false" | Validate size/content before sending to the provider. Saves a round-trip on bad inputs. |
| MEDIA_IMAGE_RESIZE_ENABLED | "false" | Resize oversized images via Pillow before sending. Useful for cost control. |
| MEDIA_ERROR_RECOVERY_ENABLED | "false" | Catch provider errors caused by media, strip the offending block, and retry. |
| CONTENT_INGESTION_ENABLED | "false" | Route large media into the ingestion_chunks table for tool-mediated retrieval rather than dropping it into context. |
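Tying the sections above together, here is a hedged sketch of an /execute request carrying an inline image. The payload shape and field names ("input", "media", "type", "source") are illustrative assumptions, not the documented schema; consult the POST /execute reference for the real one.

```python
import json

# Hypothetical /execute payload with one image block.
# Field names are assumptions for illustration only.
payload = {
    "input": "What does this chart show?",
    "media": [
        {
            "type": "image",
            # One source type per block: here, a public URL.
            "source": {"kind": "url", "url": "https://example.com/chart.png"},
        }
    ],
}

print(json.dumps(payload, indent=2))
```

A base64 source would replace the "source" object with the encoded bytes plus an explicit media_type (see Pitfalls below).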
Image size guidance
Each provider has its own limits, but as a working rule:
- Aim for the longest edge to be under 2048 pixels.
- Aim for the file size to be under 5 MB (some providers cap at 5 MB, some at 20 MB).
- For dense documents, send each page as a separate image rather than a single tall composite.
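As an illustration of the longest-edge rule, a small helper (not part of the Engine; the built-in MEDIA_IMAGE_RESIZE_ENABLED path uses Pillow) can compute target dimensions that preserve aspect ratio:

```python
def fit_within(width: int, height: int, max_edge: int = 2048) -> tuple[int, int]:
    """Scale dimensions down so the longest edge is at most max_edge,
    preserving aspect ratio. Returns the input unchanged if it already fits."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return round(width * scale), round(height * scale)
```

The resulting dimensions could then be handed to a resizer such as Pillow's Image.thumbnail before encoding.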
What works well
- Charts and plots. Models do excellent OCR on labeled axes.
- Code on screen. Screenshots of code or terminal output are reliably read.
- Whiteboards and sketches. With a clear photo, models can read hand-drawn diagrams.
- Document pages. PDFs (Anthropic native) or page-by-page images.
What doesn’t
- Tiny text. Below ~12px effective height, OCR degrades.
- Stylized fonts. Decorative typefaces confuse OCR.
- Color-coded information. Models report colors but reason about them less reliably than positions or labels.
- Counting many small objects. Anything beyond ~10 distinct items in one image becomes unreliable.
Content ingestion (large media)
For media that’s too large to drop into context — long PDFs, video transcripts, multi-image gallery sets — use content ingestion instead of inline media.
When CONTENT_INGESTION_ENABLED=true, large media is chunked, embedded, and stored in ingestion_chunks. The agent then has tools to retrieve relevant chunks on demand.
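What those retrieval tools look like is deployment-specific. As a hedged sketch only: real setups score chunks by embedding similarity, but a keyword-overlap scorer shows the shape of tool-mediated retrieval over ingestion_chunks (the function and field names here are assumptions, not the Engine's tool API):

```python
def retrieve_chunks(query: str, chunks: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top_k chunks sharing the most words with the query.
    Stand-in for embedding-similarity retrieval over ingestion_chunks."""
    terms = set(query.lower().split())
    scored = []
    for chunk in chunks:
        score = len(terms & set(chunk["text"].lower().split()))
        if score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```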
This pattern is essential for any deployment that handles document
analysis at scale.
Multimodal output
Today, the Engine streams text and thinking. It does not generate images or audio. If you need image generation, route through a separate tool (an MCP connector to a generation API, or a custom platform tool that calls the provider’s image API) and surface the result as a media block in a follow-up turn.

Pitfalls
- Forgetting the media_type. Required for base64 sources. Without it the provider returns a generic error.
- Mixing URL and base64 in one block. Pick one source type per block.
- Sending the same image twice in a thread. Wastes tokens and money. Reference past turns instead — “in the chart you saw earlier…”
- Assuming the model “remembers” images across turns. It does, for as long as they’re in context. Compaction can drop them. If the conversation is long, re-attach the relevant image when needed.
- PII in screenshots. Screenshots often contain more than the user intended (chat panes, browser tabs). For sensitive deployments, apply image redaction before send.
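The first two pitfalls can be avoided mechanically. A hedged sketch of a helper that always sets media_type on base64 sources (the block shape is an assumption for illustration, not the Engine's schema):

```python
import base64

def image_block(raw_bytes: bytes, media_type: str) -> dict:
    """Build a base64 image block with media_type always set.
    Field names are hypothetical; check the /execute media reference."""
    return {
        "type": "image",
        "source": {
            "kind": "base64",
            "media_type": media_type,  # e.g. "image/png", "image/jpeg"
            "data": base64.b64encode(raw_bytes).decode("ascii"),
        },
    }
```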
See also
- POST /execute — the media field reference.
- Environment variables — media flags.
- Cost and latency — image tokens are expensive; size matters.

