mm.notebook — Visualizing Multimodal VLM Context¶
This notebook demonstrates how mm encodes images, PDFs, videos, and audio into VLM-ready message threads, and how render_messages() lets you inspect exactly what the model sees.
Sections:
- Image + text — default
image-resizeencoder - PDF + text —
rasterizeandpage-textencoders - Video encoders —
video-mosaic,video-keyframes, andvideo-summary - Audio + text — Whisper transcript alongside the native player
- Multi-item context — mixed media in a single prompt
This notebook stays focused on visualization and encoder output. For the role-aware Context API itself — Context.add(), mm.Ref, roles, inspection views, provider payloads, and Context.remove() — see mm-context-api.ipynb.
Dataset: ~/data/mmbench-tiny (image, video, PDF, audio).
from pathlib import Path
import mm
from mm.notebook import render_messages
from IPython.display import HTML
DATA = Path.home() / "data/mmbench-tiny"
1. Image + Text¶
Images use the default image-resize encoder unless you override it. This keeps the example simple: the image is resized to a VLM-friendly size and sent as one image part.
Rich context view¶
Shows the native image with resolution, metadata, role, ref id, and collapsible VLM encoding.
!ls -lha ~/data/mmbench-tiny/
with mm.Context() as ctx:
ctx.add(DATA / "car.jpg", role="user", metadata={"note": "VW beetle"})
msgs = ctx.to_messages()
HTML(render_messages(msgs, title="image (default)"))
Default image-resize encoding¶
Calling to_messages() without an encoder override uses image-resize. This is the default path most prompts should start with.
with mm.Context() as ctx:
ctx.add(DATA / "car.jpg")
msgs = ctx.to_messages()
HTML(render_messages(msgs, title="image-resize (default)"))
2. PDF + Text¶
PDFs can be sent visually or textually depending on the task:
rasterizerenders each page as an image, preserving layout and visual details.page-textextracts text only, which is cheaper and easier to inspect when layout is not important.
Rich context view¶
with mm.Context() as ctx:
ctx.add(DATA / "invoice.pdf", metadata={"type": "invoice", "vendor": "ACME Corp"})
msgs = ctx.to_messages()
HTML(render_messages(msgs, title="document (default)"))
rasterize¶
Use rasterize when the visual layout matters: tables, stamps, signatures, formatting, or anything that plain text extraction might lose. When multiple multi-page documents are added as context, we can configure the number of pages per message.
with mm.Context() as ctx:
ctx.add(DATA / "invoice.pdf")
ctx.add(DATA / "attention-is-all-you-need.pdf")
msgs = ctx.to_messages(
encoders={"document": "rasterize"},
encoder_kwargs={"document": {"pages_per_message": 8}},
)
HTML(render_messages(msgs, title="rasterize (pages_per_message=8)"))
page-text¶
Use page-text when you only need the extracted text. It avoids image payloads and makes the VLM input easier to audit for text-heavy documents.
Note: Below is a classic case of an invoice document that is better parsed as a rasterized image. In this case, the invoice being read via the PDF reader returns an empty string.
with mm.Context() as ctx:
ctx.add(DATA / "invoice.pdf")
ctx.add(DATA / "attention-is-all-you-need.pdf")
msgs = ctx.to_messages(
encoders={"document": "page-text"},
encoder_kwargs={"document": {"pages_per_message": 32}},
)
HTML(render_messages(msgs, title="page-text (pages_per_message=32)"))
3. Video Encoders¶
Video is where encoder choice matters most. This section shows three useful strategies on the same clip:
video-mosaicfor a dense visual overview (default).video-keyframesfor low-cost representative frames from the bitstream.video-summaryfor an adaptive compact summary of longer clips.
Rich context view (native video player + default encoding)¶
with mm.Context() as ctx:
ctx.add(DATA / "bakery.mp4")
msgs = ctx.to_messages()
HTML(render_messages(msgs, title="video (default)"))
video-mosaic¶
video-mosaic samples frames and packs them into tiled grids. It is a compact way to see scene coverage, visual progression, and repeated motifs without sending many separate image parts.
with mm.Context() as ctx:
ctx.add(DATA / "bakery.mp4")
msgs = ctx.to_messages(
encoders={"video": "video-mosaic"},
encoder_kwargs={
"video": {"tile_cols": 3, "tile_rows": 3, "thumb_width": 320, "num_mosaics": 4}
},
)
HTML(render_messages(msgs, title="video-mosaic (3x3 grid, 4 mosaics)"))
video-keyframes¶
video-keyframes pulls I-frames directly from the video stream. It is cheap and fast because it uses frames the codec already marks as important, making it a good first-pass overview.
with mm.Context() as ctx:
ctx.add(DATA / "bakery.mp4")
msgs = ctx.to_messages(
encoders={"video": "video-keyframes"},
encoder_kwargs={
"video": {"max_keyframes": 32, "max_width": 512, "max_keyframes_per_message": 4}
},
)
HTML(render_messages(msgs, title="video-keyframes (max_keyframes=32, 512px)"))
video-summary¶
video-summary chooses a compact set of representative frames for a higher-level skim. Use it when you want broad coverage of a clip without exhaustively sending every sampled segment.
with mm.Context() as ctx:
ctx.add(DATA / "bakery.mp4")
msgs = ctx.to_messages(
encoders={"video": "video-summary"},
encoder_kwargs={"video": {"num_frames": 16, "max_width": 512}},
)
HTML(render_messages(msgs, title="video-summary (num_frames=16, 512px)"))
4. Audio¶
Two audio encoders:
audio-base64(default) — sends the raw audio waveform as a base64-encodedinput_audiopart. The model hears the actual audio.audio-transcribe— runs transcription and sends the text transcript. Useful when the model doesn't support native audio input.
The audio-transcribe encoder supports pluggable backends via encoder_kwargs:
| Backend | When to use | encoder_kwargs |
|---|---|---|
mlx |
Apple Silicon (fastest, auto-detected) | {"audio": {"backend": "mlx"}} |
ctranslate2 |
CPU/CUDA (auto-detected) | {"audio": {"backend": "ctranslate2"}} |
openai |
Remote API (Ollama, OpenAI, vLLM) | {"audio": {"backend": "openai", "base_url": "http://localhost:11434/v1"}} |
Auto-detection picks the best local backend by default. Use mm.common.audio.list_backends() to see what's available.
with mm.Context() as ctx:
ctx.add(DATA / "how_to_build_an_mvp.mp3")
msgs = ctx.to_messages()
HTML(render_messages(msgs, title="audio-base64 (default)"))
with mm.Context() as ctx:
ctx.add(DATA / "how_to_build_an_mvp.mp3")
msgs = ctx.to_messages(
encoders={"audio": "audio-transcribe"},
encoder_kwargs={"audio": {"whisper_model": "tiny"}},
)
HTML(render_messages(msgs, title="audio-transcribe (whisper tiny, mlx backend)"))
5. Multi-item Context¶
A single Context with mixed media — this is what a real agent prompt looks like.
Use to_messages() with custom encoders and render_messages() for the rich view.
with mm.Context() as ctx:
ctx.add(DATA / "car.jpg")
ctx.add(DATA / "invoice.pdf")
ctx.add(DATA / "bakery.mp4")
ctx.add(DATA / "how_to_build_an_mvp.mp3")
msgs = ctx.to_messages(
encoder_kwargs={
"video": {"tile_cols": 4, "tile_rows": 4, "num_mosaics": 4},
"document": {"pages_per_message": 32},
},
)
HTML(render_messages(msgs, title="mixed context (image + pdf + video + audio)"))