`mm.notebook` — Visualizing Multimodal VLM Context¶

This notebook demonstrates how mm encodes images, PDFs, videos, and audio into VLM-ready message threads, and how render_messages() lets you inspect exactly what the model sees.

Sections:

Image + text — default image-resize encoder
PDF + text — rasterize and page-text encoders
Video encoders — video-mosaic, video-keyframes, and video-summary
Audio + text — Whisper transcript alongside the native player
Multi-item context — mixed media in a single prompt

This notebook stays focused on visualization and encoder output. For the role-aware Context API itself — Context.add(), mm.Ref, roles, inspection views, provider payloads, and Context.remove() — see mm-context-api.ipynb.

Dataset: ~/data/mmbench-tiny (image, video, PDF, audio).

In [ ]:

Copied!

from pathlib import Path

import mm
from mm.notebook import render_messages
from IPython.display import HTML

DATA = Path.home() / "data/mmbench-tiny"
from pathlib import Path

import mm
from mm.notebook import render_messages
from IPython.display import HTML

DATA = Path.home() / "data/mmbench-tiny"

1. Image + Text¶

Images use the default image-resize encoder unless you override it. This keeps the example simple: the image is resized to a VLM-friendly size and sent as one image part.

Rich context view¶

Shows the native image with resolution, metadata, role, ref id, and collapsible VLM encoding.

In [ ]:

Copied!

!ls -lha ~/data/mmbench-tiny/
!ls -lha ~/data/mmbench-tiny/

In [ ]:

Copied!

with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg", role="user", metadata={"note": "VW beetle"})
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="image (default)"))
with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg", role="user", metadata={"note": "VW beetle"})
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="image (default)"))

Default `image-resize` encoding¶

Calling to_messages() without an encoder override uses image-resize. This is the default path most prompts should start with.

In [ ]:

Copied!

with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="image-resize (default)"))
with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="image-resize (default)"))

2. PDF + Text¶

PDFs can be sent visually or textually depending on the task:

rasterize renders each page as an image, preserving layout and visual details.
page-text extracts text only, which is cheaper and easier to inspect when layout is not important.

Rich context view¶

In [ ]:

Copied!

with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf", metadata={"type": "invoice", "vendor": "ACME Corp"})
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="document (default)"))
with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf", metadata={"type": "invoice", "vendor": "ACME Corp"})
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="document (default)"))

`rasterize`¶

Use rasterize when the visual layout matters: tables, stamps, signatures, formatting, or anything that plain text extraction might lose. When multiple multi-page documents are added as context, we can configure the number of pages per message.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "attention-is-all-you-need.pdf")
    msgs = ctx.to_messages(
        encoders={"document": "rasterize"},
        encoder_kwargs={"document": {"pages_per_message": 8}},
    )

HTML(render_messages(msgs, title="rasterize (pages_per_message=8)"))
with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "attention-is-all-you-need.pdf")
    msgs = ctx.to_messages(
        encoders={"document": "rasterize"},
        encoder_kwargs={"document": {"pages_per_message": 8}},
    )

HTML(render_messages(msgs, title="rasterize (pages_per_message=8)"))

`page-text`¶

Use page-text when you only need the extracted text. It avoids image payloads and makes the VLM input easier to audit for text-heavy documents.

Note: Below is a classic case of an invoice document that is better parsed as a rasterized image. In this case, the invoice being read via the PDF reader returns an empty string.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "attention-is-all-you-need.pdf")
    msgs = ctx.to_messages(
        encoders={"document": "page-text"},
        encoder_kwargs={"document": {"pages_per_message": 32}},
    )

HTML(render_messages(msgs, title="page-text (pages_per_message=32)"))
with mm.Context() as ctx:
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "attention-is-all-you-need.pdf")
    msgs = ctx.to_messages(
        encoders={"document": "page-text"},
        encoder_kwargs={"document": {"pages_per_message": 32}},
    )

HTML(render_messages(msgs, title="page-text (pages_per_message=32)"))

3. Video Encoders¶

Video is where encoder choice matters most. This section shows three useful strategies on the same clip:

video-mosaic for a dense visual overview (default).
video-keyframes for low-cost representative frames from the bitstream.
video-summary for an adaptive compact summary of longer clips.

Rich context view (native video player + default encoding)¶

In [ ]:

Copied!

with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="video (default)"))
with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="video (default)"))

`video-mosaic`¶

video-mosaic samples frames and packs them into tiled grids. It is a compact way to see scene coverage, visual progression, and repeated motifs without sending many separate image parts.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-mosaic"},
        encoder_kwargs={
            "video": {"tile_cols": 3, "tile_rows": 3, "thumb_width": 320, "num_mosaics": 4}
        },
    )

HTML(render_messages(msgs, title="video-mosaic (3x3 grid, 4 mosaics)"))
with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-mosaic"},
        encoder_kwargs={
            "video": {"tile_cols": 3, "tile_rows": 3, "thumb_width": 320, "num_mosaics": 4}
        },
    )

HTML(render_messages(msgs, title="video-mosaic (3x3 grid, 4 mosaics)"))

`video-keyframes`¶

video-keyframes pulls I-frames directly from the video stream. It is cheap and fast because it uses frames the codec already marks as important, making it a good first-pass overview.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-keyframes"},
        encoder_kwargs={
            "video": {"max_keyframes": 32, "max_width": 512, "max_keyframes_per_message": 4}
        },
    )

HTML(render_messages(msgs, title="video-keyframes (max_keyframes=32, 512px)"))
with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-keyframes"},
        encoder_kwargs={
            "video": {"max_keyframes": 32, "max_width": 512, "max_keyframes_per_message": 4}
        },
    )

HTML(render_messages(msgs, title="video-keyframes (max_keyframes=32, 512px)"))

`video-summary`¶

video-summary chooses a compact set of representative frames for a higher-level skim. Use it when you want broad coverage of a clip without exhaustively sending every sampled segment.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-summary"},
        encoder_kwargs={"video": {"num_frames": 16, "max_width": 512}},
    )

HTML(render_messages(msgs, title="video-summary (num_frames=16, 512px)"))
with mm.Context() as ctx:
    ctx.add(DATA / "bakery.mp4")
    msgs = ctx.to_messages(
        encoders={"video": "video-summary"},
        encoder_kwargs={"video": {"num_frames": 16, "max_width": 512}},
    )

HTML(render_messages(msgs, title="video-summary (num_frames=16, 512px)"))

4. Audio¶

Two audio encoders:

audio-base64 (default) — sends the raw audio waveform as a base64-encoded input_audio part. The model hears the actual audio.
audio-transcribe — runs transcription and sends the text transcript. Useful when the model doesn't support native audio input.

The audio-transcribe encoder supports pluggable backends via encoder_kwargs:

Backend	When to use	`encoder_kwargs`
`mlx`	Apple Silicon (fastest, auto-detected)	`{"audio": {"backend": "mlx"}}`
`ctranslate2`	CPU/CUDA (auto-detected)	`{"audio": {"backend": "ctranslate2"}}`
`openai`	Remote API (Ollama, OpenAI, vLLM)	`{"audio": {"backend": "openai", "base_url": "http://localhost:11434/v1"}}`

Auto-detection picks the best local backend by default. Use mm.common.audio.list_backends() to see what's available.

In [ ]:

Copied!

with mm.Context() as ctx:
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="audio-base64 (default)"))
with mm.Context() as ctx:
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages()

HTML(render_messages(msgs, title="audio-base64 (default)"))

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages(
        encoders={"audio": "audio-transcribe"},
        encoder_kwargs={"audio": {"whisper_model": "tiny"}},
    )

HTML(render_messages(msgs, title="audio-transcribe (whisper tiny, mlx backend)"))
with mm.Context() as ctx:
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages(
        encoders={"audio": "audio-transcribe"},
        encoder_kwargs={"audio": {"whisper_model": "tiny"}},
    )

HTML(render_messages(msgs, title="audio-transcribe (whisper tiny, mlx backend)"))

5. Multi-item Context¶

A single Context with mixed media — this is what a real agent prompt looks like. Use to_messages() with custom encoders and render_messages() for the rich view.

In [ ]:

Copied!





with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg")
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "bakery.mp4")
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages(
        encoder_kwargs={
            "video": {"tile_cols": 4, "tile_rows": 4, "num_mosaics": 4},
            "document": {"pages_per_message": 32},
        },
    )

HTML(render_messages(msgs, title="mixed context (image + pdf + video + audio)"))
with mm.Context() as ctx:
    ctx.add(DATA / "car.jpg")
    ctx.add(DATA / "invoice.pdf")
    ctx.add(DATA / "bakery.mp4")
    ctx.add(DATA / "how_to_build_an_mvp.mp3")
    msgs = ctx.to_messages(
        encoder_kwargs={
            "video": {"tile_cols": 4, "tile_rows": 4, "num_mosaics": 4},
            "document": {"pages_per_message": 32},
        },
    )

HTML(render_messages(msgs, title="mixed context (image + pdf + video + audio)"))

mm.notebook — Visualizing Multimodal VLM Context¶

1. Image + Text¶

Rich context view¶

Default image-resize encoding¶

2. PDF + Text¶

Rich context view¶

rasterize¶

page-text¶

3. Video Encoders¶

Rich context view (native video player + default encoding)¶

video-mosaic¶

video-keyframes¶

video-summary¶

4. Audio¶

5. Multi-item Context¶

`mm.notebook` — Visualizing Multimodal VLM Context¶

Default `image-resize` encoding¶

`rasterize`¶

`page-text`¶

`video-mosaic`¶

`video-keyframes`¶

`video-summary`¶