Using mm with Ollama and Local Models¶
A walkthrough of mm — fast, multimodal context for agents focused on getting text-based context out of images, videos, and audio so you can pipe them into any LLM workflow.
This notebook covers:
- Setting up a local VLM (Gemma 4 via Ollama) and confirming it works end-to-end
- Installing
mmand pointing it at the local VLM - The core commands:
mm find,mm wc,mm cat,mm grep - A few ways to compose Gemma 4 with
mm-style prompting for your own pipelines
Dataset: we've provided a public dataset comprising a mixture of multimodal files - image, video and pdf. You can swap in your own files - any document will work and the commands would all still run.
Make sure the runtime is set to GPU (T4) — Runtime → Change runtime type.
0. GPU check¶
mm cat -m accurate sends images to a local VLM. On CPU that round-trip takes minutes per image instead of seconds, so verify the runtime has a GPU before going further.
If no GPU is detected: Runtime → Change runtime type → T4 GPU, then re-run from the top.
import subprocess
try:
gpu = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
if gpu.returncode == 0:
print("✅ GPU detected — Ollama will use it automatically.\n")
print(gpu.stdout)
else:
print("⚠️ No GPU detected. You're on CPU — accurate-mode VLM calls will be SLOW.")
print(" Fix: Runtime → Change runtime type → T4 GPU, then re-run this notebook.")
except Exception as e:
print(f"Error occured: {e}")
1. Point at your files and pick a model¶
Set the paths to the image and video you just uploaded, and pick a Gemma 4 tag to self-host.
Gemma 4 (released April 2, 2026) is Google DeepMind's latest open multimodal family, built from the same research as Gemini 3, with native text + image input and variable aspect ratio / resolution support.
| Tag | Size on disk | Context | Fits on T4 (15 GB)? |
|---|---|---|---|
gemma4:e2b |
7.2 GB | 128K | ✅ yes (lightest) |
gemma4:e4b (alias gemma4:latest) |
9.6 GB | 128K | ✅ yes (default) |
gemma4:26b (MoE, 4B active) |
18 GB | 256K | ❌ no |
gemma4:31b (dense) |
20 GB | 256K | ❌ no |
On a Colab T4, gemma4:e4b is the sweet spot — best quality that still fits. Drop to gemma4:e2b if you hit OOM.
import os
import tarfile
import urllib.request
from pathlib import Path
# The exact tag string Ollama uses (must match the NAME column from `ollama list`).
# MODEL = "gemma4:e4b"
MODEL = "gemma4:e2b" # lighter fallback
# Preload sample files (image, video, audio, PDF) from the mmbench-tiny bundle.
# `mm` resizes images/videos internally using sensible defaults — no pre-processing
# needed. Override with `--encode.strategy_opts max_width=<val>` when you want more detail.
DATA_URL = "https://storage.googleapis.com/vlm-data-public-prod/mmbench/mmbench-tiny.tar.gz"
DATA_DIR = Path("~/.mm/notebooks/data").expanduser()
MMBENCH_DIR = DATA_DIR / "mmbench-tiny"
if MMBENCH_DIR.is_dir():
print(f"Data already present at {MMBENCH_DIR}")
else:
DATA_DIR.mkdir(parents=True, exist_ok=True)
tar_path = DATA_DIR / "mmbench-tiny.tar.gz"
print(f"Fetching {DATA_URL}")
urllib.request.urlretrieve(DATA_URL, tar_path)
with tarfile.open(tar_path) as tf:
# Skip macOS AppleDouble resource-fork shadows (`._*`) shipped in the tarball —
# they are tiny metadata stubs that break mm grep's indexer.
members = [m for m in tf.getmembers() if not Path(m.name).name.startswith("._")]
tf.extractall(DATA_DIR, members=members)
tar_path.unlink()
print(f"Extracted to {MMBENCH_DIR}")
os.chdir(DATA_DIR)
IMAGE_PATH = str(MMBENCH_DIR / "1-vqa-car.jpg")
VIDEO_PATH = str(MMBENCH_DIR / "bakery.mp4")
AUDIO_PATH = str(MMBENCH_DIR / "how_to_build_an_mvp.mp3")
PDF_PATH = str(MMBENCH_DIR / "BillDownload-8pg.pdf")
2. Preview the image and video¶
from IPython.display import Image, display
display(Image(IMAGE_PATH))
from IPython.display import Video, display
display(Video(VIDEO_PATH, embed=True))
3. Spin up Ollama and pull Gemma 4¶
mm's accurate-mode operations need a VLM on a live server. We'll self-host Gemma 4 with Ollama.
- Ollama v0.20.0+ is required for Gemma 4 (landed April 3, 2026)
zstdis needed to extract Ollama's tarball;pciutilssilences the GPU-detection warning from the installer- Colab has no systemd, so we start
ollama servewithnohupso it keeps running across cells
# ─── Install Ollama + start server (idempotent; safe to re-run) ─────────────
!dpkg -s zstd pciutils >/dev/null 2>&1 || apt-get install -y zstd pciutils
!which ollama >/dev/null 2>&1 || curl -fsSL https://ollama.com/install.sh | sh
!pgrep -x ollama >/dev/null 2>&1 || (nohup ollama serve > /tmp/ollama.log 2>&1 &)
!sleep 3
!ollama --version
# Pull Gemma 4 (~7.2 GB the first time; cached afterwards)
!ollama pull {MODEL}
!ollama list # confirm the NAME column matches MODEL exactly
4. Sanity check: Gemma 4 out of the box¶
Before plugging mm in, let's confirm Gemma 4 actually works on our image — just a plain VLM call against the Ollama server, image in, caption out. This is the simplest possible sanity check: if this works, mm's accurate mode will work too, because mm talks to the same endpoint.
# ─── Gemma 4 VQA → image + caption ──────────────────────────────────────────
# Ask Gemma an open-ended question about the image. Render the image on the
# left and the answer as a caption on the right.
import base64
import io
import requests
from IPython.display import HTML, display as ipy_display
from PIL import Image as _PILImage
OLLAMA_URL = "http://localhost:11434"
QUESTION = (
"Describe this image in 2-4 sentences. What's the setup, what's in it, "
"and what does it look like it was made for?"
)
_img = _PILImage.open(IMAGE_PATH).convert("RGB")
buf = io.BytesIO()
_img.save(buf, format="JPEG", quality=90)
img_b64 = base64.b64encode(buf.getvalue()).decode()
resp = requests.post(
f"{OLLAMA_URL}/api/generate",
json={
"model": MODEL,
"prompt": QUESTION,
"images": [img_b64],
"stream": False,
"options": {"temperature": 0.2, "num_predict": 512},
},
timeout=180,
).json()
answer = resp["response"].strip()
# ─── Render: image left, caption right ──────────────────────────────────────
img_src = f"data:image/jpeg;base64,{img_b64}"
html = f"""
<div style='display:flex; flex-wrap:wrap; gap:16px; align-items:flex-start'>
<div style='flex:2 1 520px; min-width:360px; text-align:center'>
<img src='{img_src}' style='max-width:100%; border-radius:6px'>
<div style='font-size:12px; color:#666; margin-top:6px'>Input image</div>
</div>
<div style='flex:1 1 320px; min-width:280px; max-width:480px;
background:#f7f7f9; border-radius:8px; padding:14px 18px;
font-family:-apple-system,BlinkMacSystemFont,sans-serif; font-size:14px;
line-height:1.5; color:#2e3138'>
<div style='font-size:11px; letter-spacing:0.06em; text-transform:uppercase;
color:#888; margin-bottom:6px'>Question</div>
<div style='margin-bottom:12px; font-style:italic'>{QUESTION}</div>
<div style='font-size:11px; letter-spacing:0.06em; text-transform:uppercase;
color:#888; margin-bottom:6px'>Gemma 4 ({MODEL})</div>
<div>{answer}</div>
</div>
</div>
"""
ipy_display(HTML(html))
5. Install mm¶
The official installer drops the binary in ~/.local/bin.
!pip install mm-ctx
# Verify install and version
!which mm && mm --version
6. Point mm at the local Ollama server¶
mm ships with three reserved profiles: ollama, gemini, and vlmrun. We update the ollama profile to point at our local server and the Gemma 4 model we just pulled, then activate it.
!mm profile update ollama --base-url http://localhost:11434/v1 --model {MODEL}
!mm profile use ollama
!mm profile list # active profile is marked with ●
7. mm find and mm wc — metadata, no VLM¶
These commands work purely on file metadata — no model call, no GPU use.
mm find— tabular listing: kind, size, extension, dimensionsmm wc— quick summary: file count, bytes, estimated lines/tokens
!mm cat {IMAGE_PATH} -m accurate --verbose --no-cache
!mm cat {IMAGE_PATH} -m accurate -p resize --verbose --no-cache
!mm cat --help
!mm cat {IMAGE_PATH} --encode.strategy resize --encode.strategy_opts max_width=800 -m accurate --verbose --no-cache
!mm cat --list-pipelines
# Tabular listing: kind, size, ext, dimensions, etc.
!mm find {IMAGE_PATH}
# Quick summary: file count, bytes, estimated lines/tokens
!mm wc {IMAGE_PATH}
8. mm cat on an image¶
mm cat extracts text context from a file. It has two modes:
-m fast— heuristic-only, no VLM call (quick metadata summary)-m accurate— sends the file to the configured VLM for rich description
Fast mode returns in milliseconds; accurate mode takes a few seconds per image on a T4.
# Fast mode: no VLM call, just metadata
!mm cat {IMAGE_PATH} -m fast --verbose --no-cache
# Accurate mode: sends the image to Gemma 4 via Ollama
!mm cat {IMAGE_PATH} -m accurate --verbose --no-cache
9. mm cat on a video¶
For videos, mm samples frames, builds a mosaic, and feeds it to the VLM. Same two modes as images.
# Fast mode: no VLM call
!mm cat {VIDEO_PATH} -m fast --verbose --no-cache
# Accurate mode: mosaic → Gemma 4
!mm cat {VIDEO_PATH} -m accurate --verbose --no-cache
10. mm grep — semantic search across a folder¶
mm grep runs a natural-language query against every file in a directory, using the active VLM profile. This is the piece that's hardest to replicate with plain grep or find: matching meaning rather than substrings.
!mm grep "invoice" {MMBENCH_DIR} -s --pre-index
!ls {MMBENCH_DIR}
!mm grep --help
# Run 1 — includes cold model load
!time mm cat {IMAGE_PATH} -m accurate --no-cache
# Run 2 — model is warm, pure inference
!time mm cat {IMAGE_PATH} -m accurate --no-cache
# Check if the model is currently loaded and how much VRAM it's using
!curl -s http://localhost:11434/api/ps | python -m json.tool
# ─── Benchmarking helper (with input dimensions) ────────────────────────────
import subprocess
import time
import os
from PIL import Image
def probe_dims(path):
"""Return (width, height, duration_s) for images/videos. Duration is None for images."""
ext = os.path.splitext(path)[1].lower()
if ext in (".jpg", ".jpeg", ".png", ".webp"):
with Image.open(path) as im:
return im.width, im.height, None
else:
probe = subprocess.run(
[
"ffprobe",
"-v",
"error",
"-select_streams",
"v:0",
"-show_entries",
"stream=width,height,duration",
"-of",
"default=noprint_wrappers=1:nokey=1",
path,
],
capture_output=True,
text=True,
)
lines = probe.stdout.strip().splitlines()
w, h = int(lines[0]), int(lines[1])
dur = float(lines[2]) if len(lines) > 2 else None
return w, h, dur
def benchmark_mm(path, label=None, model="gemma4:e4b"):
label = label or os.path.basename(path)
w, h, dur = probe_dims(path)
size_mb = os.path.getsize(path) / 1e6
rows = []
for mode in ("fast", "accurate"):
t0 = time.time()
proc = subprocess.run(
["mm", "cat", path, "-m", mode, "--no-cache"],
capture_output=True,
text=True,
)
wall = time.time() - t0
output = proc.stdout.strip()
n_chars = len(output)
est_tokens = n_chars / 4
tok_per_s = est_tokens / wall if wall > 0 and mode == "accurate" else None
rows.append(
{
"input": label,
"dims": f"{w}x{h}",
"duration_s": round(dur, 1) if dur is not None else None,
"pixels_M": round(w * h / 1e6, 2),
"size_MB": round(size_mb, 2),
"mode": mode,
"wall_s": round(wall, 2),
"est_tokens": round(est_tokens),
"tok_per_s": round(tok_per_s, 1) if tok_per_s else None,
}
)
print(
f" {mode:9s} → {wall:5.1f}s ({n_chars} chars out"
+ (f", ~{tok_per_s:.1f} tok/s)" if tok_per_s else ")")
)
return rows
# ─── Run the benchmark ──────────────────────────────────────────────────────
# IMAGE_PATH and VIDEO_PATH resolve to the mmbench-tiny files downloaded above.
inputs = [
(IMAGE_PATH, "car (image)"),
(VIDEO_PATH, "bakery (video)"),
]
all_rows = []
for path, label in inputs:
print(f"\n📊 {label} ({path})")
all_rows.extend(benchmark_mm(path, label))
# ─── Summary table ──────────────────────────────────────────────────────────
import pandas as pd
df = pd.DataFrame(all_rows)
# Input-level metadata (same across fast/accurate rows, so just take first)
meta = df.groupby("input", sort=False)[["dims", "duration_s", "pixels_M", "size_MB"]].first()
# Per-mode metrics, pivoted so each input is a row
perf = df.pivot(index="input", columns="mode", values=["wall_s", "est_tokens", "tok_per_s"])
perf.columns = [f"{metric}_{mode}" for metric, mode in perf.columns]
summary = meta.join(perf)
summary = summary.reindex([label for _, label in inputs]) # preserve input order
print(summary.to_string())
# ─── Accurate-mode throughput only ──────────────────────────────────────────
acc = df[df["mode"] == "accurate"][
["input", "dims", "duration_s", "pixels_M", "size_MB", "wall_s", "est_tokens", "tok_per_s"]
]
print(acc.to_string(index=False))
print(f"\nMedian accurate-mode throughput: {acc['tok_per_s'].median():.1f} tok/s")
print(f"Mean accurate-mode throughput: {acc['tok_per_s'].mean():.1f} tok/s")