mm — User Guide¶
Fast, multimodal context for agents. Rust core for speed. Python for developer experience. Unix philosophy for composability.
Prerequisites¶
Installation¶
pip install mm-ctx
# with uv
uv pip install mm-ctx
# or run directly without installing
uvx --from mm-ctx mm --help
VLM access¶
mm requires access to a VLM on a live server for accurate-mode (LLM-powered) operations. Recommended models:
| Provider | Models |
|---|---|
| Qwen | qwen3vl-2b\|4b\|8b\|32b, qwen3.5:2b\|9b\|27b |
| Gemini | gemini-2.5-flash-lite, gemini-3.1-flash-lite-preview |
Profile setup¶
mm uses profiles to store provider credentials. There are 3 reserved profiles: ollama, gemini, and vlmrun.
You can populate a reserved profile or create a new one:
# Use an existing reserved profile
mm profile update ollama --base-url http://localhost:11434/v1 --model qwen3vl-8b
# Or create a custom profile
mm profile add fermi \
--base-url https://openrouter.ai/api/v1 \
--api-key "your-openrouter-api-key" \
--model google/gemini-2.5-flash-lite
# Set the active profile
mm profile use fermi
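After these commands the profile is persisted to mm's config file (~/.config/mm/mm.toml, referenced later in this guide). A hypothetical sketch of the stored entry; the key names are assumptions, and only the base URL / API key / model values come from the commands above:
```bash
# Hypothetical layout of ~/.config/mm/mm.toml -- the schema is an assumption.
cat ~/.config/mm/mm.toml
# [profiles.fermi]
# base_url = "https://openrouter.ai/api/v1"
# api_key  = "your-openrouter-api-key"
# model    = "google/gemini-2.5-flash-lite"
```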
Integrations¶
Claude Code¶
Install the mm-cli-skill via the skill marketplace:
claude
> /plugin marketplace add vlm-run/skills
> /plugin install mm-cli-skill@vlm-run/skills
> Organize my ~/Downloads folder using mm
npx skills¶
Install mm-cli-skill globally so any CLI assistant or agentic tool can discover it:
npx skills add vlm-run/skills@mm-cli-skill
Universal assistants (OpenClaw, NemoClaw, OpenCode, Codex, Gemini CLI)¶
Install the mm-cli-skill globally first, then start your preferred tool:
# One-time setup
npx skills add vlm-run/skills@mm-cli-skill
# Then use any CLI assistant — it will discover mm automatically
openclaw "Organize my ~/Downloads folder using mm"
codex "Find all PDFs in ~/docs and summarize them with mm"
The skill exposes mm's capabilities to any tool that supports the skills protocol.
Use cases¶
Use mm directly or through a CLI assistant (e.g. claude "Organize ~/Downloads using mm").
Directory actions¶
- Organize files by type, date, or inferred content
- Flatten or restructure nested folders based on content signals
- Cluster photos by person, location, or event
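These actions are typically driven through a CLI assistant. Illustrative prompts, following the same pattern as the integration examples above:
```bash
claude "Cluster the photos in ~/photos by event using mm"
openclaw "Flatten ~/archive into folders by inferred content using mm"
```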
Semantic search¶
mm grep "photo of me and my dog in a park" ~/photos -s
mm grep "revenue forecast" ~/reports -s --kind document
mm grep "architecture overview" ~/docs -s --pre-index # auto-index unindexed files
Returns matching files via vector similarity (embeddings). Use --semantic/-s for semantic search, --pre-index to auto-index unindexed files before searching.
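Because mm subcommands compose over pipes (see the batch examples below), semantic matches can feed straight into extraction. A sketch assuming mm grep, like mm find, prints one matching path per line:
```bash
# Assumption: `mm grep` emits one matching path per line, like `mm find`.
mm grep "revenue forecast" ~/reports -s --kind document | mm cat -m accurate --format json
```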
Raw file metadata¶
# mm peek: dimensions / EXIF / codec / duration / mime / hash.
mm peek report.pdf # mime, content hash
mm peek image.jpg # image dimensions, MIME, hash, EXIF
mm peek video.mp4 # video resolution, duration, codecs (<100ms)
mm peek image.jpg video.mp4 --format json # multi-file JSON
mm peek report.pdf --full # opt in to document author / title / subject / keywords / page count
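For orientation, a hypothetical JSON result; the field names are assumptions inferred from the metadata listed above, not a documented schema:
```bash
mm peek image.jpg --format json
# Illustrative output shape (field names assumed):
# {"path": "image.jpg", "mime": "image/jpeg", "hash": "…", "width": 4032, "height": 3024, "exif": {…}}
```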
File inspection and extraction¶
# mm cat: content extraction. Default --mode fast.
mm cat report.pdf # PDF page-text via pypdfium2 (fast pipeline)
mm cat src/main.py # passthrough text
mm cat notes.docx # libreoffice-rs text
mm cat image.jpg # short VLM caption (fast pipeline)
mm cat video.mp4 # mosaic → short VLM description (fast pipeline)
# --mode accurate: LLM-heavy pipeline for image/video/audio/PDF (requires a configured profile).
mm cat image.jpg -m accurate # LLM-powered caption + tags + objects
mm cat video.mp4 -m accurate # keyframe mosaic → LLM description
mm cat audio.mp3 -m accurate # transcript → LLM summary
mm cat report.pdf -m accurate # text → LLM markdown structuring
kind=text and non-PDF documents (.docx / .pptx) ignore --mode entirely: they always return passthrough text.
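For example, both of these return the same passthrough text:
```bash
mm cat notes.docx              # passthrough text
mm cat notes.docx -m accurate  # identical output: --mode is ignored for non-PDF documents
```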
Pipeline customization¶
-p, --encode.*, --generate.*, --prompt, --model apply to image / video / audio / PDF (the kinds that run a pipeline); they are no-ops for kind=text and non-PDF documents.
mm cat photo.png -m fast -p image-tile # named encoder
mm cat photo.png -m accurate -p my-pipeline.yaml # custom pipeline YAML
mm cat photo.png -m accurate --encode.strategy image-tile # override encoder
mm cat photo.png -m accurate --encode.strategy_opts max_width=768
mm cat photo.png -m accurate --prompt "List 3 objects." # = --generate.prompt
mm cat photo.png -m accurate --model qwen3.5-0.8b # = --generate.model
mm cat photo.png -m accurate --generate.max-tokens 1024
mm cat photo.png -m accurate --no-cache --no-generate # encoder-only snapshot
mm cat --list-encoders # registered encoders
mm cat --list-pipelines # built-in pipelines
mm cat --print-pipeline image/accurate # built-in pipeline source
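As a reference point, a custom pipeline file like the my-pipeline.yaml above might be laid out as follows. The encode:/generate: sections mirror the CLI's override flags, but the exact YAML schema is an assumption; check mm cat --print-pipeline image/accurate for the real structure:
```bash
# Hypothetical my-pipeline.yaml -- keys mirror the documented --encode.* /
# --generate.* flags; the exact schema is an assumption.
cat > my-pipeline.yaml <<'EOF'
encode:
  strategy: image-tile          # cf. --encode.strategy image-tile
  strategy_opts:
    max_width: 768              # cf. --encode.strategy_opts max_width=768
generate:
  model: qwen3.5-0.8b           # cf. --model / --generate.model
  prompt: "List 3 objects."     # cf. --prompt / --generate.prompt
  max_tokens: 1024
  temperature: 0.2
EOF
mm cat photo.png -m accurate -p my-pipeline.yaml
```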
Override surfaces¶
Every mm cat invocation resolves its LLM call from three layers, with CLI > pipeline YAML > profile precedence on conflict:
- Profile (`mm.toml`) — owns `base_url`, `api_key`, default `model`. Switch profiles globally per-call with `mm --profile <name> <subcommand>`.
- Pipeline YAML (`generate:` block) — `model`, `prompt`, `max_tokens`, `temperature`, `json_mode`, `extra_body`. Each pipeline can pin provider-specific defaults so a single command stays terse.
- CLI flags on `cat` — per-field, per-invocation overrides:
| Flag | Alias | Pipeline field |
|---|---|---|
| --model NAME | --generate.model | generate.model |
| --prompt TEXT | --generate.prompt | generate.prompt |
| --generate.max-tokens N | — | generate.max_tokens |
| --generate.temperature F | — | generate.temperature |
| --generate.json-mode BOOL | — | generate.json_mode |
| --generate.extra-body '<json>' | — | generate.extra_body (deep-merged) |
base_url and api_key are profile-only — there is no CLI override for them.
The merged model + extra_body participate in the L2 cache key, so changing
a knob correctly invalidates cached results.
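For instance (illustrative values): if a pipeline's YAML pins extra_body to {"method": "ocr", "method_params": {"lang": "en"}} and the CLI passes --generate.extra-body '{"method_params":{"lang":"ch"}}', the resolved body is {"method": "ocr", "method_params": {"lang": "ch"}}: nested keys merge rather than replace, and the CLI value wins per the precedence above.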
Using these flags to drive an arbitrary OpenAI-compatible deployment (e.g. vlmrt, where each model dispatches by extra_body.method):
# Florence-2 — document OCR
mm --profile vlmrt cat page.png -m accurate \
--model florence-2-base-ft \
--generate.extra-body '{"method":"ocr"}'
# Qwen3.5-0.8B — free-form video summarisation, custom frame sampling
mm --profile vlmrt cat clip.mp4 -m accurate \
--model qwen3.5-0.8b \
--generate.extra-body '{"video_fps":1.0,"video_max_frames":8}'
# PaddleOCR-v5 — Chinese scene-text OCR with tighter score threshold
mm --profile vlmrt cat storefront.jpg -m accurate \
--model paddleocr-v5 \
--generate.extra-body '{"method":"ocr","method_params":{"lang":"ch","score_threshold":0.6}}'
# Moondream2 — multi-object detection with a custom prompt
mm --profile vlmrt cat photo.jpg -m accurate \
--model moondream2 \
--prompt "List every visible animal." \
--generate.extra-body '{"method":"detect","method_params":{"object":"fish"}}'
Batch operations¶
mm wc ~/docs # file count, bytes, lines, token estimate
mm find ~/videos # list with tags, duration, resolution
mm cat -m accurate video.mp4 # full context: transcript + scenes
mm find ~/images --kind image | mm cat -m accurate --format json # batch captioning
mm find ~/images --kind image | mm peek --format json # batch metadata (no LLM)
mm cat -y ~/images/*.jpg # skip ≥9-path confirmation prompt
Agentic integration¶
Use mm directly as a tool or as a skill for any coding assistant:
- "Find all invoices in ~/Downloads and create a markdown table with totals"
- "Clip the first scene from video.mp4"
- "Extract all faces from ~/events/wedding"
Auto-labeling¶
Use mm as a labeling CLI for VLMs:
- Select provider with `--profile` or the `MM_PROFILE` env var for any OpenAI-compatible endpoint
- `--format dataset-jsonl` — outputs image (base64) + completion pairs for fine-tuning (OpenAI/Fireworks format)
- `--format dataset-hf` — builds HuggingFace datasets from input directories (requires `--output-dir`)
Pipeline: unlabeled media → mm cat -m accurate --format dataset-jsonl → fine-tuning
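Concretely, wiring those stages together (paths are illustrative, and the exact dataset-hf invocation shape is an assumption):
```bash
# Unlabeled images -> JSONL of image (base64) + completion pairs
mm find ~/raw-images --kind image | mm cat -m accurate --format dataset-jsonl > train.jsonl
# HuggingFace dataset built from input files (requires --output-dir)
mm cat -y ~/raw-images/*.jpg -m accurate --format dataset-hf --output-dir ./hf-dataset
```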
Other examples¶
- Photo organization by topic or year, with highlighted selections from events
- "Create a markdown file
260410.mdwith a table of all invoices in ~/Downloads/invoices, including totals in USD"
Benchmark¶
Run the built-in benchmark suite:
mm bench ~/data/mmbench-tiny
A standalone benchmark script is also available at ./benchmarks/bench_cli.sh. It downloads public multimodal test data before running.
Data sources:
- https://storage.googleapis.com/vlm-data-public-prod/mmbench/mmbench-mini.tar.gz
- https://storage.googleapis.com/vlm-data-public-prod/mmbench/mmbench-tiny.tar.gz
Custom benchmark suites: --bench-file¶
For internal matrices, point mm bench at a Python file that exposes
BenchCommand entries:
mm bench ~/data/mmbench-tiny --bench-file benchmarks/vlmgw_bench_commands.py
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py -r 1 -w 0
A benchfile must define one of:
from mm.commands.bench_commands import BenchCommand
# (a) static list
COMMANDS: list[BenchCommand] = [...]
# (b) file-aware factory (preferred when commands depend on what's on disk)
def commands(files) -> list[BenchCommand]: ...
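For instance, a minimal factory-style benchfile might look like this. name, group, and tags are attributes this guide documents; the constructor's command field name and the <img> placeholder handling are assumptions (see benchmarks/vlmgw_bench_commands.py for a real example):
```bash
# Hypothetical benchfile; BenchCommand's exact constructor is an assumption.
cat > my_bench_commands.py <<'EOF'
from mm.commands.bench_commands import BenchCommand

def commands(files) -> list[BenchCommand]:
    # `files` lets a factory adapt to what's on disk; unused in this sketch.
    return [
        BenchCommand(
            name="qwen/image-caption",    # stable variant id, used by --command
            group="model",                # Group column / --group filter
            tags={"model": "qwen/qwen3.5-0.8b", "task": "cap"},
            command="mm cat <img> -m accurate --model qwen3.5-0.8b",  # field name assumed
        ),
    ]
EOF
mm bench ~/data/mmbench-tiny --bench-file my_bench_commands.py
```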
The factory takes precedence when both are present. The loaded set
fully replaces the built-in overhead + metadata + <mode> matrix —
--mode is ignored when --bench-file is set, and the benchfile's
own BenchCommand.group drives display grouping. --command
substring filtering and --format rendering still apply on top, so
you can scope a slow benchfile to a single row with
--command image-resolution or pipe its JSON to a custom renderer.
A worked example covering all mm cat override surfaces (model alias,
prompt overrides, --generate.extra-body deep-merge, video frame
sampling, cache cold/warm) lives in benchmarks/vlmgw_bench_commands.py.
Table layout: Group | Model | Base Command | Extra Args | <metrics>¶
The bench renderer surfaces up to four structured columns before the metrics — the eye lands on the variation between rows rather than on the boilerplate that's stable across them:
- `Group` — the `BenchCommand.group` bucket (e.g. `model`, `image-res`, `cache`, `404`).
- `Model` — pulled from `BenchCommand.tags["model"]`. Conventionally follows the `<org>/<model-name>` format (e.g. `microsoft/florence-2-base-ft`, `vikhyatk/moondream2`, `qwen/qwen3.5-0.8b`) so the column is unambiguous across providers. The column is only shown when at least one row declares a `model` tag, so the default suite stays compact.
- `Base Command` — the stable part of the resolved invocation: `mm cat <img> --mode fast --no-cache --format json` (or `mm find <dir> --format json` for the find rows). `--profile` is stripped (it's constant across every row of a given benchfile run) and `--model` is stripped (already in its own column); file paths are substituted with kind-based placeholders (`<img>` / `<vid>` / `<doc>` / `<aud>` / `<code>` / `<dir>`) so paths don't drown out the actual command shape.
- `Extra Args` — the variant-specific knobs: `--prompt …`, `--generate.…`, `--encode.…`. This column is only rendered when at least one row has extras, so the default suite collapses to `Group | Base Command | <metrics>`.
BenchCommand.tags can declare additional keys for downstream JSON
consumers; only the model tag drives a dedicated rich-table
column. High-cardinality / long-string metadata (full prompts, JSON
blobs) lives in Extra Args rather than in tag columns — the
column-per-tag mechanism is reserved for short, comparable
identifiers like model and provider.
Bench recording: benchmarks/results/<YYMMDD>-mm-bench-<profile>-<HHMM>.md¶
Every non-dry-run mm bench invocation also writes a per-row
markdown snapshot under benchmarks/results/. The plural
benchmarks/ directory continues to host author-curated benchfiles
(vlmgw_bench_commands.py, ad-hoc shell helpers, …); the nested
results/ subfolder is reserved for generated recordings so a
rm -rf benchmarks/results/ only nukes auto-produced artefacts and
never the curated inputs. The file contains, for each benchmarked
command, the exact same single-row Rich table the live bench
rendered — followed by a fenced block holding that row's captured
stdout. This keeps a portable, diffable record of what was measured
and what each command actually returned, which is particularly
useful when chasing regressions across gateway versions or comparing
two benchfiles side-by-side.
Path derivation:
- `benchmarks/results/<YYMMDD>-mm-bench-<profile>-<HHMM>.md` (relative to the current working directory; `benchmarks/results/` is created if missing).
- `<profile>` is the active mm profile name (the same one that drives `mm cat`'s default base URL / model — see the `--profile` flag and `~/.config/mm/mm.toml`). When no profile is configured the stem falls back to `default`.
- `<HHMM>` is the 24-hour wall-clock time the run started. Two runs in the same minute will collide, but iterating on a benchfile every few minutes preserves history without manual `git mv`s.
- The recording is keyed on the deployment being measured, not on the benchfile that wrote the matrix: re-running the same benchfile against a different profile writes to a different file (different `<profile>` slot), and two benchfiles aimed at the same profile produce sibling files distinguished only by `<HHMM>`.
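For example (illustrative values): a run started at 14:32 on 2026-04-10 with the fermi profile active would be recorded to benchmarks/results/260410-mm-bench-fermi-1432.md.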
Per-row layout:
╭──────┬──────────────────┬──────────────────────────────┬───────┬───────╮
│ ... │ Model │ Base Command │ Mean │ ... │
├──────┼──────────────────┼──────────────────────────────┼───────┼───────┤
│ ... │ qwen/qwen3.5-… │ mm cat <img> --mode fast … │ 2.91s │ ... │
╰──────┴──────────────────┴──────────────────────────────┴───────┴───────╯
args: {"img": "1-vqa-car.jpg", "mode": "fast"}
```json
{...captured stdout...}
```
- The Rich table is emitted as raw markdown content (no ` ```text `
wrapping fence) so renderers display its box-drawing characters
directly — matching the live `mm bench` view.
- `args:` carries a JSON one-liner with the resolved data inputs
(under the kind alias `img` / `vid` / `aud` / `doc` / `code`) and
`mode` when the row uses `--mode <X>`. This restores the actual
basenames the `<img>` / `<vid>` placeholders in the Base Command
column collapse away. Omitted entirely when there's nothing to
surface (e.g. directory-level `mm find` rows).
- The captured stdout is wrapped in a fenced block — `json` when it
starts with `{` / `[`, else `text`. ANSI-stripped; absolute paths
from the resolved argv are rewritten to basenames so the markdown
stays portable across machines.
- The footer line `<elapsed> • <bytes> • <bytes/s>` reports the
*last* timed round (the one that produced the captured stdout),
complementing the aggregate stats already shown in the row table.
- Skipped rows render their snapshot table with `skipped: <reason>`
and a `text` block carrying the same reason; no footer.
- Non-zero exits produce a `text` block prefixed with `[exit N]`
followed by the last 5 stderr lines, plus the standard footer.
The header carries the resolved invocation (rounds, warmup, file
count + size, total wall), the host one-liner (hostname / CPU / OS
/ Python / mm version), and the active profile.
The path is logged to stderr (`Wrote recording to …`) regardless of
`--format`, so you always see where the snapshot landed.
Skipped when:
- `--dry-run` (nothing measured to record).
- `--host-info` (it's not a measurement run).
- `--format stdout` (snapshot mode has its own destination via stdout
redirection, e.g. `tests/stdout/cat.md`).
Filtering: --group, --model, --task, --command¶
Four independent filters compose via AND, so you can scope a run to
exactly the rows you want:
- `--group/-g GROUP` — exact match (case-insensitive) against
`BenchCommand.group`. E.g. `--group cache` keeps only the cache rows.
- `--model MODEL` — exact match (case-insensitive) against
`BenchCommand.tags["model"]`. Cuts across groups, e.g.
`--model qwen/qwen3.5-0.8b` keeps every row pinned to qwen regardless
of which bucket it lives in (model / image-res / video-frames /
cache, etc.).
- `--task TASK` — exact match (case-insensitive) against
`BenchCommand.tags["task"]`. Conventional taxonomy: `cap` (caption),
`ocr`, `det` (detection), `seg` (segmentation), `llm` (text-only
generation), `pose`, `track`, `noop` (gateway round-trip cost).
Cuts across groups *and* models — `--task ocr` returns every OCR
variant in the matrix regardless of which provider serves it,
which is the natural slice when comparing capabilities across
deployments. Rows without a `task` tag (e.g. `404/*` and
`validation/*` infrastructure tests) are intentionally invisible
to this filter.
- `--command/-c TERM` — substring filter on `BenchCommand.name`. The
name is a stable variant identifier (e.g. `florence2/caption`,
`qwen/image-512`) defined by the benchfile; it's used purely for
filtering and JSON output and isn't shown as a column in the rich
table (the resolved `Command` cell is more informative).
```bash
# Just the model variants
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --group model
# Every row using qwen/qwen3.5-0.8b across all groups
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --model qwen/qwen3.5-0.8b
# Every OCR row across all providers (florence2, dots-ocr, paddleocr)
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --task ocr
# Just the qwen captioning rows (model AND task compose via AND)
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --task cap --model qwen/qwen3.5-0.8b
# Just the SAM3 rows in the model group
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py -g model --model facebook/sam3
```
Inspecting a plan: --dry-run¶
--dry-run resolves the benchmark plan — directory pre-scan, file
selection, placeholder substitution — without invoking any
subprocess. Every row renders with - placeholders in the rich/tsv
table and "dry_run": true in JSON, with the resolved shell command
in argv for inspection:
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --dry-run
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --dry-run --format json
Useful for verifying a new benchfile before committing to a long run, or for snapshotting the plan in CI without paying timing cost.
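A hedged sketch of what a dry-run row looks like in JSON; only "dry_run" and "argv" are named above, the surrounding fields are assumptions:
```bash
mm bench ~/data/mmbench-tiny -b benchmarks/vlmgw_bench_commands.py --dry-run --format json
# Illustrative row shape (only "dry_run" and "argv" are documented):
# {"name": "qwen/image-caption", "dry_run": true, "argv": ["mm", "cat", "1-vqa-car.jpg", "--mode", "accurate"], …}
```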