# mm cat
Unified content extraction. Behaviour driven by file type × mode × pipeline.
## Input

- Multimodal: auto-detects kind from extension → image, video, audio, document, text
- Multi-file: `mm cat a.jpg b.pdf c.mp4` — processes each sequentially
- Large batches: if the path count is ≥ 9 (i.e. more than 8 files; override with `MM_CAT_BATCH_CONFIRM_THRESHOLD`), `cat` asks for confirmation in a TTY; in non-interactive use it exits with an error unless you pass `--yes`/`-y`
- Stdin: `find . -name '*.pdf' | mm cat` — reads newline-delimited paths from stdin
- Head/tail: `-n 20` (first 20 lines), `-n -20` (last 20 lines)
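The `-n` flag follows Python-style slice semantics. A minimal sketch of the head/tail behaviour (illustrative only, not the actual implementation):

```python
def head_tail(lines: list[str], n: int) -> list[str]:
    """Mimic `-n 20` (first n lines) and `-n -20` (last n lines)."""
    return lines[:n] if n >= 0 else lines[n:]

lines = [f"line {i}" for i in range(100)]
assert head_tail(lines, 20) == lines[:20]    # -n 20
assert head_tail(lines, -20) == lines[-20:]  # -n -20
```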
## Modes

`--mode fast` (default) — runs the kind's fast pipeline. Whether an LLM is involved depends on the pipeline's generate step (`image/fast.yaml` has one, `document/fast.yaml` does not).

`--mode accurate` — runs the kind's accurate pipeline; always LLM-heavy.

Both fast and accurate read from the metadata tier as their input — the locally-extracted content cached in `files.text_preview`. That tier never invokes an LLM and is reusable across both fast and accurate extractions of the same file.

`kind=text` and non-PDF documents (`.docx`/`.pptx`) ignore `--mode` entirely: they always return passthrough text and, on first sight, write FK-orphan chunks plus concurrent embeddings, with no `extractions` row.
## Image

| | fast (default) | accurate |
|---|---|---|
| Encoder | resize (max 512px) | resize (max 1024px) |
| Output | 10-word description + 5 tags | ~200-word description + 10 tags + 10 objects |
| Tokens | 256 max | 2048 max |
## Video

Multi-file: `mm cat a.mp4 b.mp4 -y` runs each video sequentially; the same ≥ 9 paths batch rule applies as for images.

| | fast (default) | accurate |
|---|---|---|
| Encoder | mosaic (4×4 grid, 128 frames, up to 8 mosaics) | frames-transcript (1fps, whisper medium, no speedup) |
| Output | 50-word description + tags | ~200-word summary + tags + scene breakdown |
| Tokens | 512 max | 1536 max |
| Audio | none | whisper medium, 1.0× speed |
## Audio

| | fast (default) | accurate |
|---|---|---|
| Encoder | audio-transcribe (whisper medium, 1.0×) | audio-transcribe (whisper medium, 1.0×) |
| Output | raw timestamped transcript | ~80-word summary from transcript |
| Tokens | — | 512 max |
Transcription backends (auto-detected by priority, override via `--encode.backend` or `mm config set transcription.backend`):

| Backend | Priority | Device | Notes |
|---|---|---|---|
| `mlx` | 10 | Apple Metal GPU | Requires `mm[mlx]` extra |
| `ctranslate2` | 20 | CPU (int8) / CUDA (float16) | Included in core install |
| `openai` | 30 | Remote API | Any OpenAI-compatible `/v1/audio/transcriptions` endpoint |
Selecting a backend — precedence from most to least specific:

- CLI flag (one-off): `mm cat audio.mp3 --encode.backend openai`
- Pipeline YAML (top-level `encode.backend:`)
- Global default (`mm config set transcription.backend openai`), persisted in `[transcription]` of `~/.config/mm/mm.toml`
- Environment (`MM_TRANSCRIPTION_BASE_URL`, openai backend only)
- Auto-detect (mlx → ctranslate2 → openai)
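The precedence chain above can be sketched as a first-match resolver. This is a simplified illustration (parameter names are assumptions, not the actual internals):

```python
# Auto-detect order mirrors the priority column: 10 (mlx) before 20 before 30.
BACKEND_PRIORITY = ["mlx", "ctranslate2", "openai"]

def resolve_backend(cli=None, pipeline_yaml=None, config=None,
                    env=None, available=()):
    """First non-empty source wins, most specific to least specific."""
    for explicit in (cli, pipeline_yaml, config):
        if explicit:
            return explicit
    if env and env.get("MM_TRANSCRIPTION_BASE_URL"):
        return "openai"  # env var only applies to the openai backend
    for name in BACKEND_PRIORITY:  # auto-detect among installed backends
        if name in available:
            return name
    raise RuntimeError("no transcription backend available")

assert resolve_backend(cli="openai", config="mlx") == "openai"
assert resolve_backend(available={"ctranslate2", "openai"}) == "ctranslate2"
```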
Python API encoders for audio:

| Name | Description |
|---|---|
| `audio-base64` | (default in `to_messages`) Raw base64-encoded audio for native VLM input |
| `audio-transcribe` | Whisper transcript as text; supports `backend`/`base_url`/`api_key` kwargs |
| `audio-gemini` | Pass audio file as a Gemini Part |
## Document (PDF only)

| | fast (default) | accurate |
|---|---|---|
| Encoder | page-text (pypdfium2, 1 page/message) | page-text (pypdfium2, 1 page/message) |
| Output | concatenated page text | lossless markdown restructuring |
| Tokens | — | 16384 max |
## Document (DOCX / PPTX) and Text / Code / Config

Passthrough in all modes — raw file content extracted via `libreoffice-rs` (DOCX/PPTX/XLSX/ODS/ODT/ODP) or `read_text` (text/code/config). No pipeline, no LLM; mode is a no-op.
## Caching

The metadata tier is cached in `files.text_preview`, keyed by `content_hash` (populated by `extract_meta`; reused on every subsequent `cat` of the same file, regardless of mode).

The unified `extractions` table (SQLite at `~/.local/share/mm/mm.db`) caches both fast and accurate pipeline outputs (the metadata tier never writes here — it lives in `files`).
- Cache key (`extractions`): `content_hash × profile × model × mode × overrides`
- Same file with different modes/profiles/overrides → separate cache entries
- `--no-cache`: bypasses read, evicts existing entry, forces fresh run (applies to fast/accurate for image/video/audio/PDF; the metadata tier is always read from `files`, and `kind=text` + non-PDF documents ignore `--no-cache` since their content is deterministic)
- Cache hit indicator: footer shows `cached • 36ms • 412.8 KB • 7.0 MB/s`
- Embedding: on cache miss with accurate mode, `embed_file_chunks` auto-generates Gemini embeddings
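A sketch of how such a composite cache key could be derived. This is illustrative only — the real `extractions` table likely stores the components as separate columns rather than a single digest:

```python
import hashlib
import json

def extraction_cache_key(content_hash, profile, model, mode, overrides=None):
    """Hash the full key tuple so any differing component yields a new entry."""
    payload = json.dumps(
        [content_hash, profile, model, mode, overrides or {}],
        sort_keys=True,  # stable serialization regardless of dict ordering
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k_fast = extraction_cache_key("abc123", "default", "gemini-flash", "fast")
k_accurate = extraction_cache_key("abc123", "default", "gemini-flash", "accurate")
assert k_fast != k_accurate  # different mode → separate cache entry
```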
## Verbose (`--verbose` / `-v`)

Pipeline execution tree shown after content:

- Encode-only pipelines (audio fast, document fast): single `└─` node
- Encode + generate: `├─ encode`, `└─ generate`
- Generate line: `profile_name • elapsed • prompt→completion tokens`
## Pipeline Customization

### Built-in pipelines

```
pipelines/
  image/     fast.yaml  accurate.yaml
  video/     fast.yaml  accurate.yaml
  audio/     fast.yaml  accurate.yaml
  document/  fast.yaml  accurate.yaml
```
### Override mechanisms (priority order)

- `-p pipeline.yaml` — explicit YAML file
- `-p encoder_name` — named encoder (e.g. `tile`, `mosaic`, `page-text`)
- `~/.config/mm/pipelines/{kind}/{mode}.yaml` — user override directory
- Built-in `pipelines/{kind}/{mode}.yaml`
### Pipeline YAML structure

```yaml
kind: image
mode: fast
encode:
  strategy: resize        # registered encoder name
  strategy_opts:
    max_width: 512        # encoder-specific options
generate:                 # optional — omit for encode-only
  prompt: "Describe..."   # supports (unknown), {content}, {transcript}
  max_tokens: 256
  temperature: 0.1        # optional
  json_mode: false        # optional
```
### CLI overrides

Namespaced flags override individual pipeline fields:

- `--encode.strategy resize` — swap encoder
- `--encode.strategy_opts max_width=768` — override a single `strategy_opts` entry (repeatable; values are coerced to int/float/bool when possible, e.g. `--encode.strategy_opts max_width=768 --encode.strategy_opts fps=5`)
- `--encode.pyfunc transform.py` — custom Python transform
- `--generate.prompt "..."` — override prompt
- `--generate.max-tokens 512` — override token limit
- `--generate.temperature 0.5` — override temperature
- `--print-pipeline image/accurate` — print the YAML source of a built-in pipeline (accepts `<kind>/<mode>`, useful as a starting point for a custom pipeline)
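The value-coercion rule for repeated `--encode.strategy_opts` flags can be sketched as follows (an assumption-level illustration, not the tool's actual parser):

```python
def coerce(value: str):
    """Coerce a strategy_opts value to bool/int/float when possible."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave as string if nothing matched

def parse_opts(pairs):
    """Parse repeated `key=value` flag values into a typed dict."""
    return {k: coerce(v) for k, v in (p.split("=", 1) for p in pairs)}

assert parse_opts(["max_width=768", "fps=5"]) == {"max_width": 768, "fps": 5}
assert coerce("0.1") == 0.1 and coerce("true") is True
```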
## Output Formats

- TTY (default): Rich-formatted with syntax highlighting for code files
- Piped (default): plain text, no ANSI codes
- `--format json`: `{"path", "mode", "content"}`
- `--format dataset-jsonl`: one JSON object per line with metadata
- `--format dataset-hf`: HuggingFace-compatible dataset export (requires `--output-dir`)
- Multi-file separator: `--- path (kind, sizeB) ---`
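A minimal illustration of the `--format dataset-jsonl` shape — one JSON object per line. The exact metadata fields beyond `path`/`mode`/`content` are not specified here, so this sketch uses just those three:

```python
import json

records = [
    {"path": "a.jpg", "mode": "fast", "content": "a red bicycle ..."},
    {"path": "b.pdf", "mode": "fast", "content": "Page 1 text ..."},
]

# Serialize: one compact JSON object per line, no trailing commas or array wrapper.
jsonl = "\n".join(json.dumps(r) for r in records)

# Round-trip: each line parses independently.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```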
## Footer

Always shown (dimmed). Examples: `836ms • 38.2 KB • 45.7 KB/s`, `cached • 36ms • 412.8 KB • 7.0 MB/s`

- Throughput auto-scales: B/s → KB/s → MB/s → GB/s
- `cached` prefix when served from the extractions cache
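The auto-scaling throughput figure can be reproduced with a simple formatter (a sketch consistent with the documented examples, not the tool's own code):

```python
def human_rate(bytes_per_s: float) -> str:
    """Auto-scale a throughput figure: B/s → KB/s → MB/s → GB/s."""
    for unit in ("B/s", "KB/s", "MB/s"):
        if bytes_per_s < 1024:
            return f"{bytes_per_s:.1f} {unit}"
        bytes_per_s /= 1024
    return f"{bytes_per_s:.1f} GB/s"

assert human_rate(45.7 * 1024) == "45.7 KB/s"
assert human_rate(7.0 * 1024 * 1024) == "7.0 MB/s"
```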
## Introspection

- `--list-pipelines`: show all built-in and user-override pipeline YAML files
- `--list-encoders`: show all registered encoder strategies with parameters