Pipelines¶
Pipelines configure a 2-stage flow for LLM-based media understanding: encode (via an encoder) then generate (LLM call) to produce text output.
%%{init: {'look': 'neo'} }%%
graph LR
file("Media File")
subgraph encode ["Encode"]
encoder("Encoder")
end
subgraph message ["Message"]
msg("Messages")
end
subgraph generate ["Generate"]
llm("MLLM")
end
out("stdout")
file --> encoder --> msg --> llm --> out
style encode rx:10px,ry:10px
style message rx:10px,ry:10px
style generate rx:10px,ry:10px
Each pipeline is a YAML file under pipelines/{kind}/{mode}.yaml that references
an encoder from mm/encoders/ and configures LLM generation parameters.
Pipeline scope. Pipelines apply to image / video / audio / PDF — the binary media kinds.
--mode fast (default) runs the fast pipeline; --mode accurate runs the LLM-heavy one. kind=text and non-PDF documents (.docx/.pptx) follow a passthrough flow regardless of mode, so -p, --encode.*, and --generate.* are no-ops for those.
mm cat photo.jpg -m fast -p resize # named encoder (fast pipeline)
mm cat video.mp4 -m fast -p shot-mosaic # scene-aware video encoder
# Override pipeline config from CLI
mm cat photo.jpg -m accurate --encode.strategy tile
mm cat photo.jpg -m accurate --generate.max-tokens 1024 --generate.temperature 0.5
mm cat photo.jpg -m accurate --prompt "Describe in one sentence." --model moondream2
# Override individual strategy_opts entries (repeatable, KEY=VALUE form)
mm cat photo.jpg -m accurate --encode.strategy_opts max_width=768
mm cat video.mp4 -m accurate --encode.strategy_opts max_width=768 --encode.strategy_opts fps=5
# Inspect the YAML source of a built-in pipeline (use as a template for your own)
mm cat --print-pipeline image/accurate
# Load explicit pipeline YAML (repeatable, dispatched by kind)
mm cat photo.jpg -m accurate -p ~/my-image-pipeline.yaml
mm cat *.jpg *.mp4 -m accurate -p image.yaml -p video.yaml
# Custom Python transform via pyfunc
mm cat photo.jpg -m accurate --encode.pyfunc ~/my_filter.py
# Per-call model + extra_body for provider-specific knobs
# (e.g. vlmrt's method/method_params/video_fps/image_resolution).
mm --profile vlmrt cat photo.jpg -m accurate \
--model moondream2 \
--generate.extra-body '{"method":"detect","method_params":{"object":"fish"}}'
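The repeatable KEY=VALUE form accepted by --encode.strategy_opts above can be pictured as a small parser that folds each pair into one mapping. The sketch below is illustrative only: parse_opts and _coerce are hypothetical names, and the type coercion (int, then float, else string) is an assumption, not a description of how mm actually types the values.

```python
def _coerce(raw: str):
    """Assumed coercion: try int, then float, otherwise keep the raw string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_opts(pairs: list[str]) -> dict:
    """Fold repeated KEY=VALUE strings into one dict; later pairs win on duplicate keys."""
    opts = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        opts[key] = _coerce(raw)
    return opts


# Mirrors: --encode.strategy_opts max_width=768 --encode.strategy_opts fps=5
strategy_opts = parse_opts(["max_width=768", "fps=5"])
print(strategy_opts)
```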
generate.model — pinning a model per pipeline¶
The generate block accepts an optional model: string that overrides
the active profile's default model whenever this pipeline is used.
Leaving it unset (or null) means "use the profile model". Useful for
shipping a pipeline that always targets a specific deployment-side model
(e.g. an OCR pipeline that always wants paddleocr-v5).
# ~/.config/mm/pipelines/image/accurate.yaml
kind: image
mode: accurate
generate:
  prompt: Read every line of text on this image.
  model: paddleocr-v5
  extra_body:
    method: ocr
CLI --model / --generate.model always wins over a pipeline-pinned
model.
generate.extra_body — provider-specific knobs¶
The generate block accepts an arbitrary extra_body: mapping that is
forwarded verbatim to the OpenAI SDK's extra_body= argument. Use it when
your endpoint needs request fields beyond the standard OpenAI surface
(temperature/max_tokens/json_mode are still first-class fields).
# ~/.config/mm/pipelines/image/accurate.yaml
kind: image
mode: accurate
encode:
  strategy: resize
  strategy_opts: { max_width: 1024 }
generate:
  prompt: Describe this image.
  max_tokens: 512
  extra_body:
    method: caption
    method_params:
      length: normal
    image_resolution: "448x448"
Anything passed via mm cat --generate.extra-body '<json>' deep-merges
on top of the pipeline-level extra_body, so per-call CLI flags can
override individual keys without discarding the YAML defaults. The
combined extra_body (along with the resolved model) is included in
the L2 cache key so cached results are invalidated when knobs change.
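To make the deep-merge behaviour concrete, here is a minimal sketch of how a CLI-supplied extra_body could be layered over the YAML one, and how the result might feed a cache key. The names deep_merge and cache_key are hypothetical, the "short" length value is made up for illustration, and the SHA-256-over-canonical-JSON scheme is an assumption rather than mm's actual key derivation.

```python
import hashlib
import json


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict: override's keys win, nested mappings merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def cache_key(model: str, extra_body: dict) -> str:
    """Assumed scheme: hash canonical JSON so any changed knob yields a new key."""
    payload = json.dumps({"model": model, "extra_body": extra_body}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Pipeline-level defaults, as in the YAML example above.
yaml_extra_body = {
    "method": "caption",
    "method_params": {"length": "normal"},
    "image_resolution": "448x448",
}
# What --generate.extra-body '{"method_params": {"length": "short"}}' would carry.
cli_extra_body = json.loads('{"method_params": {"length": "short"}}')

merged = deep_merge(yaml_extra_body, cli_extra_body)
print(merged)
print(cache_key("moondream2", merged) != cache_key("moondream2", yaml_extra_body))
```

Only method_params.length changes; method and image_resolution survive from the YAML, which is the point of merging rather than replacing.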
Override surfaces — full precedence rules¶
For every cat invocation the effective LLM call comes from four layers
(right-most wins on conflict):
profile (mm.toml) -> pipeline YAML (generate.*) -> encoder generate[mode] -> CLI flags on `cat`
base_url             prompt                        prompt (if set)           --prompt / --generate.prompt
api_key              model                         model (if set)            --model / --generate.model
model (default)      max_tokens                    max_tokens (if set)       --generate.max-tokens
                     temperature                                             --generate.temperature
                     json_mode                                               --generate.json-mode
                     extra_body                                              --generate.extra-body
                                                                             (deep-merged onto YAML)
Encoder generate override: Built-in encoders can declare a generate class variable that maps modes to Generate overrides. This layer sits between the pipeline YAML and CLI flags. Setting a mode to None suppresses the LLM call entirely (useful for transcript-only encoders like transcribe). CLI flags still win over encoder-level overrides.
base_url and api_key are profile-only — they have no pipeline or CLI
override.
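The four-layer precedence can be sketched as successive dictionary updates, with each later layer overriding only the fields it sets. This is an illustrative model, not mm's implementation: resolve_generate and its argument shapes are hypothetical, the profile values are placeholders, extra_body deep-merging is left out for brevity, and treating an encoder-level None as an unconditional short-circuit is an assumption.

```python
from typing import Optional

# Per the text, base_url and api_key can only come from the profile.
PROFILE_ONLY = {"base_url", "api_key"}


def _overrides(layer: dict) -> dict:
    """Keep only fields a later layer actually sets (None means 'inherit')."""
    return {k: v for k, v in layer.items() if v is not None and k not in PROFILE_ONLY}


def resolve_generate(profile: dict, pipeline: dict, encoder_generate: dict,
                     mode: str, cli: dict) -> Optional[dict]:
    """Apply layers left to right; return None when the encoder suppresses the LLM call."""
    effective = dict(profile)
    effective.update(_overrides(pipeline))
    if mode in encoder_generate:
        encoder_layer = encoder_generate[mode]
        if encoder_layer is None:  # e.g. a transcript-only encoder
            return None
        effective.update(_overrides(encoder_layer))
    effective.update(_overrides(cli))
    return effective


profile = {"base_url": "http://localhost:8000/v1", "api_key": "placeholder",
           "model": "profile-default"}
pipeline = {"prompt": "Read every line of text on this image.", "model": "paddleocr-v5"}
encoder_generate = {"accurate": {"max_tokens": 256}, "fast": None}
cli = {"model": "moondream2", "base_url": "http://ignored.example"}

accurate = resolve_generate(profile, pipeline, encoder_generate, "accurate", cli)
fast = resolve_generate(profile, pipeline, encoder_generate, "fast", cli)
print(accurate)
print(fast)
```

In the accurate case the CLI model beats the pipeline-pinned one, the prompt comes from the pipeline, max_tokens comes from the encoder layer, and base_url stays at the profile value.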
Example my_filter.py¶
A pyfunc file must define transform(parts, context) -> list[dict].
parts is a list of OpenAI-compatible message content dicts (e.g.
{"type": "text", ...} or {"type": "image_url", ...}); context is
file metadata (name, kind, size, etc.).
# ~/my_filter.py — keep only image parts and prepend a custom instruction
def transform(parts: list[dict], context: dict) -> list[dict]:
    images = [p for p in parts if p.get("type") == "image_url"]
    header = {"type": "text", "text": f"Analyze {context['name']} in detail."}
    return [header, *images]
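Because a pyfunc is plain Python, the transform above can be exercised on its own before it is passed to --encode.pyfunc. The sketch below repeats the function and calls it with hand-built inputs; the sample parts, the truncated data URL, and the context values are placeholders, and the exact set of context keys mm supplies is assumed from the description above.

```python
def transform(parts: list[dict], context: dict) -> list[dict]:
    # Same body as my_filter.py: keep image parts, prepend a custom instruction.
    images = [p for p in parts if p.get("type") == "image_url"]
    header = {"type": "text", "text": f"Analyze {context['name']} in detail."}
    return [header, *images]


# Hand-built stand-ins for what an encoder would produce.
sample_parts = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,AAAA"}},
]
sample_context = {"name": "photo.jpg", "kind": "image", "size": 1024}

result = transform(sample_parts, sample_context)
for part in result:
    print(part["type"])
```

The original text part is dropped and replaced by the new header, so the result holds one text part followed by the image part.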
Inline variants also work inside a pipeline YAML:
encode:
  strategy: resize
  pyfunc: |
    def transform(parts, context):
        return [p for p in parts if p.get("type") == "image_url"]
Dry run¶
Use --dry-run to inspect the fully resolved pipeline for a file without executing it. It shows the encoder, strategy options, model, and prompt that would be used:
mm cat photo.png --dry-run
mm cat video.mp4 -m accurate --dry-run
mm cat audio.mp3 -m accurate --encode.backend mlx --dry-run
Output is a dim preview block rendered to the terminal — no file reads, no LLM calls, no cache writes.
Encoders¶
See encoders.md for the full encoder reference — all built-in encoders, parameters, planned encoders, and how to write custom encoders.