Pipelines¶
Pipelines configure a two-stage flow for LLM-based media understanding: encode (via an encoder), then generate (an LLM call) to produce text output.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '14px', 'lineColor': '#666'}}}%%
graph LR
    file["Media File"]:::input
    subgraph encode ["Encode"]
        encoder["Encoder"]:::encode
    end
    subgraph generate ["Generate"]
        llm["MLLM"]:::generate
    end
    out["stdout"]:::output
    file --> encoder --> llm --> out
    classDef input fill:#e8f4fd,stroke:#4a90d9,stroke-width:1.5px,color:#1a3a5c,rx:8
    classDef encode fill:#e8f5e9,stroke:#4caf50,stroke-width:1.5px,color:#1b5e20,rx:8
    classDef generate fill:#fce4ec,stroke:#e57373,stroke-width:1.5px,color:#6a1b1b,rx:8
    classDef output fill:#f5f5f5,stroke:#bdbdbd,stroke-width:1.5px,color:#424242,rx:8
    style encode fill:#f1f8e9,stroke:#66bb6a,stroke-width:1px,rx:10
    style generate fill:#fff0f0,stroke:#ef9a9a,stroke-width:1px,rx:10
```
Each pipeline is a YAML file under `pipelines/{kind}/{mode}.yaml` that references an encoder from `mm/encoders/` and configures LLM generation parameters.
Pipeline scope. Pipelines apply to image / video / audio / PDF — the binary media kinds.
`--mode fast` (default) runs the fast pipeline; `--mode accurate` runs the LLM-heavy one. `kind=text` and non-PDF documents (`.docx`/`.pptx`) follow a passthrough flow regardless of mode, so `-p`, `--encode.*`, and `--generate.*` are no-ops for those.
```bash
mm cat photo.jpg -m fast -p resize       # named encoder (fast pipeline)
mm cat video.mp4 -m fast -p shot-mosaic  # scene-aware video encoder

# Override pipeline config from CLI
mm cat photo.jpg -m accurate --encode.strategy tile
mm cat photo.jpg -m accurate --generate.max-tokens 1024 --generate.temperature 0.5
mm cat photo.jpg -m accurate --prompt "Describe in one sentence." --model moondream2

# Override individual strategy_opts entries (repeatable, KEY=VALUE form)
mm cat photo.jpg -m accurate --encode.strategy_opts max_width=768
mm cat video.mp4 -m accurate --encode.strategy_opts max_width=768 --encode.strategy_opts fps=5

# Inspect the YAML source of a built-in pipeline (use as a template for your own)
mm cat --print-pipeline image/accurate

# Load explicit pipeline YAML (repeatable, dispatched by kind)
mm cat photo.jpg -m accurate -p ~/my-image-pipeline.yaml
mm cat *.jpg *.mp4 -m accurate -p image.yaml -p video.yaml

# Custom Python transform via pyfunc
mm cat photo.jpg -m accurate --encode.pyfunc ~/my_filter.py

# Per-call model + extra_body for provider-specific knobs
# (e.g. vlmrt's method/method_params/video_fps/image_resolution).
mm --profile vlmrt cat photo.jpg -m accurate \
  --model moondream2 \
  --generate.extra-body '{"method":"detect","method_params":{"object":"fish"}}'
```
generate.model — pinning a model per pipeline¶
The `generate` block accepts an optional `model:` string that overrides the active profile's default model whenever this pipeline is used. Leaving it unset (or `null`) means "use the profile model". Useful for shipping a pipeline that always targets a specific deployment-side model (e.g. an OCR pipeline that always wants `paddleocr-v5`).
```yaml
# ~/.config/mm/pipelines/image/accurate.yaml
kind: image
mode: accurate
generate:
  prompt: Read every line of text on this image.
  model: paddleocr-v5
  extra_body:
    method: ocr
```
A CLI `--model` / `--generate.model` always wins over a pipeline-pinned `model`.
generate.extra_body — provider-specific knobs¶
The `generate` block accepts an arbitrary `extra_body:` mapping that is forwarded verbatim to the OpenAI SDK's `extra_body=` argument. Use it when your endpoint needs request fields beyond the standard OpenAI surface (`temperature`/`max_tokens`/`json_mode` are still first-class fields).
```yaml
# ~/.config/mm/pipelines/image/accurate.yaml
kind: image
mode: accurate
encode:
  strategy: resize
  strategy_opts: { max_width: 1024 }
generate:
  prompt: Describe this image.
  max_tokens: 512
  extra_body:
    method: caption
    method_params:
      length: normal
    image_resolution: "448x448"
```
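Conceptually, the pipeline above resolves to a single chat completion request. A minimal sketch of the equivalent OpenAI SDK call, where the client setup, model name, and message parts are illustrative assumptions rather than mm's actual internals:

```python
from openai import OpenAI

# base_url/api_key come from the active profile (mm.toml); values here are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-placeholder")

# Encoder output: OpenAI-compatible content parts (prompt text + encoded image).
parts = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]

response = client.chat.completions.create(
    model="moondream2",          # assumed profile default (nothing pinned in this YAML)
    messages=[{"role": "user", "content": parts}],
    max_tokens=512,              # first-class generate.* field
    extra_body={                 # forwarded verbatim from the YAML
        "method": "caption",
        "method_params": {"length": "normal"},
        "image_resolution": "448x448",
    },
)
print(response.choices[0].message.content)
```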
Anything passed via `mm cat --generate.extra-body '<json>'` deep-merges on top of the pipeline-level `extra_body`, so per-call CLI flags can override individual keys without discarding the YAML defaults. The combined `extra_body` (along with the resolved model) is included in the L2 cache key, so cached results are invalidated when knobs change.
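A deep merge recurses into nested mappings instead of replacing them wholesale. Below is a minimal Python sketch of that behavior, with a hypothetical `deep_merge` helper and a simplified cache-key recipe; mm's actual implementation may differ:

```python
import hashlib
import json


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override onto base, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Pipeline YAML defaults vs. a per-call CLI override:
yaml_extra = {"method": "caption", "method_params": {"length": "normal"}}
cli_extra = {"method_params": {"length": "long"}}  # --generate.extra-body '{"method_params":{"length":"long"}}'

effective = deep_merge(yaml_extra, cli_extra)
# {"method": "caption", "method_params": {"length": "long"}}; "method" survives the override.

# The merged knobs plus the resolved model feed the L2 cache key,
# so changing either invalidates cached results (key shape is illustrative):
cache_key = hashlib.sha256(
    json.dumps({"model": "moondream2", "extra_body": effective}, sort_keys=True).encode()
).hexdigest()
```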
Override surfaces — full precedence rules¶
For every cat invocation the effective LLM call comes from three layers
(right-most wins on conflict):
profile (mm.toml) -> pipeline YAML (`generate.*`) -> CLI flags on `cat`

| profile (mm.toml) | pipeline YAML (`generate.*`) | CLI flags on `cat` |
|-------------------|------------------------------|--------------------|
| `base_url`        | `prompt`                     | `--prompt` / `--generate.prompt` |
| `api_key`         | `model`                      | `--model` / `--generate.model` |
| `model` (default) | `max_tokens`                 | `--generate.max-tokens` |
|                   | `temperature`                | `--generate.temperature` |
|                   | `json_mode`                  | `--generate.json-mode` |
|                   | `extra_body`                 | `--generate.extra-body` (deep-merged onto YAML) |
`base_url` and `api_key` are profile-only — they have no pipeline or CLI override.
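For any single field, the layering reduces to "take the first value set, scanning right to left". A minimal sketch with a hypothetical `resolve` helper (illustrative only, not mm's config code):

```python
def resolve(field: str, profile: dict, pipeline: dict, cli: dict):
    """Right-most layer wins: CLI flag > pipeline YAML > profile default."""
    for layer in (cli, pipeline, profile):
        if layer.get(field) is not None:
            return layer[field]
    return None


profile = {"model": "moondream2"}     # mm.toml default
pipeline = {"model": "paddleocr-v5"}  # generate.model pinned in the YAML
cli = {"model": None}                 # no --model passed

assert resolve("model", profile, pipeline, cli) == "paddleocr-v5"

cli = {"model": "other-model"}        # --model other-model on the command line
assert resolve("model", profile, pipeline, cli) == "other-model"
```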
Example my_filter.py¶
A pyfunc file must define `transform(parts, context) -> list[dict]`. `parts` is a list of OpenAI-compatible message content dicts (e.g. `{"type": "text", ...}` or `{"type": "image_url", ...}`); `context` is file metadata (name, kind, size, etc.).
```python
# ~/my_filter.py — keep only image parts and prepend a custom instruction
def transform(parts: list[dict], context: dict) -> list[dict]:
    images = [p for p in parts if p.get("type") == "image_url"]
    header = {"type": "text", "text": f"Analyze {context['name']} in detail."}
    return [header, *images]
```
Inline variants also work inside a pipeline YAML:
```yaml
encode:
  strategy: resize
  pyfunc: |
    def transform(parts, context):
        return [p for p in parts if p.get("type") == "image_url"]
```
Encoders¶
See ENCODERS.md for the full encoder reference — all built-in encoders, parameters, planned encoders, and how to write custom encoders.