mm Spec
Fast, multimodal context for agents. Rust core, Python API, Unix CLI.
Legend: [x] implemented, [ ] roadmap, [~] partial/stubbed
mm
├── Metadata scan (Rust core, ~0.02ms/file)
│ ├── [x] Parallel directory walk (ignore crate + per-thread batching, zero lock contention)
│ ├── [x] Gitignore-aware (.gitignore, .git/info/exclude, global)
│ ├── [x] 14-column Arrow schema (path, name, stem, ext, size, modified, created, mime, kind, is_binary, depth, parent, width, height)
│ ├── [x] 9 file kind variants (code, image, document, video, audio, data, config, text, other)
│ ├── [x] Extension-based classification (~100+ extensions mapped)
│ ├── [x] MIME inference via mime_guess
│ ├── [x] Binary detection (extension + kind heuristics)
│ ├── [x] Parallel image dimension enrichment (rayon, header-only reads)
│ ├── [x] CompactString for all string fields (SSO, no heap for short paths)
│ ├── [x] Arrow RecordBatch builder with typed column builders
│ ├── [x] Parquet I/O (ZSTD level 3 compression)
│ └── [x] Manifest-based incremental cache (mtime + size staleness check)
│
├── Metadata tier — Local Content Extraction (Rust extractors + Python ffmpeg; no LLM)
│ ├── Code / Text / Config
│ │ ├── [x] Line count, word count
│ │ ├── [x] Text preview (first 500 chars)
│ │ ├── [x] Content hash (xxh3, full file)
│ │ └── [x] Language detection (~30 languages from extension)
│ ├── Images
│ │ ├── [x] Dimensions (WxH, header-only via mmap)
│ │ ├── [x] Content hash (xxh3 via mmap, zero-copy)
│ │ ├── [x] Magic-byte MIME detection (infer crate)
│ │ └── [x] EXIF extraction (camera, date, GPS lat/lon, orientation)
│ ├── Video
│ │ ├── [x] Native MP4/MOV parsing (mp4parse crate, no subprocess)
│ │ ├── [x] Native MKV/WebM parsing (matroska crate, no subprocess)
│ │ ├── [x] Resolution, duration, FPS, codec, audio codec, has_audio
│ │ ├── [x] Content hash (xxh3 via mmap)
│ │ ├── [x] Keyframe mosaic extraction (ffmpeg -skip_frame nokey, single-pass, ~5000x realtime)
│ │ ├── [x] Scene-change mosaic extraction (ffmpeg select='gt(scene,T)', full decode)
│ │ ├── [x] Configurable tile grid (COLSxROWS), thumb width, JPEG quality
│ │ └── [ ] Optical flow / motion summary (mpdecimate)
│ ├── Audio
│ │ ├── [x] Audio extraction at Nx speed (atempo, chained for >2x)
│ │ ├── [x] Mono 16kHz PCM downmix (Whisper-optimized)
│ │ ├── [x] Configurable output format (wav, mp3, flac)
│ │ └── [x] Pure-Rust audio metadata via symphonia (mp3, wav, flac, aac, ogg, opus)
│ ├── Documents (PDF)
│ │ ├── [x] Text extraction via pypdfium2 (Python CLI side)
│ │ ├── [x] Page-by-page extraction
│ │ ├── [x] PDF page mosaic grids (pypdfium2 render → Pillow tile, ~10ms/page)
│ │ ├── [x] Configurable tile grid, thumbnail width, max pages, JPEG quality
│ │ └── [x] Rust-side content hash for documents (xxh3 via DocumentExtractor)
│ └── Hashing
│ ├── [x] fast_fingerprint — partial hash (first+last 64KB + size), ~33x faster on 10MB
│ ├── [x] full_hash_mmap — full xxh3 via mmap, zero-copy
│ ├── [x] full_hash_read — streaming fallback for special files
│ ├── [x] content_hash() — direct Python-callable xxh3 hash (no Scanner overhead)
│ └── [x] perceptual_hash() — direct Python-callable pHash for images
│
├── Accurate mode (LLM pipelines)
│ ├── [x] OpenAI Python SDK (openai>=1.0, any compatible API)
│ ├── [x] Image captioning (base64 vision API)
│ ├── [x] File content description
│ ├── [x] Video understanding via keyframe mosaic + LLM (auto in accurate mode)
│ ├── [x] Audio description via metadata + LLM (auto in accurate mode)
│ ├── [x] Configurable via profiles: built-in default + reserved ollama + custom add/use
│ ├── [x] think=false + reasoning_effort="none" + temperature=0.1
│ ├── [x] Accurate-mode errors propagate directly (no silent fallback to fast mode)
│ ├── [x] --mode fast|accurate per-modality extraction strategies (default 'fast')
│ ├── [x] Audio transcription via ffmpeg + modular backends (2x speed, greedy beam=1)
│ ├── [x] Transcription backend ABC with pluggable registry (mm.common.audio)
│ ├── [x] Backend auto-select: MLX Metal GPU (10) > CTranslate2 CPU/CUDA (20) > OpenAI API (30)
│ ├── [x] OpenAI-compatible transcription backend (/v1/audio/transcriptions, custom base_url)
│ ├── [x] TranscriptionConfig in mm.toml (backend, base_url, api_key)
│ ├── [x] Parallel visual + audio extraction (ThreadPoolExecutor)
│ ├── [x] Video: mosaic (4x4 @ 1500px) + transcript → LLM markdown
│ ├── [x] Image: fast (10 words + 5 tags) / accurate (200 words + 10 tags + objects)
│ ├── [x] Document extraction via pypdfium2 (PDF) / libreoffice-rs (DOCX/PPTX/ODS/ODT/ODP)
│ └── [x] Embedding generation via Gemini API (text, image, audio, video, document → chunks_vec)
│
├── Python API (Context class)
│ ├── [x] Metadata scan on construction (~5ms for 249 real files)
│ ├── [x] to_polars() — zero-copy Arrow → Polars
│ ├── [x] to_pandas() — Arrow → Pandas
│ ├── [x] to_arrow() — raw PyArrow Table
│ ├── [x] sql(query) — SQLite SQL against 'files' table
│ ├── [x] filter(kind, ext, min_size, max_size) — chainable, returns new Context
│ ├── [x] cat(path) / head(path, n) / tail(path, n)
│ ├── [x] grep(pattern, kind) — regex search across file contents
│ ├── [x] show(limit, columns) — Rich table display
│ ├── [x] info() — Rich summary panel
│ ├── [x] save() — persist to .mm/index.parquet
│ ├── [x] Context(session_id=...) / Context.new_session() — external session id
│ ├── [x] add(str | Path | PIL.Image.Image, role) — role-aware refs; strings inline as text
│ ├── [x] ref_for(path) / global_ref(path) / refs — kind-prefixed deterministic ref ids
│ └── [x] Context.resolve("<session_id>/<ref_id>") — global cross-user lookup
│
├── CLI Commands (9 total: 6 core + bench + config + profile; Typer, Unix-philosophy composability)
│ ├── [x] --version/-v global flag
│ ├── [x] find — find/list files, tree view (--tree), schema (--schema), columns (--columns), name filter (--name, string/regex via Rust; -i/--ignore-case for case-insensitive)
│ ├── [x] peek — raw file metadata (dimensions / EXIF / codec / mime / hash)
│ ├── [x] cat — auto-detected content extraction (fast/accurate mode) → [full spec](cat.md)
│ │ ├── [x] head/tail via -n (replaces old head/tail commands)
│ │ ├── [x] --mode fast|accurate (default 'fast')
│ │ ├── [x] video accurate: parallel mosaic + whisper → LLM (102x realtime)
│ │ ├── [x] audio accurate: ffmpeg 2x + whisper → LLM transcript summary
│ │ ├── [x] image accurate: fast (10w+5tags) / accurate (200w+10tags+objects)
│ │ ├── [x] document accurate: pypdfium2 PDF → text → LLM
│ │ ├── [x] --encode.*, --generate.* namespaced flags
│ │ ├── [x] --no-cache flag bypasses extractions cache (both fast and accurate modes)
│ │ ├── [x] unified extractions caching for both fast and accurate modes
│ │ ├── [x] verbose pipeline tree (-v): encode/generate timing + token counts
│ │ └── [x] -p pipeline.yaml / -p encoder_name for custom pipelines
│ ├── [x] grep — content search with context lines (like rg), --pre-index for on-demand semantic indexing
│ ├── [x] sql — SQLite SQL on file index, --pre-index for on-demand metadata indexing before query
│ ├── [x] wc — count files, size, lines (est.), tokens (est.)
│ ├── [x] config — extraction mode settings (show, init, set, reset-db, reset-profiles, reset)
│ ├── [x] profile — LLM profile management (list, add, update, use, remove; 3 reserved: default, ollama, gemini)
│ ├── [x] bench — 24-command benchmark suite (metadata×10, fast×8, accurate×6) with bits/s throughput
│ └── [ ] context — LLM-ready context payload builder (token budgeting)
│
├── Output Modes
│ ├── [x] TTY stdout → Rich formatted tables/panels with color
│ ├── [x] Piped stdout → plain TSV/text (machine-readable, no ANSI)
│ ├── [x] --format=json flag → JSON on any command
│ ├── [x] Piped stdin → read newline-delimited paths (composability)
│ ├── [x] Pipe detection via isatty() (no select.select() — block-reads when stdin is not a TTY)
│ └── [x] SIGPIPE handling (no BrokenPipeError when piping to head/tail)
│
├── Data Transfer (Rust → Python)
│ ├── [x] Arrow IPC serialization (RecordBatch → bytes → pyarrow.ipc.open_stream)
│ ├── [x] Rust-native JSON (serde_json, bypasses Arrow+pyarrow for --format=json paths)
│ ├── [x] Rust-native filtered/sorted output (kind, ext, size, sort, limit — all in Rust)
│ ├── [x] Zero-copy to Polars (polars.from_arrow)
│ ├── [x] SQLite + sqlite-vec for storage and vector search
│ └── [~] PyCapsule FFI (abandoned — compatibility issues with pyarrow)
│
├── Performance
│ ├── Metadata walk: ~5ms / 1K files, ~16ms / 10K files
│ ├── Metadata full pipeline: ~7ms / 1K mixed files (with image dims)
│ ├── Metadata real data: ~5ms / 249 files (~/data/1-demo)
│ ├── CLI cold start: ~58ms (find --format json via Rust fast path)
│ ├── CLI cold start: ~66ms (find with Rich TTY output)
│ ├── Fast code extraction: ~8μs/file
│ ├── Fast image extraction: ~18μs/file (mmap)
│ ├── Fast video metadata (native): ~10ms (6.4MB MP4, includes hash)
│ ├── Partial hash 10MB: ~19μs (vs 610μs full, 33x speedup)
│ ├── Keyframe mosaic (86min video): ~820ms → 5 mosaic grids
│ ├── PDF page mosaic (68 pages): ~280ms → 2 mosaic grids
│ ├── Audio 2x (163s video): ~200ms → 2.5MB Whisper-ready WAV
│ ├── Accurate video (17min, fast mode): ~9.9s total = 102x realtime
│ │ ├── Visual: 16 frames + 4x4 mosaic @ 1500px — 375ms (parallel)
│ │ ├── Audio: ffmpeg 2x + whisper tiny MLX Metal — 3.0s (parallel)
│ │ ├── LLM: qwen3.5:0.8b mosaic+transcript → markdown — 5.2s
│ │ └── Optimization path: beam=5→1 (1.5x), CTranslate2→MLX (5.9x), parallel (6%)
│ ├── Accurate image (fast pipeline): ~1.0s (qwen3.5:0.8b, Ollama local)
│ └── Accurate image (accurate pipeline): ~2.6s (qwen3.5:0.8b, Ollama local)
│
├── Tests
│ ├── Rust: 75 tests (meta, walk, detect, schema, table, code, image, video, audio, document, hash)
│ ├── Python: 587 tests (CLI, Context API, refs/sessions, pipe, metadata/fast/accurate, config, transcription backends, scenes, docling, bench)
│ ├── Criterion benchmarks: metadata_walk, metadata_index, hash_strategies, metadata_extract, find_filter
│ ├── mm bench: 24 commands (metadata×10, fast×8, accurate×6) with bits/s throughput
│ └── pytest-benchmark: 11 benchmarks (metadata, fast, ffmpeg, e2e)
│
└── Build & Tooling
├── [x] Maturin build backend (Rust → Python wheel)
├── [x] PyO3 stable ABI (abi3-py310)
├── [x] uv for all Python operations (venv, pip, run)
├── [x] Makefile targets (develop, test, bench, lint, fmt)
├── [x] Rust edition 2024, stable toolchain + clippy + rustfmt
└── [x] Dev deps: ruff, mypy, pytest, pytest-benchmark, criterion
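The fast_fingerprint strategy above (hash the first + last 64KB plus the file size, skipping the middle) is simple to sketch in Python. The Rust core uses xxh3; here stdlib `hashlib.blake2b` stands in as the hash primitive, and `fast_fingerprint` is an illustrative helper, not mm's actual API:

```python
import hashlib
import os

CHUNK = 64 * 1024  # 64KB window at each end of the file

def fast_fingerprint(path: str) -> str:
    """Partial hash: first 64KB + last 64KB + file size.

    Sketch of the strategy described in the spec; the Rust core
    uses xxh3 via mmap, blake2b is only a stand-in here.
    """
    size = os.path.getsize(path)
    h = hashlib.blake2b(digest_size=16)
    h.update(size.to_bytes(8, "little"))   # mix the size into the digest
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))            # first 64KB (or whole small file)
        if size > 2 * CHUNK:
            f.seek(size - CHUNK)           # jump over the middle
            h.update(f.read(CHUNK))        # last 64KB
        else:
            h.update(f.read())             # small file: hash the remainder
    return h.hexdigest()
```

Two large files that differ only in the skipped middle collide by design; files under 128KB are hashed in full, so the fingerprint degrades gracefully to a full hash.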
For each modality (image, video, documents like PDFs), I'd like a few different strategies to extract metadata/semantics with varying degrees of detail per mode. For example:

- image:
  - mode=fast -> describe the image in 10 words or less + extract 5 keyword tags
  - mode=accurate -> describe the image in detail (200 words) + extract up to 10 keyword tags + extract up to 10 objects/people/faces/logos in the image
- documents (PDFs, Word documents, etc.):
  - pypdfium2 for PDF text extraction, libreoffice-rs for Office formats
  - ignore image/video/audio here, as we have other ways to extract metadata/semantics for them (detailed extraction is not needed)
- audio:
  - mode=fast -> transcription via ffmpeg + whisper tiny, with audio sped up 2x
  - mode=accurate -> transcription via ffmpeg + whisper medium, no speed-up
- video:
  - mode=fast
    - <5min: no shot detection; uniformly sample 16 keyframes across the video, build a single 4x4 mosaic grid image, and make a request with the audio transcription (ffmpeg + whisper tiny, audio sped up 2x)
    - >=5min (e.g. 1hr): shot detection with pyscenedetect, uniformly sample 16 shots from the entire video, build a single 4x4 mosaic + ffmpeg/whisper-tiny audio transcription (at 2x speed)
  - mode=accurate
    - shot detection with pyscenedetect, uniformly sample 8x16 = 128 shots from the entire video, build 8 4x4 mosaic images, and make a request with the audio transcription (ffmpeg + whisper medium)
- collect sys info (ffmpeg + GPU)
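The Nx speed-up used for audio above rides on ffmpeg's `atempo` filter, which only accepts factors in [0.5, 2.0] per instance, hence the spec's "atempo, chained for >2x": larger speed-ups become a chain of filters whose factors multiply to the target. A minimal sketch of that chaining logic (an assumed helper, not mm's actual code):

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio-filter expression for an overall tempo change.

    A single atempo instance only accepts factors in [0.5, 2.0], so
    speeds above 2x are expressed as a chain whose factors multiply
    to the requested speed. Sketch only; slow-down chaining (<0.5x)
    is left out for brevity.
    """
    if speed < 0.5:
        raise ValueError("speed must be >= 0.5")
    factors = []
    remaining = speed
    while remaining > 2.0:
        factors.append(2.0)   # peel off a full 2x stage
        remaining /= 2.0
    factors.append(remaining)  # final stage carries the leftover factor
    return ",".join(f"atempo={f:g}" for f in factors)
```

For example, a 4x speed-up yields `atempo=2,atempo=2`, which could then feed a Whisper-ready downmix along the lines of `ffmpeg -i in.mp4 -vn -ac 1 -ar 16000 -filter:a "atempo=2,atempo=2" out.wav` (mono 16kHz, as the spec's audio pipeline describes).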