Skip to content

mm sql

Run SQL queries against file metadata, LLM extraction outputs, and chunked content — without leaving the terminal.

Synopsis

mm sql "QUERY" [OPTIONS]
mm sql --list-tables

Options

Flag Short Type Description
--dir DIR -d path Directory to scan for the files table (default: .)
--format FORMAT -f enum Output format: rich (default), json, tsv, csv, dataset-jsonl, dataset-hf
--list-tables flag List available tables and their row counts
--pre-index flag Index unindexed files into the persistent store before querying files

Tables

mm sql auto-routes queries to the correct backend based on the FROM clause. Three tables are available:

Table Source Persistence
files Directory scan + SQLite persistent store Indexed entries survive across sessions; unindexed files show a diff
extractions SQLite (~/.local/share/mm/mm.db) Persistent — written by mm cat
chunks SQLite (~/.local/share/mm/mm.db) Persistent — written by mm cat

Queries referencing extractions or chunks in the FROM or JOIN clause go to persistent SQLite directly. All other queries hit the files table.

# see what's in each table
mm sql --list-tables

files table

The files table combines a live directory scan with the persistent metadata index. Rows from the SQLite store are loaded into an in-memory SQLite table and the user's query runs over them.

If files exist on disk but have not been indexed yet, mm sql shows a diff of unindexed files to stderr and prints a hint to re-run with --pre-index.

Columns

Column Type Description
uri TEXT Absolute file path (primary key)
name TEXT Filename with extension
stem TEXT Filename without extension
ext TEXT Extension including dot (.png, .pdf)
size INTEGER File size in bytes
modified REAL Last modification timestamp (Unix epoch)
created REAL Creation timestamp (Unix epoch)
mime TEXT MIME type inferred from extension
kind TEXT Semantic kind: image, video, audio, document, code, data, config, text, other
is_binary INTEGER 1 if file is binary content
depth INTEGER Directory depth from scan root (0 = top-level)
parent TEXT Parent directory name
width INTEGER Pixel width (images / video). Null for non-media.
height INTEGER Pixel height. Null for non-media.
content_hash TEXT xxh3 content hash (hex). Null until indexed.
text_preview TEXT Locally-extracted text content. Null until indexed.
line_count INTEGER Line count for text/code files. Null for binary.
word_count INTEGER Word count for text/code files.
language TEXT Detected programming/natural language.
dimensions TEXT WxH string (e.g. 1920x1080).
pages INTEGER Page count for documents.
duration_s REAL Duration in seconds (audio/video).
fps REAL Frames per second (video).
magic_mime TEXT Content-inspected MIME type when it differs from mime.
exif_camera TEXT Camera make and model.
exif_date TEXT Date/time original from EXIF.
exif_gps TEXT GPS coordinates.
exif_orientation TEXT EXIF orientation tag.
video_codec TEXT Video codec (e.g. h264, av1).
audio_codec TEXT Audio codec (e.g. aac, opus).
has_audio INTEGER 1 if video has an audio track.
phash TEXT Perceptual hash (16-digit hex) for images.
indexed_at INTEGER Unix timestamp when the file was last indexed.
content_indexed_at INTEGER Unix timestamp when content extraction ran.

extractions table

Stores the output of mm cat runs — one row per unique (file, profile, model, mode) combination.

Column Type Description
id TEXT Extraction ID (primary key)
file_uri TEXT Absolute path of the source file
content_hash TEXT xxh3 hash of the file at extraction time
profile TEXT Profile name used for the LLM call
model TEXT Model used for extraction
mode TEXT fast or accurate
detail INTEGER Detail flag (reserved)
extra TEXT Override fingerprint (pipeline/flag hash)
summary TEXT The extraction output text
metadata TEXT JSON metadata (e.g. verbose timing suffix)
created_at INTEGER Unix timestamp of when the extraction was stored

chunks table

Stores chunked content for vector search — one row per chunk. modemetadata, fast, accurate.

Column Type Description
id INTEGER Auto-incrementing primary key
extraction_id TEXT Foreign key → extractions.id (nullable for metadata-tier chunks)
file_uri TEXT Absolute path of the source file
content_hash TEXT xxh3 hash of the file
profile TEXT Profile name
model TEXT Model name
mode TEXT metadata, fast, or accurate
chunk_idx INTEGER Chunk sequence index within the file
chunk_text TEXT Chunk content
created_at INTEGER Unix timestamp

Examples

Discovery and inventory

# count files by kind
mm sql "SELECT kind, COUNT(*) as n FROM files GROUP BY kind ORDER BY n DESC"

# file count and total size by extension
mm sql "SELECT ext, COUNT(*) as n, SUM(size) as total_bytes FROM files GROUP BY ext ORDER BY total_bytes DESC"

# total storage used per kind
mm sql "SELECT kind, SUM(size) as bytes, COUNT(*) as files FROM files GROUP BY kind ORDER BY bytes DESC"

# top-level directory breakdown (depth=0)
mm sql "SELECT parent, COUNT(*) as files, SUM(size) as bytes FROM files WHERE depth=0 GROUP BY parent"

# files modified in the last 7 days (epoch: now - 7*86400)
mm sql "SELECT name, kind, modified FROM files WHERE modified > strftime('%s','now','-7 days') ORDER BY modified DESC"

# duplicate detection: files sharing a content_hash
mm sql "SELECT content_hash, COUNT(*) as n, GROUP_CONCAT(name, ', ') as names FROM files WHERE content_hash IS NOT NULL GROUP BY content_hash HAVING n > 1"

# largest 20 files across all kinds
mm sql "SELECT name, kind, size FROM files ORDER BY size DESC LIMIT 20"

# all files in a specific directory
mm sql "SELECT name, kind, size FROM files WHERE kind='image'" --dir ~/photos

Image queries

# largest images
mm sql "SELECT name, size FROM files WHERE kind='image' ORDER BY size DESC LIMIT 10"

# images with GPS coordinates
mm sql "SELECT name, exif_gps, exif_date FROM files WHERE exif_gps IS NOT NULL"

# images from a specific camera
mm sql "SELECT name, exif_camera, exif_date FROM files WHERE exif_camera LIKE '%iPhone%'"

# images taken on a specific date
mm sql "SELECT name, exif_date FROM files WHERE exif_date LIKE '2024-03%' ORDER BY exif_date"

# images by resolution (widest first)
mm sql "SELECT name, width, height, width*height as pixels FROM files WHERE kind='image' ORDER BY pixels DESC LIMIT 10"

# images missing EXIF data
mm sql "SELECT name FROM files WHERE kind='image' AND exif_date IS NULL"

# potentially misnamed files (magic_mime differs from mime)
mm sql "SELECT name, mime, magic_mime FROM files WHERE magic_mime IS NOT NULL AND magic_mime != mime"

# images by perceptual hash (for duplicate grouping)
mm sql "SELECT phash, COUNT(*) as n, GROUP_CONCAT(name, ', ') as names FROM files WHERE phash IS NOT NULL GROUP BY phash HAVING n > 1"

Video and audio queries

# all videos with duration and codec
mm sql "SELECT name, duration_s, fps, video_codec, audio_codec FROM files WHERE kind='video'"

# longest videos
mm sql "SELECT name, ROUND(duration_s/60, 1) as minutes, video_codec FROM files WHERE kind='video' ORDER BY duration_s DESC"

# videos without audio track
mm sql "SELECT name, duration_s FROM files WHERE kind='video' AND has_audio=0"

# videos by codec
mm sql "SELECT video_codec, COUNT(*) as n FROM files WHERE kind='video' GROUP BY video_codec"

# audio files longer than 5 minutes
mm sql "SELECT name, ROUND(duration_s/60, 1) as minutes FROM files WHERE kind='audio' AND duration_s > 300 ORDER BY duration_s DESC"

# total media duration
mm sql "SELECT kind, ROUND(SUM(duration_s)/3600, 2) as hours FROM files WHERE kind IN ('video','audio') GROUP BY kind"

Document queries

# PDFs sorted by page count
mm sql "SELECT name, pages, size FROM files WHERE ext='.pdf' ORDER BY pages DESC"

# documents without extracted text (unindexed)
mm sql "SELECT name, kind FROM files WHERE kind='document' AND text_preview IS NULL"

# average page count by document kind
mm sql "SELECT ext, AVG(pages) as avg_pages, COUNT(*) as n FROM files WHERE pages IS NOT NULL GROUP BY ext"

Code and text queries

# code files by language
mm sql "SELECT language, COUNT(*) as files, SUM(line_count) as total_lines FROM files WHERE kind='code' GROUP BY language ORDER BY total_lines DESC"

# largest source files
mm sql "SELECT name, language, line_count, word_count FROM files WHERE kind='code' ORDER BY line_count DESC LIMIT 20"

# files with unusually high line counts (> 1000 lines)
mm sql "SELECT name, line_count FROM files WHERE line_count > 1000 ORDER BY line_count DESC"

# word count across all text files
mm sql "SELECT SUM(word_count) as total_words, COUNT(*) as files FROM files WHERE kind IN ('text','code')"

extractions table

# all cached extractions
mm sql "SELECT file_uri, mode, model, SUBSTR(summary,1,120) as preview FROM extractions LIMIT 20"

# extractions for accurate mode only
mm sql "SELECT file_uri, model, summary FROM extractions WHERE mode='accurate'"

# extractions by profile and model
mm sql "SELECT profile, model, COUNT(*) as n FROM extractions GROUP BY profile, model ORDER BY n DESC"

# most recently cached files
mm sql "SELECT file_uri, mode, datetime(created_at,'unixepoch') as ts FROM extractions ORDER BY created_at DESC LIMIT 10"

# full-text search in extraction summaries
mm sql "SELECT file_uri, SUBSTR(summary,1,200) FROM extractions WHERE summary LIKE '%attention mechanism%'"

# extraction size distribution
mm sql "SELECT mode, AVG(LENGTH(summary)) as avg_chars, MAX(LENGTH(summary)) as max_chars FROM extractions GROUP BY mode"

# files extracted more than once (multiple modes or models)
mm sql "SELECT file_uri, COUNT(*) as n FROM extractions GROUP BY file_uri HAVING n > 1 ORDER BY n DESC"

# cache coverage: how many indexed files have an extraction
mm sql "SELECT e.mode, COUNT(DISTINCT e.file_uri) as extracted FROM extractions e GROUP BY e.mode"

chunks table

# chunks by content tier
mm sql "SELECT mode, COUNT(*) as n, AVG(LENGTH(chunk_text)) as avg_chars FROM chunks GROUP BY mode"

# chunks for a specific file
mm sql "SELECT chunk_idx, SUBSTR(chunk_text,1,200) FROM chunks WHERE file_uri LIKE '%paper.pdf%' ORDER BY chunk_idx"

# files with the most chunks
mm sql "SELECT file_uri, COUNT(*) as chunks FROM chunks GROUP BY file_uri ORDER BY chunks DESC LIMIT 10"

# long chunks (may indicate unsplit content)
mm sql "SELECT file_uri, chunk_idx, LENGTH(chunk_text) as chars FROM chunks WHERE LENGTH(chunk_text) > 2000 ORDER BY chars DESC"

# indexed coverage: distinct files in chunks vs extractions
mm sql "SELECT 'chunks' as tbl, COUNT(DISTINCT file_uri) as files FROM chunks UNION ALL SELECT 'extractions', COUNT(DISTINCT file_uri) FROM extractions"

Cross-table queries

# files that have been extracted
mm sql "SELECT f.name, f.kind, e.mode, e.model FROM files f JOIN extractions e ON f.uri = e.file_uri"

# files not yet extracted (no entry in extractions)
mm sql "SELECT f.name, f.kind FROM files f LEFT JOIN extractions e ON f.uri = e.file_uri WHERE e.file_uri IS NULL"

# extraction summary alongside file size
mm sql "SELECT f.name, f.size, e.mode, SUBSTR(e.summary,1,100) as preview FROM files f JOIN extractions e ON f.uri = e.file_uri ORDER BY f.size DESC LIMIT 10"

# chunk count per file plus its kind
mm sql "SELECT f.name, f.kind, COUNT(c.id) as chunks FROM files f JOIN chunks c ON f.uri = c.file_uri GROUP BY f.uri ORDER BY chunks DESC"

Indexing and schema introspection

# index files first, then query
mm sql "SELECT kind, COUNT(*) as n FROM files GROUP BY kind" --dir ~/data --pre-index

# inspect available PRAGMA information
mm sql "PRAGMA table_info(files)"
mm sql "PRAGMA table_info(extractions)"
mm sql "PRAGMA table_info(chunks)"

# list tables and row counts
mm sql --list-tables

Machine-readable output

# JSON
mm sql "SELECT name, size FROM files WHERE kind='image'" --format json

# TSV for scripting
mm sql "SELECT kind, COUNT(*) as n FROM files GROUP BY kind" --format tsv

# CSV for spreadsheet import
mm sql "SELECT name, kind, size FROM files ORDER BY size DESC" --format csv

# pipe into jq
mm sql "SELECT name, size FROM files ORDER BY size DESC LIMIT 5" --format json | jq '.[] | .name'

# pipe into awk
mm sql "SELECT name, size FROM files WHERE kind='video'" --format tsv | awk -F'\t' 'NR>1 {sum+=$2} END {print sum " bytes"}'

Table routing

mm sql inspects the query's FROM and JOIN clauses to determine routing:

  • If any clause references extractions, chunks, or chunks_vec → persistent SQLite direct
  • PRAGMA queries → persistent SQLite direct
  • Everything else → scanned files table (in-memory SQLite over the persistent index)

Unindexed files diff

When querying the files table without --pre-index, mm sql shows a diff of files that exist on disk but have not been indexed yet. Up to 5 unindexed paths are shown; the remainder are summarized with a count. A re-run hint is printed to stdout:

To include these files, run:
  mm sql "SELECT ..." --dir ~/data --pre-index

Notes

  • The files table in mm sql is sourced from the persistent store (SQLite), not a fresh live scan. Files are indexed into the persistent store by mm cat, mm peek, or mm sql --pre-index. Use mm find for a live scan with no dependency on the index.
  • Rich output trims columns to a readable subset (name, kind, ext, size, parent, mime, width, height, modified, depth) when the full column set would be too wide. Use --format tsv or --format json to see all columns.
  • extractions.summary is the full LLM-generated text. It can be long — use SUBSTR(summary, 1, 200) to truncate in queries.
  • chunks.mode = 'metadata' rows contain locally-extracted metadata (text preview, EXIF, etc.) without any LLM call. fast and accurate chunks correspond to pipeline extraction outputs.