An MPI-friendly transcription workflow designed for digital librarians, archivists, curators, and developers who manage large audiovisual collections. The toolkit builds on https://github.com/BreuerLabs/AI-SummarizeVid
- Parallel transcription: Scan large audio/video collections (optional recursion) and distribute work across MPI ranks.
- Optional audio normalization keeps legacy formats consistent for Whisper ingestion.
- Parallel transcription outputs in `txt`, `json`, `tsv`, or `srt`.
- Keyframe extraction + GPT-Vision produces storyboard descriptions of visual content.
- Summaries, tags, & accessibility notes provide layered AI metadata (text + visuals + narration cues).
- Collection analytics: generate research briefings, quality reports, clustering insights, and IIIF manifests.
- Discovery outputs: export catalog-ready CSV, SQLite FTS index, HTML preview dashboard.
- Workflow automation: `run_pipeline.py` sequences steps with provenance logs; `configure_tool.py` bootstraps configs interactively.
- Install prerequisites

  ```bash
  brew install ffmpeg open-mpi   # macOS example
  python3 -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Prepare a configuration

  ```bash
  cp config-example.yaml config.yaml
  # edit config.yaml so the paths match your collection and desired outputs
  ```

- Launch a transcription run

  ```bash
  mpirun -np 8 python transcribe_collection.py --config config.yaml
  ```

- (Optional) Extract keyframes

  ```bash
  mpirun -np 8 python extract_keyframes.py --config config.yaml
  ```

- (Optional) Describe frames with GPT-Vision

  ```bash
  mpirun -np 8 python describe_frames.py --config config.yaml
  ```

- (Optional) Launch summarization

  ```bash
  mpirun -np 8 python summarize_collection.py --config config.yaml
  ```

  Make sure your OpenAI API key is set (see `summarization.api_key_env` in the config).

- Layer in metadata, discovery, and exports

  ```bash
  mpirun -np 8 python generate_tags.py --config config.yaml
  mpirun -np 8 python generate_accessibility.py --config config.yaml
  python collection_report.py --config config.yaml
  python build_preview.py --config config.yaml
  python quality_metrics.py --config config.yaml
  python build_iiif_manifest.py --config config.yaml
  python export_catalog.py --config config.yaml
  python build_search_index.py --config config.yaml
  python cluster_visuals.py --config config.yaml
  ```

- Orchestrate everything

  ```bash
  python run_pipeline.py --config config.yaml
  ```
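Under the hood, the orchestrator simply runs the configured steps in sequence and records provenance for each one. The snippet below is a simplified sketch of that pattern, not the script itself; the hard-coded step list and the `provenance.json` filename are illustrative placeholders.

```python
# Simplified sketch of run_pipeline.py's pattern: run each step in order and
# record a provenance entry (command, timing, exit code) per step.
import json
import subprocess
import time
from pathlib import Path

STEPS = [  # hypothetical step list; the real order comes from the `workflow` config section
    ["mpirun", "-np", "8", "python", "transcribe_collection.py", "--config", "config.yaml"],
    ["python", "collection_report.py", "--config", "config.yaml"],
]

provenance = []
for cmd in STEPS:
    started = time.time()
    result = subprocess.run(cmd)
    provenance.append({
        "command": " ".join(cmd),
        "elapsed_s": round(time.time() - started, 1),
        "returncode": result.returncode,
    })

# write a simple provenance log next to the outputs (filename is a placeholder)
Path("provenance.json").write_text(json.dumps(provenance, indent=2))
```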
The YAML file controls several areas:

| Section | Purpose |
|---|---|
| `input` | Where to find media, which extensions to include, and whether to search subdirectories. |
| `preprocessing` | Optional audio extraction/normalization details. Disable if your media is already Whisper-ready. |
| `transcription` | Whisper model choice, device, language hints, and decoding parameters. |
| `outputs` | Top-level folder, output formats, and writer options shared across formats. |
| `keyframes` | Optional still-frame extraction settings (modes, intervals, segment parsing, output layout). |
| `frame_descriptions` | Optional GPT-Vision settings to describe frames, with transcript and metadata context. |
| `summarization` | GPT summarization settings, including use of frame descriptions. |
| `tagging` | GPT-based entity and topic tagging of transcripts. |
| `collection_reports` | Collection-wide synthesis reports. |
| `accessibility` | Audio-description narration cues based on transcripts/frames. |
| `preview_dashboard` | Static HTML preview builder. |
| `quality_control` | Heuristic metrics and flags for QA triage. |
| `iiif` | IIIF manifest generation parameters. |
| `catalog_export` | CSV export options for library systems. |
| `search_index` | SQLite full-text index settings. |
| `clustering` | Visual theme clustering (text embeddings + k-means). |
| `workflow` | Ordered pipeline steps for `run_pipeline.py`. |
| `logging` | Verbosity controls and how often to print progress. |

See `config-example.yaml` for inline documentation of each field.
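Before a long MPI run, it can be worth confirming that the config parses and contains the sections you expect. A minimal sanity check, assuming PyYAML is installed; the `expected` list below is just an example, not a required set.

```python
# Quick, illustrative sanity check: does config.yaml parse and contain the
# sections we plan to use? Section names follow the table above.
import yaml

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

expected = ["input", "transcription", "outputs", "workflow"]  # example subset
missing = [name for name in expected if name not in config]
if missing:
    raise SystemExit(f"config.yaml is missing sections: {', '.join(missing)}")
print("Sections present:", ", ".join(sorted(config)))
```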
The repository includes `config-gold-standard.yaml`, a preset that mirrors the original four-stage AI-SummarizeVid workflow (Whisper transcripts, speech + interval keyframes, GPT-Vision descriptions, and 50-word GPT summaries). To use it:

- Copy the file to `config.yaml` (or pass it directly via `--config`), then edit `input.media_root` so it points at your collection. Update `metadata_csv` entries if you want metadata-aware prompts.
- Set `OPENAI_API_KEY` in your environment before running frame-description or summarization steps.
- Launch the end-to-end run:

  ```bash
  mpirun -np 8 python run_pipeline.py --config config-gold-standard.yaml
  ```
The preset writes transcripts, keyframes, GPT frame descriptions, and GPT summaries into `outputs/transcripts_gold`, `outputs/keyframes_gold`, `outputs/frame_descriptions_gold`, and `outputs/summaries_gold`, and it enforces 3-second interval sampling (capped at 60 frames) with the published prompt language to ensure behavioral parity.
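For reference, the sampling rule works out as simple arithmetic; the helper below is illustrative only (the function name and the assumption that sampling starts at t = 0 are not taken from the tool).

```python
# Illustrative arithmetic: how the 3-second interval and 60-frame cap interact.
def interval_timestamps(duration_s: float, interval_s: float = 3.0, max_frames: int = 60) -> list[float]:
    count = min(int(duration_s // interval_s) + 1, max_frames)
    return [i * interval_s for i in range(count)]

print(len(interval_timestamps(90)))    # 90-second clip -> 31 frames
print(len(interval_timestamps(1800)))  # 30-minute recording -> capped at 60 frames
```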
ffmpeg preprocessing is helpful when collections contain a patchwork of legacy formats—normalizing sample rate, channel layout, and codecs improves transcription consistency and avoids Whisper’s fallback re-encoding. If your collection is already stored as modern MP4/MKV with AAC stereo audio, you can disable preprocessing to skip that extra I/O.
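As an illustration, the normalization amounts to an ffmpeg call along these lines; the exact parameters the toolkit applies come from the `preprocessing` section, and the file paths here are placeholders.

```python
# A minimal sketch of the kind of normalization the preprocessing step performs:
# convert any input ffmpeg can read to 16 kHz mono 16-bit PCM WAV.
import subprocess
from pathlib import Path

def normalize_audio(src: Path, dst: Path) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",       # overwrite existing output
            "-i", str(src),       # any audio/video container ffmpeg can read
            "-vn",                # drop the video stream
            "-ar", "16000",       # resample to 16 kHz
            "-ac", "1",           # downmix to mono
            "-c:a", "pcm_s16le",  # 16-bit PCM WAV
            str(dst),
        ],
        check=True,
    )

# hypothetical paths, for illustration only
normalize_audio(Path("media/interview_001.mov"), Path("work/interview_001.wav"))
```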
- Enable the `keyframes` section to sample stills at regular intervals, per speech segment, or both. Outputs live under `keyframes/<mode>/...`.
- Turn on `frame_descriptions` to send each still to a vision-capable GPT model (e.g., `gpt-4o`). Prompts can include transcript excerpts and metadata for richer, neutral descriptions; a minimal request sketch follows this list.
- Descriptions mirror the keyframe directory tree, allowing easy correlation between images and text.
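The per-frame request follows the standard OpenAI vision pattern. A minimal sketch, assuming the OpenAI Python SDK (1.x) and `OPENAI_API_KEY` in the environment; the prompt text, model choice, and file path are illustrative rather than the tool's exact values.

```python
# Illustrative GPT-Vision frame-description request (not the tool's exact prompt).
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
frame = Path("keyframes/interval/interview_001/frame_0003.jpg")  # hypothetical path
image_b64 = base64.b64encode(frame.read_bytes()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this still frame in neutral, archival language."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```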
- Configure `summarization` to feed transcripts (and optionally frame descriptions) into concise GPT outputs.
- Enable `tagging`, `collection_reports`, `accessibility`, and `quality_control` to produce structured metadata, narrations, and QA dashboards.
- `build_preview.py` assembles a static HTML gallery (with optional custom CSS) for quick inspection.
- `build_iiif_manifest.py` and `export_catalog.py` prepare assets for IIIF viewers and standard catalog systems.
- `build_search_index.py` creates a SQLite FTS database that you can query with SQL or wrap in a simple API (see the query sketch after this list).
- `cluster_visuals.py` uses OpenAI embeddings + k-means to group similar frame descriptions, helping surface recurring visuals.
- `run_pipeline.py` ties it all together with provenance logging; `configure_tool.py` lets new users bootstrap configs interactively.
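Once `build_search_index.py` has run, the resulting database can be queried with nothing more than the standard library. A sketch, assuming an FTS5 table; the database path, table name, and column layout below are assumptions, so check the `search_index` settings for the real names.

```python
# Illustrative full-text query against the generated SQLite index.
import sqlite3

conn = sqlite3.connect("outputs/search_index.db")  # hypothetical path
cursor = conn.execute(
    # assumed FTS5 table "transcripts" with columns (path, text)
    "SELECT path, snippet(transcripts, 1, '[', ']', '...', 8) "
    "FROM transcripts WHERE transcripts MATCH ? LIMIT 10",
    ("oral history",),
)
for path, snippet in cursor:
    print(path, snippet)
conn.close()
```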
The script creates (and reuses) subdirectories inside `outputs.base_dir`, one per requested format:

```
transcripts/
  txt/
  srt/
  json/
  tsv/
```
Filenames mirror the relative path of each media asset with directory separators replaced by `__`. This keeps outputs unique, even when the source collection contains identical filenames in different folders; the sketch below shows the flattening rule.
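The rule is easy to reproduce when you need to map an output file back to its source (or vice versa). A small sketch; the function name and example paths are illustrative.

```python
# Illustrative flattening rule: mirror the path relative to the collection root,
# joining directory components with "__".
from pathlib import Path

def flattened_name(media_path: Path, media_root: Path, suffix: str) -> str:
    relative = media_path.relative_to(media_root).with_suffix("")
    return "__".join(relative.parts) + suffix

print(flattened_name(Path("/data/oralhistory/tape_02/interview.mp4"),
                     Path("/data"), ".txt"))
# -> oralhistory__tape_02__interview.txt
```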
- MPI scaling is roughly linear up to saturated disk or network throughput; use `np.array_split` across ranks to balance workloads (see the sketch after this list).
- Use `mpirun -np <N> ...` on a single machine for light collections, or distribute across cluster nodes if your environment provides a shared filesystem.
- Whisper models are GPU-accelerated when `device` is set to `cuda` and a compatible GPU is available; otherwise they run on CPU.
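The division of labor across ranks looks roughly like this, assuming `mpi4py` and NumPy are installed; the media root and extension filter are placeholders, since the real scripts take them from `config.yaml`.

```python
# Illustrative work distribution: rank 0 scans the collection, splits the file
# list into one chunk per rank with np.array_split, and scatters the chunks.
from pathlib import Path

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    media_files = sorted(str(p) for p in Path("media").rglob("*.mp4"))  # hypothetical root/filter
    chunks = [chunk.tolist() for chunk in np.array_split(media_files, size)]
else:
    chunks = None

my_files = comm.scatter(chunks, root=0)  # each rank receives one chunk
print(f"rank {rank}: {len(my_files)} files to transcribe")
```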
- Swap Whisper model sizes (`base`, `small`, `medium`, `large-v3`) in the config to balance quality and runtime.
- Feed the generated transcripts into your own discovery interfaces or cataloging systems.
- Enable the gold-standard preset when you want the full storyboard + summary flow from the published AI-SummarizeVid workflow.